RegEx pattern to find every character of each language

Hi,

I would love to find every word within a string. I could use \w*, but this will only match A-Za-z0-9 (plus the underscore). How do I find words in another language, such as Russian or Japanese?

Greetings

First define how you recognize a “word”, as \w won’t pick up some “words” if it only matches A-Za-z0-9.

WON’T won’t be a word, because the apostrophe is not a word character.

Nor would re-examine or any other hyphenated word.
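To illustrate the point above, here is a minimal sketch in Python (not Xojo; the sample text and variable names are made up for the example). It compares an ASCII-only \w+ with a pattern that tolerates one apostrophe or hyphen between runs of word characters. Whether that is the right definition of a “word” is still up to you.

```python
import re

text = "WON'T re-examine this"

# ASCII-only \w+ breaks words at the apostrophe and at the hyphen.
plain = re.findall(r"\w+", text, flags=re.ASCII)

# Allow a single apostrophe (straight or curly) or hyphen between
# runs of word characters, so contractions and compounds stay whole.
joined = re.findall(r"\w+(?:['’-]\w+)*", text, flags=re.ASCII)

print(plain)   # ['WON', 'T', 're', 'examine', 'this']
print(joined)  # ["WON'T", 're-examine', 'this']
```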

I meant that A-Za-z0-9 will only match words with Latin characters.

It should split

???? ?????? ???????????

in the same way as this. Every time, the string is built from 3 substrings. I would love to extract every single one.

Pjotr Iljitsch Tschaikowski Pëtr Il'ič Čajkovskij

Well, you could define that all non-ASCII characters are okay.
So you would split by whitespace and keep everything else.
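A rough sketch of that whitespace approach, in Python rather than Xojo. The Cyrillic sample string is an assumption for the example, standing in for a name made of 3 substrings:

```python
# Assumed sample: three substrings separated by single spaces.
name = "Пётр Ильич Чайковский"

# str.split() with no argument splits on any run of whitespace,
# so the script of the characters does not matter at all.
parts = name.split()

print(len(parts))  # 3
for part in parts:
    print(part)
```

This works as long as whitespace really is the only separator in the input.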

OK, found out: my test app, RegEx Knife (iOS), won’t match these, but other apps do. So \w* works fine. I hope Xojo’s RegEx does too :slight_smile:

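PCRE has tokens to recognize Unicode characters, for example \p{L} for any letter and \p{N} for any number, provided it is built with Unicode property support.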
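For comparison, a small Python sketch of the same idea. Python’s built-in re module does not support PCRE-style \p{L}, so this uses the fact that \w is Unicode-aware for text strings, and approximates “letters only” with [^\W\d_]. The sample text is made up, and Xojo’s own behaviour may differ.

```python
import re

text = "Iljitsch Ильич 東京 2024 snake_case"

# For str patterns in Python 3, \w already covers letters and digits
# from any script, plus the underscore.
words = re.findall(r"\w+", text)

# [^\W\d_] means: a word character that is neither a digit nor an
# underscore, which is a rough stand-in for "any letter".
letters_only = re.findall(r"[^\W\d_]+", text)

print(words)         # ['Iljitsch', 'Ильич', '東京', '2024', 'snake_case']
print(letters_only)  # ['Iljitsch', 'Ильич', '東京', 'snake', 'case']
```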

[quote=245407:@Christian Schmitz]well, you could define that all non ascii characters are okay.
So you would split by whitespace and take all others.[/quote]
That too would be far too simplistic, though.
A sentence MAY have whitespace around punctuation, and a picky editor might require it, but it’s not required to do so.That would mean that in this text I’m writing, “so.That” would be one word - clearly incorrect.
Yet readers can discern what is and is not a word quite readily.
Sadly, human writing is quite complex, at least English is, and the rules required to recognize what is and is not a “word” are commensurately complicated.
And whitespace isn’t always a delimiter :slight_smile: “fleur de lis” is a nice example of something that is often treated as “one word”.
You will often see it as fleur-de-lis. But here we may even see it written as fleur de lis - yet still “one word”.
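The “so.That” problem is easy to reproduce. The sketch below (Python, with a made-up sample sentence) shows a plain whitespace split next to a letter-based pattern that keeps internal apostrophes and hyphens. Note that neither approach can know that “fleur de lis” is meant as one word; that would need a dictionary or some other language-specific rule.

```python
import re

sentence = "it's not required to do so.That would mean trouble"

# Whitespace split: the missing space after the full stop glues
# two words together.
by_space = sentence.split()

# Runs of letters from any script, optionally joined by a single
# apostrophe or hyphen; all other punctuation acts as a separator.
word = r"[^\W\d_]+(?:['’-][^\W\d_]+)*"
by_regex = re.findall(word, sentence)

print("so.That" in by_space)  # True
print(by_regex)
```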

Gotta love English - it will steal anything from anyone and then use it as if it had owned it forever.

What do you want to accomplish, and how good does the result have to be? Languages are a very complicated topic.

Languages are almost as bad as talking about dates & time :stuck_out_tongue: