Hi. I need a search pattern and method for detecting whether a string contains a specific language.
I found this thread, https://forum.xojo.com/35465-check-if-string-is-cyrillic/0, which was very helpful. Kem’s solution below works for detecting whether the text is entirely in one of the specified scripts, but not when the text mixes, say, a Latin-based language with Arabic:
Hi Kem. So for example, a line of text might be: “Western Persian is known as Parsi (???)”.
So I need to know whether that text contains any Arabic. It doesn’t matter that it also contains Latin (as it does); I just need to know whether it contains any Arabic, which it does, inside the brackets. The brackets are just there for clarity.
But in a working scenario, I need the RegEx to check whether a line of text contains any specified complex text forms such as Arabic, Hebrew, Cyrillic etc as in your original example above…but modified for what I need.
It sounds like you just need to match languages, so you can use alternation, something like this:
\p{Arabic}{2,}|\p{Latin}{2,}
This will match any run of two or more consecutive characters in one of those scripts. You can add as many scripts as you’d like, as long as you keep that form.
(If you have RegExRX, you can see a list of the various languages under the Insert menu next to the “Search Pattern” label. Go to Unicode Scripts & Properties -> Scripts.)
Notice how simple the pattern is. Generally, patterns should match the bare minimum of what you need, so your case (if I understood it) was easy. We don’t care what else is in the text, only the languages you’re looking for.
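For the "does this line contain any Arabic, Hebrew, or Cyrillic at all?" case, the same idea might be wrapped up roughly as follows. This is only a sketch: the function name ContainsComplexScript is made up for illustration, the list of scripts is an assumption about what you want to detect, and it assumes the string is UTF-8 encoded. It drops the {2,} quantifier because a single character is enough to show that the script is present.

```xojo
// Hypothetical helper (name and script list are assumptions, not from the thread).
// Returns True if s contains at least one character from any listed script.
Function ContainsComplexScript(s As String) As Boolean
  Dim rx As New RegEx
  // Alternation of Unicode script properties; add or remove scripts as needed.
  rx.SearchPattern = "\p{Arabic}|\p{Hebrew}|\p{Cyrillic}"
  Dim match As RegExMatch = rx.Search(s)
  // Search returns Nil when nothing matches, and Nil IsA RegExMatch is False.
  Return match IsA RegExMatch
End Function
```

Since only the first match matters here, the function never has to walk through the rest of the text.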
It shouldn’t make a difference for the first match. However, if you try to get every match, Xojo’s RegEx implementation slows down. The MBS RegEx engine is much better, if you have it.
To find every match, you would use code like this:
dim rx as new RegEx
rx.SearchPattern = pattern
dim match as RegExMatch = rx.Search( s )
while match isa RegExMatch
  //
  // Do something with the match
  // then get the next one with ...
  //
  match = rx.Search
wend
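As one possible way to fill in the "do something" part, the loop could collect each matched run into an array. This is a sketch under a few assumptions: s is the text being scanned, found is a made-up variable name, and Append is the classic array method (newer Xojo versions use Add instead).

```xojo
Dim rx As New RegEx
rx.SearchPattern = "\p{Arabic}{2,}|\p{Latin}{2,}"

Dim found() As String // hypothetical name; holds every matched run
Dim match As RegExMatch = rx.Search(s) // s is assumed to be the text to scan
While match IsA RegExMatch
  // SubExpressionString(0) is the text of the whole match
  found.Append match.SubExpressionString(0)
  // Calling Search with no argument continues after the previous match
  match = rx.Search
Wend
```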
Here, probably not. But I generally avoid comparing class instances to nil because the comparison will invoke the class’s Operator_Compare method if it has one, and that’s not typically what I want. I will use either if o is nil ... or if o isa object ..., the latter meaning “not nil”.
Hi again. A quick follow-up regarding the pattern below:
\p{Latin}{2,}
I assume that pattern would also return a match if a space, an end of line, or punctuation is found … correct? If so, how can I modify the pattern to exclude spaces, ends of lines, and punctuation so that it only returns a match if it finds a letter?
Is there a RegEx pattern to see if a string contains any uppercase characters?