RegEx Question

Hi. I need a search pattern and method for detecting whether a string contains a specific language.

I found this thread, https://forum.xojo.com/35465-check-if-string-is-cyrillic/0, which was very helpful. Kem’s solution below works for detecting whether the text is entirely in one of the specified scripts, but not when the text mixes, say, a Latin-based language with Arabic:

\A(?:(\PL+)|([\p{Cyrillic}\PL]+)|([\p{Latin}\PL]+)|([\p{Greek}\PL]+)|([\p{Hebrew}\PL]+)|([\p{Syriac}\PL]+)|([\p{Bengali}\PL]+))\z

So to be clear: I need to know if a line of text contains Arabic or Cyrillic (for example) but may also contain Latin.

Any help would be greatly appreciated.

So the text may contain a mixture of languages, but must only be a mixture of those languages?

An example would help, btw.

Hi Kem. So for example, a line of text might be: “Western Persian is known as Parsi (???)”.

So I need to know whether or not that text contains any Arabic. It doesn’t matter if it also contains Latin (as it does); I just need to know whether it contains any Arabic, which it does, in the brackets. The brackets are just there for clarity.

But in a working scenario, I need the RegEx to check whether a line of text contains any specified complex text forms such as Arabic, Hebrew, Cyrillic etc as in your original example above…but modified for what I need.

Does that make more sense?

It sounds like you just need to match languages, so you can use an alternation, something like this:

\p{Arabic}{2,}|\p{Latin}{2,}

This will match any text of that language as long as there are two or more consecutive letters. You can add as many languages to that as you’d like as long as you keep that form.

(If you have RegExRX, you can see a list of the various languages under the Insert menu next to the “Search Pattern” label. Go to Unicode Scripts & Properties -> Scripts.)

Notice how simple the pattern is. Generally, patterns should match the bare minimum of what you need, so your case (if I understood it) was easy. We don’t care what else is in the text, only the languages you’re looking for.
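
For anyone who wants this as a yes/no test, here is a minimal sketch built on the same alternation. The function name ContainsAnyScript and its parameters are made up for illustration, and it assumes every name passed in is a script name the regex engine recognizes (Arabic, Hebrew, Cyrillic, and so on).

```xojo
Function ContainsAnyScript(s As String, scripts() As String) As Boolean
  // Hypothetical helper, not part of Xojo: returns True when s contains
  // two or more consecutive letters from any of the named scripts.
  If scripts.Ubound = -1 Then Return False // nothing to look for

  // Build an alternation such as \p{Arabic}{2,}|\p{Cyrillic}{2,}
  Dim parts() As String
  For Each scriptName As String In scripts
    parts.Append "\p{" + scriptName + "}{2,}"
  Next

  Dim rx As New RegEx
  rx.SearchPattern = Join(parts, "|")

  // Search returns Nil when nothing matches
  Return rx.Search(s) IsA RegExMatch
End Function
```

You would call it with something like ContainsAnyScript(lineOfText, Array("Arabic", "Cyrillic")). List only the scripts you want to detect: if Latin is left out, a line that is purely Latin returns False, while a line that mixes Latin and Arabic returns True.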

Thanks, Kem! That seems to work and you’ve made it so simple too.

Quick follow-ups:

  1. Will that work for multiline also?
  2. I’m not sure how RegEx works in terms of speed but would it make a difference if you passed it a line of text or 100,000 words of text… for example?
  3. Does it return on the first match found and stop checking, or does it process the entire string passed to it and return all the matches?

  1. Yes.
  2. It shouldn’t make a difference for the first match. However, if you try to get every match, Xojo’s RegEx implementation slows down. The MBS RegEx engine is much better, if you have it.
  3. To find every match, you would use code like this:
dim rx as new RegEx
rx.SearchPattern = pattern

dim match as RegExMatch = rx.Search( s )
while match isa RegExMatch
  //
  // Do something with the match
  // then get the next one with ...
  //
  match = rx.Search
wend
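
As one possible way to fill in the "do something" part, this sketch collects the text of every match into an array. AllMatches is a hypothetical name, and it assumes the pattern never matches an empty string (otherwise the loop might not advance).

```xojo
Function AllMatches(s As String, pattern As String) As String()
  // Hypothetical helper: returns the text of every match of pattern in s.
  Dim results() As String

  Dim rx As New RegEx
  rx.SearchPattern = pattern

  Dim match As RegExMatch = rx.Search(s)
  While match IsA RegExMatch
    // SubExpressionString(0) is the full text of the current match
    results.Append match.SubExpressionString(0)
    // Calling Search with no argument continues after the previous match
    match = rx.Search
  Wend

  Return results
End Function
```

If you only need to know whether there is a match at all, a single call to rx.Search( s ) is enough and the loop can be skipped.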

Great. Thank you so much!

everyday’s regex lesson with Kem …
you should launch a youtube channel :wink:

you’re the Paul Seller’s master for regex …

I’ve always used: while match <> nil

Is there an advantage to using isa?

Here, probably not. But I generally avoid comparisons to nil with classes because it will invoke the class’s Operator_Compare method if it has one, and that’s not typically what I want. I will either use if o is nil ... or if o isa object ..., meaning, “not nil”.
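
A short sketch of the two forms side by side, using a RegExMatch as the object being tested. The search text and pattern are placeholders chosen only for illustration.

```xojo
Dim rx As New RegEx
rx.SearchPattern = "\p{Cyrillic}{2,}"

Dim match As RegExMatch = rx.Search("plain Latin text")

// "Is" compares the reference itself, so no Operator_Compare is involved
If match Is Nil Then
  // nothing was found
End If

// "IsA" is True only for a non-Nil instance, so it doubles as "not Nil"
If match IsA RegExMatch Then
  // safe to use the match here
End If
```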

[quote=364687:@Jean-Yves Pochez]everyday’s regex lesson with Kem …
you should launch a youtube channel :wink:

you’re the Paul Seller’s master for regex …[/quote]

Kem can call his YouTube channel “Learning Klingon with Kem” :stuck_out_tongue:

Hi again. A quick follow-up regarding pattern below:

\p{Latin}{2,}

  1. I assume that pattern would also return a match if a space, end of line, or punctuation is found … correct? If so, how can I modify the pattern to exclude spaces, ends of lines, and punctuation so that it only returns a match if it finds a letter?

  2. Is there a RegEx pattern to see if a string contains any uppercase characters?

  1. That’s already what the pattern does. You can check that in code or with a utility like RegExRX.

  2. Using Unicode tokens…

\p{Lu}

This will match any uppercase letter in a string and is useful really only as a Boolean, i.e., does it match or not?
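
Used as a Boolean, that could look like the sketch below. HasUppercase is a made-up name. Setting CaseSensitive is a defensive assumption on my part: as far as I know Xojo's RegEx defaults to case-insensitive matching, and turning that off keeps case handling from influencing how the uppercase property is matched.

```xojo
Function HasUppercase(s As String) As Boolean
  // Hypothetical helper: True when s contains at least one uppercase letter.
  Dim rx As New RegEx
  rx.Options.CaseSensitive = True // defensive; see the note above
  rx.SearchPattern = "\p{Lu}"

  // A non-Nil result means at least one uppercase letter was found
  Return rx.Search(s) IsA RegExMatch
End Function
```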

Great, thank you for clarifying and Happy New Year!