Help with unicode-friendly regex whole word searches

Jonathan_Ashwell · July 10, 2024, 12:08pm

My app has a function that changes the case of a string to Title Case (every word begins with a capital letter). It has a user-configurable stop list that contains words whose case is not to be changed (e.g. DNA). I use regex to search for the presence of such words in the string and ensure that they appear as in the stop list. A user has reported an edge case where mistakes can occur when a word has accented Unicode characters. I can see where the problem lies, but I can’t come up with a solution. I’m using a search pattern that I’m pretty sure @Kem_Tekinay helped me with:

"(?<!\w)\Q" + stopWord + "\E(?!\w)"

The intent is that only whole words will be found, and the characters between \Q and \E (the stop word) are treated as literals. The problem is that \w matches only ASCII word characters, so an accented letter such as é is treated as a non-word character. This is a real-life example that fails:

stopWord = CE
string = Gréce

The search identifies the “ce” after the é as a match, and the output becomes

GréCE

If the é is changed to e the output is correct (Grece).
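
For anyone who wants to reproduce this outside Xojo, here is a minimal sketch in Java, picked only because java.util.regex also understands \Q...\E and, by default, treats \w as ASCII-only. The class and method names are made up for illustration, and the case-insensitive flag is an assumption based on “CE” matching “ce”.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the app described above.
class AsciiWordBoundaryDemo {

    // Builds the pattern from the post: the stop word is quoted with \Q...\E
    // and must not have a \w character directly before or after it.
    static Pattern wholeWord(String stopWord) {
        // Assumption: the search ignores case, since "CE" matched "ce".
        return Pattern.compile("(?<!\\w)\\Q" + stopWord + "\\E(?!\\w)",
                               Pattern.CASE_INSENSITIVE);
    }

    // Rewrites every match so it appears exactly as spelled in the stop list.
    static String applyStopWord(String text, String stopWord) {
        return wholeWord(stopWord).matcher(text)
                .replaceAll(Matcher.quoteReplacement(stopWord));
    }

    public static void main(String[] args) {
        // Accented input: é is not an ASCII \w, so "ce" looks like a whole word.
        System.out.println(applyStopWord("Gr\u00e9ce", "CE"));
        // Unaccented input: the preceding "e" is a \w, so nothing is changed.
        System.out.println(applyStopWord("Grece", "CE"));
    }
}
```

The first call shows the unwanted replacement after the é; the second leaves the word alone.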

Suggestions on how to deal with such examples are appreciated.

Sascha_S · July 10, 2024, 1:00pm

Would using POSIX Bracket Expressions help, maybe?

"(?<![^[:print:]])\Q" + stopWord + "\E(?![^[:print:]])"

Replaced the \w (word char) with [:print:] (visible chars and spaces).
Maybe you then need to rule out just the spaces?

Jonathan_Ashwell · July 10, 2024, 1:13pm

Thanks, but I’m afraid not. I’m testing in RegExRX, and if I use

(?<![^[:print:]])\QCE\E(?![^[:print:]])

with the word

 cerces

I get two matches when there should be none.
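
One possible reading of this result: the double negative in (?<![^[:print:]]) only rejects a match when the neighbouring character is non-printable, so ordinary letters on either side are still allowed. Below is a rough Java sketch of that behaviour, assuming a case-insensitive search. Java spells the class \p{Print} and limits it to ASCII, so this only approximates the pattern above; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the [:print:] lookarounds.
class PrintClassDemo {

    // Counts matches of the stop word when the only restriction is that the
    // characters next to it must not be non-printable.
    static int countMatches(String text, String stopWord) {
        Pattern p = Pattern.compile(
                "(?<![^\\p{Print}])" + Pattern.quote(stopWord) + "(?![^\\p{Print}])",
                Pattern.CASE_INSENSITIVE); // assumption: case is ignored
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Both "ce" occurrences sit between printable characters, so both match.
        System.out.println(countMatches(" cerces", "CE"));
    }
}
```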

Kem_Tekinay · July 10, 2024, 1:37pm

Try this:

(?<=\pZ|^)\QCERCES\E(?=\pZ|$)

Jonathan_Ashwell · July 10, 2024, 2:22pm

@Kem_Tekinay Thank you, Kem, this is very close. It correctly recognizes that an accented character should not be treated as the end of a word. But it fails to find a match when the word is followed by punctuation.

For example, “dna” is found if it is freestanding, but not if it is followed by punctuation.

Kem_Tekinay · July 10, 2024, 2:28pm

Try this:

(?<=\pZ|^)\Qcérces\E(?=\pZ|\pP|$)

In RegExRX, under the pop-up menu next to Search Pattern, you will find all kinds of tokens. Under “Unicode Scripts & Properties” you will find the symbols you can use with \p or \P along with their explanations.
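
As a rough illustration of how the two patterns in this thread differ, here is a sketch in Java, whose regex engine accepts the same \p{Z} (separator) and \p{P} (punctuation) properties. The class, constant, and method names are invented for the example, Pattern.quote stands in for writing \Q...\E by hand, and the case-insensitive flag is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two lookahead variants.
class UnicodeBoundaryDemo {

    // Left boundary: a Unicode separator or the start of the text.
    static final String LEFT = "(?<=\\p{Z}|^)";
    // First attempt: separator or end of the text.
    static final String RIGHT_SEPARATOR_ONLY = "(?=\\p{Z}|$)";
    // Revised: separator, punctuation, or end of the text.
    static final String RIGHT_WITH_PUNCTUATION = "(?=\\p{Z}|\\p{P}|$)";

    // Reports whether the stop word occurs as a whole word under the given
    // right-hand boundary.
    static boolean found(String text, String stopWord, String right) {
        Pattern p = Pattern.compile(LEFT + Pattern.quote(stopWord) + right,
                Pattern.CASE_INSENSITIVE); // assumption: case is ignored
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // é is neither a separator nor the start of the text: no match.
        System.out.println(found("Gr\u00e9ce", "CE", RIGHT_WITH_PUNCTUATION));
        // A trailing period defeats the first pattern but not the revised one.
        System.out.println(found("extract dna.", "DNA", RIGHT_SEPARATOR_ONLY));
        System.out.println(found("extract dna.", "DNA", RIGHT_WITH_PUNCTUATION));
    }
}
```

Note that the lookbehind still accepts only a separator or the start of the text, so a stop word that directly follows an opening parenthesis or quotation mark would be skipped unless \pP is added on that side as well.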

Jonathan_Ashwell · July 10, 2024, 2:58pm

Yes, that did the trick!

I had actually explored that menu, but you have to somewhat know what the answer is going to be to understand and apply the information. I did a lot of googling before posting, and answers to similar questions were all over the place, many restricting matches to Unicode character ranges and such. I figured there must be a more elegant solution (searches that take into account accented characters seem pretty basic, especially outside the US). Thanks for providing it.