RegExMatch - SubExpressionStartB gives erroneous results when diacritical marks are present in text

Torsten_Bernhard · July 20, 2022, 1:38pm

Feedback Case Number: 69414

SubExpressionStartB gives erroneous results when diacritical marks are present in text, because it does not account for them. Tested with German Umlaute ‘öäü’ and French ‘ç’.

I know that the excellent MBS plugin offers PCRE2, and that a feature request for this was posted back in 2007 by @Fabian_Eschrich (case number 4801).

Whoever is dealing with languages other than English may appreciate getting this straight and right out of the box. Therefore I posted it as a bug and not as a missing feature.

P.S. I haven’t found any workaround for this (apart from using a plugin).

Kem_Tekinay · July 20, 2022, 1:43pm

Can you post the code you are using to test this?

Torsten_Bernhard · July 20, 2022, 1:44pm

Sure, please have a look at issue #69414. It has a sample project attached.

Beatrix_Willius · July 20, 2022, 1:55pm

Why are you trying to use a byte method on UTF8 text? That can’t work. At all.

Not a bug.

Torsten_Bernhard · July 20, 2022, 2:07pm

What do you suggest?

Kem_Tekinay · July 20, 2022, 5:54pm

It would help to understand how UTF-8 works. This is a good starting point:

In short, a string holds bytes, and, when it is UTF-8 encoded, each character is represented by 1, 2, 3, or 4 bytes. In your string some characters are made up of more than one byte, so while the character position of “ä” may be 2, the byte position of everything after it is higher than its character position. RegExMatch returns the byte position, so it’s up to you to translate that to the character position that the text control is looking for.

The easiest way to do that is to take the leftmost bytes of your original string, up to the match’s byte offset, using LeftBytes, and then get the Length of the result.
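
To make the byte/character difference concrete, here is a minimal sketch. It assumes API 2 string methods and a precomposed “ä” (two bytes in UTF-8); the variable names are made up for illustration.

```xojo
' Assumption: the literal is UTF-8 and "ä" is a single precomposed code point.
Var s As String = "Bär"

Var charCount As Integer = s.Length ' 3 characters
Var byteCount As Integer = s.Bytes  ' 4 bytes, because "ä" occupies two

' A match on "r" starts at zero-based byte offset 3, but at character offset 2.
Var byteOffset As Integer = 3
Var charOffset As Integer = s.LeftBytes(byteOffset).Length ' 2
```

LeftBytes(3) returns “Bä”, which is two characters long, so its Length is exactly the character offset where the match begins.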

Torsten_Bernhard · July 20, 2022, 10:49pm

Thank you @Kem_Tekinay! Got it working and can close #69414.

The code:

' sample is a Text Area control
sample.SelectionStart = sample.Text.LeftBytes(match.SubExpressionStartB(0)).Length
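
For anyone finding this later, a slightly fuller sketch of the same idea follows. The search pattern is only an example, and the Nil check and SelectionLength line are additions that were not in the sample project; it assumes "sample" is the Text Area and that its text is UTF-8.

```xojo
' Hypothetical setup: the pattern is an example, not the one from the case.
Var rx As New RegEx
rx.SearchPattern = "für"

Var match As RegExMatch = rx.Search(sample.Text)
If match <> Nil Then
  ' SubExpressionStartB(0) is the zero-based byte offset of the whole match.
  Var byteOffset As Integer = match.SubExpressionStartB(0)

  ' Convert bytes to characters by measuring the text that precedes the match.
  sample.SelectionStart = sample.Text.LeftBytes(byteOffset).Length

  ' Select as many characters as the matched text contains.
  sample.SelectionLength = match.SubExpressionString(0).Length
End If
```

Measuring the matched string with Length (rather than Bytes) keeps the selection length in characters too, which is what the control expects.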