RegExMatch - SubExpressionStartB gives erroneous results when diacritical marks are present in text

Feedback Case Number: 69414

SubExpressionStartB gives erroneous results when diacritical marks are present in text, because it does not account for them. Tested with German umlauts ‘öäü’ and French ‘ç’.

I know that the excellent MBS plugin offers PCRE2 and that a feature request for this was posted in 2007 by @Fabian_Eschrich (case number 4801).

Anyone working with languages other than English would appreciate having this work correctly out of the box. That is why I filed it as a bug rather than as a missing feature.

P.S. I haven’t found any workaround for this (apart from using a plugin).

Can you post the code you are using to test this?

Sure, please have a look at issue #69414. It has a sample project attached.

Why are you trying to use a byte method on UTF8 text? That can’t work. At all.

Not a bug.

What do you suggest?

It would help to understand how UTF-8 works. This is a good starting point:

In short, a string holds bytes, and, when UTF-8 encoded, 1, 2, 3, or 4 bytes represent a character. In your string some characters are made up of more than one byte, so anything that follows an “ä” has a byte position that is higher than its character position. SubExpressionStartB returns the byte position, so it’s up to you to translate that to the character position that the text control is looking for.
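To make the byte-versus-character difference concrete, here is a small sketch. It is written in Python rather than Xojo purely for illustration, and the sample characters and the word "Bär" are my own choices, not taken from the attached project.

```python
# UTF-8 byte count per character; escapes pin the exact code points.
chars = ["a", "\u00e4", "\u00e7", "\u20ac", "\U0001f600"]  # a, ä, ç, €, 😀
byte_counts = [len(c.encode("utf-8")) for c in chars]
print(byte_counts)  # [1, 2, 2, 3, 4]

# Zero-based position of "r" in "Bär", counted in characters and in bytes.
word = "B\u00e4r"
char_pos = word.index("r")                   # counts characters
byte_pos = word.encode("utf-8").index(b"r")  # counts bytes; "ä" takes two
print(char_pos, byte_pos)  # 2 3
```

The two positions agree only as long as every preceding character is a single byte, which is why pure ASCII text hides the problem.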

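The easiest way to do that is to take the leftmost bytes of your original string, up to the byte offset of the match, using LeftBytes, and then get the Length of the result. That Length is the number of characters before the match, which is the character position you need.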
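The same idea can be sketched outside Xojo. The Python snippet below is only an illustration under my own assumptions: `byte_to_char_offset` is a hypothetical helper name, and matching against the encoded bytes stands in for a regex engine that reports byte offsets.

```python
import re

def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Count the characters in the first byte_offset UTF-8 bytes of text."""
    # Assumes byte_offset falls on a character boundary, as a match start does;
    # cutting through the middle of a character would raise UnicodeDecodeError.
    prefix = text.encode("utf-8")[:byte_offset]
    return len(prefix.decode("utf-8"))

text = "\u00f6\u00e4\u00fc \u00e7"  # "öäü ç"
data = text.encode("utf-8")

# Searching the bytes yields a byte offset, similar to SubExpressionStartB.
match = re.search(re.escape("\u00e7".encode("utf-8")), data)
byte_start = match.start()                            # 7: three 2-byte letters plus a space
char_start = byte_to_char_offset(text, byte_start)    # 4
print(byte_start, char_start)
```

Both offsets are zero-based, so the converted value can be used directly wherever a zero-based character position is expected.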

Thank you @Kem_Tekinay! Got it working and can close #69414.

The code:

' sample is a Text Area control
sample.SelectionStart = sample.Text.LeftBytes(match.SubExpressionStartB(0)).Length