How to compare UTF-8 strings that are visually equal

Arthur_Gabhart · July 10, 2018, 8:13pm

I have in a textarea the Korean characters "길이 (1)".
I have found 3 different binary compositions of this string.

EFBB BFEA B8B8 EC9D B420 2831 29 (length 7)
EAB8 B8EC 9DB4 2028 3129 (length 6)
EAB8 B8EC 9DB4 2020 2831 29 (length 7)

The first one is the string stored in memory (and also the only one on Windows).
The second one is when the text is read into a read-only area.
The third one is after the text has entered the TextChange event.

Is there a way to account for these differences in the binary representation?
Is there a comparison method that I could be using?
Would it be different if I used Xojo.Core.Text?

Christian_Schmitz · July 10, 2018, 8:34pm

You need to do Unicode normalization.
We do have functions for that in MBS Plugins.
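
For readers unfamiliar with the term, here is a minimal sketch of what Unicode normalization does, written in Python with the standard `unicodedata` module rather than Xojo or the MBS Plugins. The helper name `visually_equal` is made up for this illustration. It shows that the same Hangul text can exist as precomposed syllables or as decomposed jamo, and that normalizing both sides to one form (NFC here) before comparing makes them equal.

```python
import unicodedata

precomposed = "길이"  # two precomposed Hangul syllables
decomposed = unicodedata.normalize("NFD", precomposed)  # same text as conjoining jamo

def visually_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The raw strings differ code point by code point...
raw_equal = precomposed == decomposed
# ...but compare equal once both are normalized.
normalized_equal = visually_equal(precomposed, decomposed)
```

Normalization only handles composed versus decomposed forms. It does not remove a byte order mark or change the number of spaces.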

Arthur_Gabhart · July 10, 2018, 8:46pm

My confusion is that this is what Xojo itself gives me. A string passed from one place to another on the same machine instance should keep the same binary values.

Christian_Schmitz · July 10, 2018, 9:05pm

After a closer look:

1. is with a UTF-8 BOM at the front. You can remove the first 3 bytes.
2. is the right one.
3. has two spaces instead of one.
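
The three byte sequences from the first post can be checked with a short sketch. This is Python, not Xojo, and the `canonical` helper is a hypothetical name chosen for the illustration. It decodes each sequence as UTF-8, drops a leading BOM (U+FEFF) if present, and collapses runs of spaces to a single space. Whether collapsing spaces is acceptable depends on the application, so treat that step as an assumption.

```python
import re

# The three byte sequences from the first post, written as hex.
samples = [
    "EFBBBFEAB8B8EC9DB420283129",  # starts with the UTF-8 BOM (EF BB BF)
    "EAB8B8EC9DB420283129",        # no BOM, one space
    "EAB8B8EC9DB42020283129",      # no BOM, two spaces
]

def canonical(raw: bytes) -> str:
    """Hypothetical helper: decode UTF-8, drop a leading BOM, collapse spaces."""
    text = raw.decode("utf-8")
    if text.startswith("\ufeff"):  # the BOM decodes to the code point U+FEFF
        text = text[1:]
    return re.sub(r" +", " ", text)

decoded = [bytes.fromhex(s).decode("utf-8") for s in samples]
lengths = [len(t) for t in decoded]  # character counts, BOM included
cleaned = [canonical(bytes.fromhex(s)) for s in samples]
all_equal = len(set(cleaned)) == 1
```

After this cleanup all three inputs reduce to the same string, so a plain equality test is enough.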

So there is no problem, except maybe one on the chair.

Arthur_Gabhart · July 10, 2018, 9:17pm

The BOM addition makes a little sense.
That is easier to control for comparison, but I’m confused why it’s there in some cases and not in others.

Arthur_Gabhart · July 11, 2018, 1:59am

I know what you mean. I added an improvement to my files, but I forgot to take the BOM off of the file. It bit back.