FolderItem.name encoding subtly different between 2019R1.1 and 2019R3

Jonathan_Ashwell · January 13, 2020, 9:54pm

My app stores the names of FolderItems, such as PDFs, and then finds and displays them when the user selects the name from a popup menu. I ran into a subtle change in 2019R3 that broke existing code and may affect others.

Although the file name is UTF-8 encoded in both versions of Xojo, the bytes differ (precomposed vs. decomposed?). Here’s an example for the file name

Krämer 2013.pdf

In R1.1, the bytes are

4B72 C3A4 6D65 7220 3230 3133 2E70 6466

In R3 they are

4B72 61CC 886D 6572 2032 3031 332E 7064 66

The good news is that when I search for f.child("Krämer 2013.pdf") with the first encoding, it finds the file whose name has the second encoding (in the IDE).
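To see where the two byte sequences come from, here is a minimal sketch in Python (not Xojo, and not a claim about what FolderItem does internally). It assumes the two forms correspond to Unicode normalization forms NFC (precomposed) and NFD (decomposed); `hex_pairs` is a hypothetical helper name.

```python
import unicodedata

# The file name with a precomposed "ä" (U+00E4).
name = "Kr\u00e4mer 2013.pdf"

def hex_pairs(text: str, form: str) -> str:
    """Normalize text to the given form and return its UTF-8 bytes as hex."""
    data = unicodedata.normalize(form, text).encode("utf-8")
    return data.hex().upper()

nfc_hex = hex_pairs(name, "NFC")  # precomposed form, 16 bytes
nfd_hex = hex_pairs(name, "NFD")  # decomposed form, 17 bytes

print(nfc_hex)
print(nfd_hex)
```

The NFC output matches the byte pattern reported for R1.1 and the NFD output matches the one reported for R3.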

Norman_Palardy · January 13, 2020, 10:19pm

Not surprising, since the underpinnings of FolderItem did get replaced with much more up-to-date API calls, particularly for macOS.
The difference could be down to how those macOS APIs return the names.

TimStreater · January 13, 2020, 10:26pm

[quote=471339:@Jonathan Ashwell]Krämer 2013.pdf

In R1.1, the bytes are

4B72 C3A4 6D65 7220 3230 3133 2E70 6466

In R3 they are

4B72 61CC 886D 6572 2032 3031 332E 7064 66.[/quote]

R1.1

U+00E4 ä c3 a4 LATIN SMALL LETTER A WITH DIAERESIS

R3

61 is ‘a’, then followed by:

U+0308 ◌̈ cc 88 COMBINING DIAERESIS

(UTF-8 info from https://www.utf8-chartable.de/unicode-utf8-table.pl )
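The same breakdown can be produced programmatically. This is a small Python sketch (not Xojo) using the standard `unicodedata` module; `describe` is a hypothetical helper name introduced only for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, UTF-8 bytes as hex, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", ch.encode("utf-8").hex(), unicodedata.name(ch))
        for ch in text
    ]

# One code point: the precomposed letter.
precomposed = describe("\u00e4")

# Two code points: base letter followed by the combining mark.
decomposed = describe(unicodedata.normalize("NFD", "\u00e4"))

for row in precomposed + decomposed:
    print(*row)
```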

Jonathan_Ashwell · January 13, 2020, 10:27pm

You may not be surprised, but I certainly didn’t expect this change in behavior (or I wouldn’t have spent hours tracking down the problem). I understand why it happened, and I didn’t post here because I thought it was a bug. I posted so that others who encounter this might figure it out more quickly than I did.

TimStreater · January 13, 2020, 10:29pm

While that is good, better would be to request a note in the documentation.

Tim_Parnell · January 13, 2020, 10:42pm

http://documentation.xojo.com/resources/release_notes/2019r2.html
“FolderItem updated to use latest OS APIs on macOS.”

Knowing what the side effects of that may be is just too much to ask. FolderItem behaves differently, and it is documented why.

Mike_D · January 13, 2020, 10:46pm

I also find this issue shows up when moving files between mac and Windows filesystems (prior to 2019).

See https://forum.xojo.com/20173-bug-in-replaceall/p1#p169277 for an MBS-based way to compose or decompose.

TimStreater · January 13, 2020, 11:12pm

[quote=471345:@Tim Parnell]http://documentation.xojo.com/resources/release_notes/2019r2.html
“FolderItem updated to use latest OS APIs on macOS.”

Knowing what the side effects of that may be is just too much to ask. FolderItem behaves differently, and it is documented why.[/quote]
Release Notes are interesting, but shouldn’t be a source of documentation, any more than threads here should be. I think it’s entirely reasonable for the change in OS APIs on macOS to be mentioned on the doc page for FolderItem, along with any side-effects that people can report as and when they find them.

In fact, more generally, one might copy what the PHP pages do, where each PHP documentation page includes, at the end, user-contributed notes which would be ideal for this sort of thing.

Michael_Hußmann · January 14, 2020, 2:01am

In Unicode, different (i.e. decomposed and precomposed) sequences of code points and thus bytes may represent the same character and are to be treated as canonically equivalent. Apple appears to prefer fully decomposed file names but thankfully normalizes names received from foreign file systems so on macOS there cannot be two files with visually identical names that are yet considered distinct.

The upshot is that one cannot rely on some text to always be represented by the same sequence of bytes, even when the encoding is the same.
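One way to cope with this when names are stored and looked up later is to fold every name to a single normalization form before using it as a key. The following is a Python sketch of that idea, not Xojo code. The choice of NFC is arbitrary, and the `key` helper and the path in the dictionary are hypothetical.

```python
import unicodedata

def key(name):
    """Fold a file name to one canonical form (NFC chosen arbitrarily)."""
    return unicodedata.normalize("NFC", name)

# Names saved earlier in precomposed form, mapped to hypothetical paths.
stored = {key("Kr\u00e4mer 2013.pdf"): "/docs/Kr\u00e4mer 2013.pdf"}

# The same name as a file API might report it in decomposed form.
reported = "Kra\u0308mer 2013.pdf"

raw_hit = reported in stored              # lookup on the raw code points
normalized_hit = key(reported) in stored  # lookup after normalizing

print(raw_hit, normalized_hit)
```

The raw lookup misses because the code points differ, while the normalized lookup finds the entry.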

Christian_Schmitz · January 14, 2020, 1:01pm

If you use MBS Xojo Plugins, you can use ConvertUnicodeToCharacterCompositionMBS and ConvertUnicodeToCharacterDecompositionMBS to convert.

Norman_Palardy · January 14, 2020, 3:02pm

[quote=471357:@Michael Hußmann]In Unicode, different (i.e. decomposed and precomposed) sequences of code points and thus bytes may represent the same character and are to be treated as canonically equivalent. Apple appears to prefer fully decomposed file names but thankfully normalizes names received from foreign file systems so on macOS there cannot be two files with visually identical names that are yet considered distinct.

The upshot is that one cannot rely on some text to always be represented by the same sequence of bytes, even when the encoding is the same.[/quote]
Except Xojo does NOT behave this way, although maybe it should?

Dim mb1 As New memoryblock(16)
mb1.UInt8Value(0) = &h4B
mb1.UInt8Value(1) = &h72 
mb1.UInt8Value(2) = &hC3
mb1.UInt8Value(3) = &hA4 
mb1.UInt8Value(4) = &h6D
mb1.UInt8Value(5) = &h65 
mb1.UInt8Value(6) = &h72
mb1.UInt8Value(7) = &h20 
mb1.UInt8Value(8) = &h32
mb1.UInt8Value(9) = &h30 
mb1.UInt8Value(10) = &h31 
mb1.UInt8Value(11) = &h33
mb1.UInt8Value(12) = &h2E 
mb1.UInt8Value(13) = &h70
mb1.UInt8Value(14) = &h64 
mb1.UInt8Value(15) = &h66

Dim mb2 As New memoryblock(17)

mb2.UInt8Value(0) = &h4B
mb2.UInt8Value(1) = &h72 
mb2.UInt8Value(2) = &h61
mb2.UInt8Value(3) = &hCC 
mb2.UInt8Value(4) = &h88
mb2.UInt8Value(5) = &h6D 
mb2.UInt8Value(6) = &h65
mb2.UInt8Value(7) = &h72 
mb2.UInt8Value(8) = &h20
mb2.UInt8Value(9) = &h32 
mb2.UInt8Value(10) = &h30
mb2.UInt8Value(11) = &h31 
mb2.UInt8Value(12) = &h33
mb2.UInt8Value(13) = &h2E 
mb2.UInt8Value(14) = &h70
mb2.UInt8Value(15) = &h64 
mb2.UInt8Value(16) = &h66


Dim str1 As String = DefineEncoding(mb1, Encodings.UTF8)
Dim str2 As String = DefineEncoding(mb2, Encodings.UTF8)

If mb1 = mb2 Then
  Break
End If

If str1 = str2 Then
  break
End If

You will not hit either breakpoint.
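For comparison, here is the same experiment sketched in Python rather than Xojo, using the two byte sequences from the test above. Python's `==` on strings also compares code points without normalizing, so it shows the same result; the extra step of normalizing both sides first is an addition for illustration, not something the Xojo code does.

```python
import unicodedata

# The two byte sequences from the test: precomposed and decomposed.
b1 = bytes.fromhex("4B72C3A46D657220323031332E706466")
b2 = bytes.fromhex("4B7261CC886D657220323031332E706466")

s1 = b1.decode("utf-8")
s2 = b2.decode("utf-8")

bytes_equal = b1 == b2    # raw bytes differ
strings_equal = s1 == s2  # code points differ as well

# Normalizing both sides to the same form makes them compare equal.
nfc_equal = unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)

print(bytes_equal, strings_equal, nfc_equal)
```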

Norman_Palardy · January 14, 2020, 3:17pm

Xojo does not correctly implement canonical equivalence from the Unicode standard.
See UAX #15: Unicode Normalization Forms
<https://xojo.com/issue/58838>

Michael_Hußmann · January 14, 2020, 6:00pm

Yeah, I suppose the operator = should respect canonical equivalence. It is probably a good idea to use String.Compare instead which already does.

Norman_Palardy · January 14, 2020, 6:04pm

It would not surprise me to find that most people would expect = to do the same

Tim_Jones · January 15, 2020, 4:39pm

I have to thank Jonathan for starting this thread. I ran into this and just chalked it up to one more reason for me to completely ignore r2.1+ for real work.

John_McKernon · January 15, 2020, 10:29pm

This seems like something that would be easy for Xojo to implement, so we don’t have to wade through miles of code looking for places where strings are compared using =. Replacing every = with a function call would be tedious in the extreme, to say nothing of error-prone.

Or is there some existing easy way to implement this?

Norman_Palardy · January 15, 2020, 10:52pm

Xojo already has it implemented, just not when using the = operator between two strings.
Making that work should be VERY doable.

Michael_Hußmann · January 16, 2020, 12:55am

Testing for equality with = is much faster than using String.Compare, and I suppose we wouldn’t want = to become less efficient when comparing strings. I have no idea what fixing = would entail with regard to speed. It might be preferable to normalize strings internally so any code could rely on canonically equivalent text to be represented by the same sequence of code points and bytes (always precomposed or always decomposed, whatever).

Until canonical equivalence is properly implemented throughout Xojo, using String.Compare for any text received from the outside world may yet be your best bet.

I have tried converting the encoding from UTF8 to some other variant of Unicode and back, hoping the text would be normalized in the process, but to no avail.

kevin_g · January 17, 2020, 10:33am

[quote=471614:@Michael Hußmann]Testing for equality with = is much faster than using String.Compare, and I suppose we wouldn’t want = to become less efficient when comparing strings. I have no idea what fixing = would entail with regard to speed. It might be preferable to normalize strings internally so any code could rely on canonically equivalent text to be represented by the same sequence of code points and bytes (always precomposed or always decomposed, whatever).
[/quote]
We wouldn’t want strings to be normalized automatically or the performance of = to be reduced.
Maybe the strings should have a Normalize method for people who need it or just use the String.Compare method that is available today.

Norman_Palardy · January 17, 2020, 2:40pm

That = doesn’t treat ü (precomposed) and ü (decomposed) as “equal” is really a problem.
I would expect that most users would be VERY surprised that it doesn’t, and it would be the unusual case where you would NOT want it to do that.