FolderItem.name encoding subtly different between 2019R1.1 and 2019R3

My app stores the names of FolderItems, such as PDFs, and then finds and displays them when the user selects the name from a popup menu. I ran into a subtle change in 2019R3 that broke existing code and may affect others.

Although the file name is UTF-8 encoded in both versions of Xojo, the bytes differ (precomposed vs. decomposed?). Here’s an example for the file name

Krämer 2013.pdf

In R1.1, the bytes are

4B72 C3A4 6D65 7220 3230 3133 2E70 6466

In R3 they are

4B72 61CC 886D 6572 2032 3031 332E 7064 66

The good news is that when I search for f.Child("Krämer 2013.pdf") with the first encoding, it finds the file whose name has the second encoding (in the IDE).

Not surprising, since the underpinnings of FolderItem were replaced with much more up-to-date API calls, particularly for macOS.
The difference could be down to how those macOS APIs return the names.

[quote=471339:@Jonathan Ashwell]Krämer 2013.pdf

In R1.1, the bytes are

4B72 C3A4 6D65 7220 3230 3133 2E70 6466

In R3 they are

4B72 61CC 886D 6572 2032 3031 332E 7064 66.[/quote]

R1.1

U+00E4 ä c3 a4 LATIN SMALL LETTER A WITH DIAERESIS

R3

61 is ‘a’, then followed by:

U+0308 ◌̈ cc 88 COMBINING DIAERESIS

(UTF-8 info from https://www.utf8-chartable.de/unicode-utf8-table.pl )
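For anyone who wants to reproduce the two byte sequences outside Xojo, here is a minimal sketch in Python (not Xojo), using only the standard library's unicodedata module. The helper name hex_bytes is made up for this illustration.

```python
import unicodedata

# "Krämer 2013.pdf" with ä as the single code point U+00E4 (precomposed, NFC)
name_nfc = unicodedata.normalize("NFC", "Kr\u00e4mer 2013.pdf")
# The same name with ä split into 'a' + U+0308 COMBINING DIAERESIS (decomposed, NFD)
name_nfd = unicodedata.normalize("NFD", name_nfc)

def hex_bytes(s):
    """Return the UTF-8 bytes of s as space-separated uppercase hex (hypothetical helper)."""
    return s.encode("utf-8").hex(" ").upper()

print(hex_bytes(name_nfc))  # 4B 72 C3 A4 6D 65 72 20 32 30 31 33 2E 70 64 66
print(hex_bytes(name_nfd))  # 4B 72 61 CC 88 6D 65 72 20 32 30 31 33 2E 70 64 66
```

The first line corresponds to the bytes reported for R1.1 and the second to the bytes reported for R3.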

You may not be surprised, but I certainly didn’t expect this change in behavior (or I wouldn’t have spent hours tracking down the problem). I understand why it happened, and I didn’t post here because I thought it was a bug. I posted so that others who encounter this might figure it out more quickly than I did.

While that is good, better would be to request a note in the documentation.

http://documentation.xojo.com/resources/release_notes/2019r2.html
“FolderItem updated to use latest OS APIs on macOS.”

Knowing what the side effects of that may be is just too much to ask. FolderItem behaves differently, and it is documented why.

I also find this issue shows up when moving files between mac and Windows filesystems (prior to 2019).

See https://forum.xojo.com/20173-bug-in-replaceall/p1#p169277 for an MBS-based way to compose or decompose.

[quote=471345:@Tim Parnell]http://documentation.xojo.com/resources/release_notes/2019r2.html
“FolderItem updated to use latest OS APIs on macOS.”

Knowing what the side effects of that may be is just too much to ask. FolderItem behaves differently, and it is documented why.[/quote]
Release Notes are interesting, but shouldn’t be a source of documentation, any more than threads here should be. I think it’s entirely reasonable for the change in OS APIs on macOS to be mentioned on the doc page for FolderItem, along with any side-effects that people can report as and when they find them.

In fact, more generally, one might copy what the PHP pages do, where each PHP documentation page includes, at the end, user-contributed notes which would be ideal for this sort of thing.

In Unicode, different (i.e. decomposed and precomposed) sequences of code points – and thus bytes – may represent the same character and are to be treated as canonically equivalent. Apple appears to prefer fully decomposed file names but thankfully normalizes names received from foreign file systems so on macOS there cannot be two files with visually identical names that are yet considered distinct.

The upshot is that one cannot rely on some text to always be represented by the same sequence of bytes, even when the encoding is the same.
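To illustrate canonical equivalence in code, here is a small Python sketch (not Xojo) that compares two strings after bringing both to the same normalization form. The function name canonically_equal is an assumption made for this example. NFC is chosen arbitrarily; NFD would work just as well as long as both sides use the same form.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "Kr\u00e4mer 2013.pdf"   # ä as U+00E4
decomposed = "Kra\u0308mer 2013.pdf"   # a followed by U+0308

print(precomposed == decomposed)                   # False: code points differ
print(canonically_equal(precomposed, decomposed))  # True: same text after NFC
```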

If you use MBS Xojo Plugins, you can use ConvertUnicodeToCharacterCompositionMBS and ConvertUnicodeToCharacterDecompositionMBS to convert.

[quote=471357:@Michael Hußmann]In Unicode, different (i.e. decomposed and precomposed) sequences of code points – and thus bytes – may represent the same character and are to be treated as canonically equivalent. Apple appears to prefer fully decomposed file names but thankfully normalizes names received from foreign file systems so on macOS there cannot be two files with visually identical names that are yet considered distinct.

The upshot is that one cannot rely on some text to always be represented by the same sequence of bytes, even when the encoding is the same.[/quote]
Except Xojo does NOT behave this way, although maybe it should?

' Precomposed form: ä is the single code point U+00E4 (bytes C3 A4)
Dim mb1 As New MemoryBlock(16)
mb1.UInt8Value(0) = &h4B
mb1.UInt8Value(1) = &h72 
mb1.UInt8Value(2) = &hC3
mb1.UInt8Value(3) = &hA4 
mb1.UInt8Value(4) = &h6D
mb1.UInt8Value(5) = &h65 
mb1.UInt8Value(6) = &h72
mb1.UInt8Value(7) = &h20 
mb1.UInt8Value(8) = &h32
mb1.UInt8Value(9) = &h30 
mb1.UInt8Value(10) = &h31 
mb1.UInt8Value(11) = &h33
mb1.UInt8Value(12) = &h2E 
mb1.UInt8Value(13) = &h70
mb1.UInt8Value(14) = &h64 
mb1.UInt8Value(15) = &h66

' Decomposed form: a (61) followed by U+0308 COMBINING DIAERESIS (bytes CC 88)
Dim mb2 As New MemoryBlock(17)
mb2.UInt8Value(0) = &h4B
mb2.UInt8Value(1) = &h72 
mb2.UInt8Value(2) = &h61
mb2.UInt8Value(3) = &hCC 
mb2.UInt8Value(4) = &h88
mb2.UInt8Value(5) = &h6D 
mb2.UInt8Value(6) = &h65
mb2.UInt8Value(7) = &h72 
mb2.UInt8Value(8) = &h20
mb2.UInt8Value(9) = &h32 
mb2.UInt8Value(10) = &h30
mb2.UInt8Value(11) = &h31 
mb2.UInt8Value(12) = &h33
mb2.UInt8Value(13) = &h2E 
mb2.UInt8Value(14) = &h70
mb2.UInt8Value(15) = &h64 
mb2.UInt8Value(16) = &h66


Dim str1 As String = DefineEncoding(mb1, Encodings.UTF8)
Dim str2 As String = DefineEncoding(mb2, Encodings.UTF8)

' Byte-wise comparison of the two MemoryBlocks
If mb1 = mb2 Then
  Break
End If

' String comparison with the = operator
If str1 = str2 Then
  Break
End If

You will not hit either breakpoint.

Xojo does not correctly implement canonical equivalence as defined by the Unicode standard.
See UAX #15: Unicode Normalization Forms
<https://xojo.com/issue/58838>

Yeah, I suppose the operator = should respect canonical equivalence. It is probably a good idea to use String.Compare instead – which already does.

It would not surprise me to find that most people would expect = to do the same

I have to thank Jonathan for starting this thread. I ran into this and just chalked it up to one more reason for me to completely ignore r2.1+ for real work.

This seems like something that would be easy for Xojo to implement, so that we don’t have to wade through miles of code looking for places where strings are compared using =. Replacing each = with a function call would be tedious in the extreme, to say nothing of error-prone.

Or is there some existing easy way to implement this?

Xojo already has it implemented, just not when using the = operator between two strings :frowning:
Making that work should be VERY doable.

Testing for equality with = is much faster than using String.Compare, and I suppose we wouldn’t want = to become less efficient when comparing strings. I have no idea what fixing = would entail with regard to speed. It might be preferable to normalize strings internally so any code could rely on canonically equivalent text being represented by the same sequence of code points and bytes (always precomposed or always decomposed, whatever).

Until canonical equivalence is properly implemented throughout Xojo, using String.Compare for any text received from the outside world may yet be your best bet.
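As a language-neutral illustration of handling names at the boundary, here is a Python sketch (not Xojo) of the original scenario: a stored file name is looked up among the names a directory listing returns, with both sides normalized first. The names nfc and find_stored_name are hypothetical, and the list of names stands in for whatever the file system reports.

```python
import unicodedata

def nfc(s: str) -> str:
    """Normalize text to NFC before comparing it (hypothetical helper)."""
    return unicodedata.normalize("NFC", s)

def find_stored_name(stored_name, names_on_disk):
    """Return the on-disk name canonically equivalent to stored_name, or None."""
    wanted = nfc(stored_name)
    for name in names_on_disk:
        if nfc(name) == wanted:
            return name
    return None

# Name saved by an older build (precomposed) vs. names reported now (decomposed)
stored = "Kr\u00e4mer 2013.pdf"
listing = ["Notes.txt", "Kra\u0308mer 2013.pdf"]

match = find_stored_name(stored, listing)
print(match is not None)  # True
```

Returning the on-disk spelling rather than the stored one means later file operations use exactly the name the file system reported.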

I have tried converting the encoding from UTF8 to some other variant of Unicode and back, hoping the text would be normalized in the process, but to no avail.
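That result is expected: converting between Unicode encodings changes only how code points are serialized to bytes, not which code points are present. A short Python sketch (not Xojo) shows the same effect with a UTF-16 round trip.

```python
import unicodedata

decomposed = "Kra\u0308mer 2013.pdf"  # a followed by U+0308

# Round trip through UTF-16: the code points come back unchanged
round_tripped = decomposed.encode("utf-16").decode("utf-16")
print(round_tripped == decomposed)                 # True
print([hex(ord(c)) for c in round_tripped[:4]])    # ['0x4b', '0x72', '0x61', '0x308']

# Only an explicit normalization step composes a + U+0308 into U+00E4
composed = unicodedata.normalize("NFC", round_tripped)
print([hex(ord(c)) for c in composed[:3]])         # ['0x4b', '0x72', '0xe4']
```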

[quote=471614:@Michael Hußmann]Testing for equality with = is much faster than using String.Compare and I suppose we wouldn‘t want = to become less efficient when comparing strings. I have no idea what fixing = would entail with regard to speed. It might be preferable to normalize strings internally so any code could rely on canonically equivalent text to be represented by the same sequence of code points and bytes (always precomposed or always decomposed, whatever).
[/quote]
We wouldn’t want strings to be normalized automatically or the performance of = to be reduced.
Maybe strings should have a Normalize method for people who need it, or people can just use the String.Compare method that is available today.

That = doesn’t treat ü (precomposed) and ü (decomposed) as “equal” is really a problem.
I would expect that most users would be VERY surprised that it doesn’t, and it would be the unusual case where you would NOT want it to do that.