Filename Unicode Normalization in Mojave

Unicode filenames are problematic in macOS because Unicode allows multiple ways to express the same letter.

From https://eclecticlight.co/2017/04/06/apfs-is-currently-unusable-with-most-non-english-languages/
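
To make the "multiple ways to express the same letter" point concrete, here is a minimal sketch in Python (chosen only because its standard library ships a normalizer; it is not Xojo code). The helper name `describe` is made up for illustration. It shows the composed (NFC) and decomposed (NFD) spellings of ä as code points and UTF-8 bytes.

```python
import unicodedata

composed = "\u00e4"      # ä as a single code point (NFC form)
decomposed = "a\u0308"   # a followed by COMBINING DIAERESIS (NFD form)

def describe(s):
    """Return (code points, UTF-8 bytes as hex) for a string."""
    points = " ".join(f"U+{ord(c):04X}" for c in s)
    return points, s.encode("utf-8").hex(" ").upper()

print(describe(composed))      # ('U+00E4', 'C3 A4')
print(describe(decomposed))    # ('U+0061 U+0308', '61 CC 88')

# The two strings render identically but are not equal as strings...
print(composed == decomposed)  # False

# ...until both are brought into the same normalization form.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```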

This comes up in a few situations:

  1. cross-platform file transfer
  2. shell scripts / command-line tools

I had written some code to deal with this, and it seems to work fine in Mojave (10.14), but when I test the same code, it seems to break in High Sierra (10.13). Both systems are using APFS.

Is anyone aware of a change between 10.13 and 10.14?

In what way does it break?

Have you read the article above?
I can very well imagine a situation like this: depending on how you create a file “Jürg.txt”, you may or may not be able to access it with e.g. aFolder.Child(“Jürg.txt”). You can even have two files named “Jürg.txt” in the very same folder. They look the same but have different binary representations of the filename, so they are two valid and distinct files for APFS (but not for all APIs).
It all boils down to UTF-8 and composed vs. decomposed representations.
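
A rough sketch of the "two files that look the same" situation, again in Python rather than Xojo. It works on a hypothetical in-memory directory listing instead of a real folder, because what a real filesystem does with these names depends on the OS and volume. The helpers `same_name` and `find_children` are invented for this example.

```python
import unicodedata

def same_name(a, b):
    """Compare two filenames, ignoring composed/decomposed differences."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def find_children(listing, wanted):
    """Return every entry in `listing` that matches `wanted` after normalization."""
    return [name for name in listing if same_name(name, wanted)]

# Hypothetical listing that holds both spellings of the "same" name.
listing = ["J\u00fcrg.txt", "Ju\u0308rg.txt", "notes.txt"]

# As raw strings the two Jürg.txt entries are distinct.
print(len(set(listing)))  # 3

# A normalization-aware lookup finds both, whichever spelling is asked for.
print(len(find_children(listing, "J\u00fcrg.txt")))   # 2
print(len(find_children(listing, "Ju\u0308rg.txt")))  # 2
```

A lookup that compares raw bytes would find only one of the two entries, which matches the "can or can't access it" behaviour described above.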

One situation we’ve run into was on iOS (when Apple moved all iOS devices to APFS with iOS 10.3). A document saved with the name “Zürich” could no longer be opened, because the iOS API (UIManagedDocument) always used the decomposed string for the filename when opening the document. So if the file had been saved with a composed filename, one was (or maybe still is) out of luck.

I just hope Xojo is thinking about that when rewriting their FolderItem framework, and is thoroughly testing files/folders with diacritics (including whether the strings assigned as names get composed or decomposed, so that it works on an APFS filesystem).

Michael said he had a Xojo solution that works, so my question was about the Xojo aspects of “it seems to break in High Sierra”. There are many different ways for things to break.

Does it switch to wrong characters?
Does it crash hard and die?
Does it raise an exception?
Does it create nil folder items?
Does it …

Still investigating, but it looks like what I’m seeing is this:
• Given a filename such as ä.pdf
• get the file path using Xojo FolderItem.ShellPath (Edit: FolderItem.NativePath)
• on 10.13 : \\U00e4.pdf
• on 10.14 : a\\U0308.pdf

I’m passing this path to an API, which fails with the \\U00e4 version but works with the a\\U0308 version.
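
For what it’s worth, the two strings reported here look like the same filename in two normalization forms: U+00E4 is the composed ä, and a + U+0308 is its decomposed form. The Python sketch below (not Xojo) decodes the escapes and checks that. It assumes the escapes are always a backslash-U followed by exactly four hex digits, which is only inferred from the two samples in this thread, and `decode_escapes` is a made-up helper.

```python
import re
import unicodedata

def decode_escapes(text):
    """Replace backslash-U escapes (assumed: four hex digits) with the character."""
    return re.sub(r"\\+U([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)),
                  text)

# The two values as reported for 10.13 and 10.14.
on_10_13 = decode_escapes(r"\\U00e4.pdf")
on_10_14 = decode_escapes(r"a\\U0308.pdf")

print(on_10_13 == on_10_14)  # False: different code point sequences

# Decomposing the 10.13 value (NFD) yields exactly the 10.14 value.
print(unicodedata.normalize("NFD", on_10_13) == on_10_14)  # True
```

If the receiving API really does expect the decomposed form, as the report suggests, then normalizing the path to NFD just before handing it over might be enough, regardless of which OS version produced it.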

What I’m not sure about: is this an OS change, a Xojo bug, etc.? I’m using 2019 R1.

the whole chain of events:

  • Xojo app creates a binary file using a UTF-8 string for the filename
  • from that folderItem, gets the NativePath as a string
  • the string is set to a plist (using CFPreferencesMBS)

(later, after it doesn’t work)

  • use “defaults read” to get the plist entry

EDIT: turns out I’m using “NativePath” not “ShellPath” - apologies.

So, if I understand correctly, using FolderItem1.Child(“name.ext”) can lead to trouble while all Dialog variants will work fine?

This may be unrelated, but what is String.Right(1) supposed to do with decomposed Unicode?

dim filename as string = "ä"
Dim f as FolderItem = GetFolderItem("").child(filename)

dim ta as TextArea = TextArea1
// hold the line ending as a string so it can be concatenated below
dim CR as string = EndOfLine

// these two lines behave as expected - the UTF8 string for ä is C3 A4

ta.appendText "name=" + filename + " : " + EncodeHex(filename, true) + CR
ta.appendText "f.name=" + f.Name + " : " + EncodeHex(f.Name, true) + CR

// shellPath appears to use the decomposed form, where the ä is encoded as 61 CC 88
dim sp as string = f.ShellPath
ta.appendText "f.ShellPath= " + sp + " : " + EncodeHex(sp, true) + CR

// Here is where it gets wacky: taking the right-most character should give us ä
// but instead it displays as ¨ (a bare combining diaeresis) and the hex representation is CC 88
dim sp1 as string = sp.right(1)
ta.appendText "sp1= " + sp1 + " : " + EncodeHex(sp1, true) + CR

// if we grab the right-most 3 bytes then it displays properly as ä
dim sp3 as string = sp.rightB(3)
ta.appendText "sp3= " + sp3 + " : " + EncodeHex(sp3, true) + CR
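
The same effect can be reproduced outside Xojo. In this Python sketch, indexing the last code point of a decomposed string returns only the combining mark (CC 88), just as Right(1) does above. The hypothetical helper `last_character` walks back over combining marks to return the base letter with its accents. It is a simplification and not full grapheme cluster segmentation.

```python
import unicodedata

def hex_bytes(s):
    """UTF-8 bytes of a string as upper-case hex, for comparison with EncodeHex."""
    return s.encode("utf-8").hex(" ").upper()

def last_character(s):
    """Return the last base character plus any combining marks that follow it."""
    i = len(s) - 1
    # combining() is non-zero for combining marks such as U+0308.
    while i > 0 and unicodedata.combining(s[i]):
        i -= 1
    return s[i:]

path = "a\u0308"  # decomposed ä, as seen in the ShellPath above

print(hex_bytes(path[-1]))              # CC 88  (only the combining diaeresis)
print(hex_bytes(last_character(path)))  # 61 CC 88  (the whole ä)

# After composing to NFC, the last code point is the complete letter.
print(hex_bytes(unicodedata.normalize("NFC", path)[-1]))  # C3 A4
```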

Looks like at least two issues:

What is that? (In Xojo)

In the MBS Plugin, the ConvertUnicodeToCharacterDecompositionMBS(text as string) as string function may help.

See
https://www.monkeybreadsoftware.net/global-convertunicodetocharacterdecompositionmbs.shtml

For nearly all paths, our plugin may already do this conversion so that the right path is used.

Fun thing is, I suspect that if you used “Text” instead of String, Right(1) would be correct :P

@Christian Schmitz: but we don’t want to check for composed vs. decomposed forms EVERYWHERE. This should be handled at the OS level.

A ton of functions in MBS Plugin do that conversion already.

Same for a lot of Xojo framework functions, as well as the Cocoa frameworks.

The behavior also seems to be OS-dependent: in macOS 10.11, folderItem.name and .nativePath are the same, but in 10.14 they differ. I’m not sure at which OS version it changed. Maybe APFS with 10.13?