Special Character folder item child encoding

I have to access files with special characters like ü. Which encoding should I use to define the path using the Child command?
When I use SpecialFolder.Desktop.Child("fürmich.txt") I do not get the correct path.
When I use f = New FolderItem("C:\Users\me\Desktop\fürmich.txt", FolderItem.PathModes.Native) I get the correct file path (it exists, whereas the one obtained through Child does not).

I notice a binary difference between the paths I get from both, but would like to know how to encode it properly using a Child command.

I would say it’s a bug if that does not work.

The Child command should accept any string as long as it has a valid encoding, and a string literal like yours does have a valid encoding.

It works for me. Double check that you don’t have a gremlin in the filename (select the line, right click, Clean Invisible ASCII characters).

The FolderItem is valid, but doesn’t exist when created using the Child command. There is a binary difference caused by the ü. The file isn’t created with Xojo, but by the user, in this case on a remote Mac. How was your txt file created?

To show the difference in the actual app, here is a binary view of both paths, one created with the Child command, the other using new folder item;

The top one exists (made with New FolderItem(pathstring, FolderItem.PathModes.Native)) and the bottom one does not exist. They look the same as readable text but are not the same in binary. Why?

A text (string) typed in the IDE (or Pasted) is different than a file name taken in the Finder/Explorer (item name).

In your case, as defined in the original question, I feel a different Windows API is used and leads to the difference you found (how the ü is defined).

But if the user selects the file, you will not run into the problem (bug?).

Sorry, but I do not have a better answer to help resolve this question.

That’s the cause. There are two valid but different ways of encoding accented characters in Unicode: a precomposed character such as "ü" (U+00FC), and a combining sequence such as "u" followed by the combining diaeresis "¨" (U+0308), which together form a single grapheme cluster. The binary form a Mac usually chooses differs from the one Windows usually chooses. When you type the word you end up with one form; when you copy/paste you can bring in the other.
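To make the two forms concrete, here is a minimal sketch in Python (not Xojo, only because Python’s standard library ships a Unicode normalizer). It assumes the file name "für mich.cfg" used later in this thread; the helper name utf8_hex is made up for illustration.

```python
import unicodedata

# The same visible name in its two canonical Unicode forms.
name = "f\u00fcr mich.cfg"                  # typed with precomposed U+00FC
nfc = unicodedata.normalize("NFC", name)    # composed: one code point for ü
nfd = unicodedata.normalize("NFD", name)    # decomposed: u + U+0308

def utf8_hex(s):
    """Hex dump of the UTF-8 bytes, comparable to Xojo's EncodeHex output."""
    return s.encode("utf-8").hex().upper()

print(utf8_hex(nfc))   # contains C3BC for the ü
print(utf8_hex(nfd))   # contains 75CC88 for the ü
print(nfc == nfd)      # the strings are not equal code point for code point
```

Both strings render identically, but a byte-wise comparison (which is what a file system lookup typically amounts to) treats them as different names.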

I don’t think so: the IDE changes the string at paste time (on macOS; Linux? Windows?)…

You are thinking in terms of macOS, but what about receiving a file from a Mac with a grapheme cluster in the name, dropping it in a Windows folder, copying the file path/file name from there (including the grapheme cluster), writing a fixed name by hand in the IDE (composed) in a FolderItem Child expression, and in another place pasting the file path you got from that copy? Both probably differ.

I do not know.

But what I do know is that I run into trouble when my file name has á, à, â, ä, etc. and I want to load it by code instead of using a dialog, because the UTF representation is different.

When I copy the file name from the Finder (macOS), I cannot use .Child("Straßburg.png") to load it from disk.

I cannot speak for the other OSes because of M1/no other OS installed here (nor handy).

I never use accented characters for DB filenames, config files, and so on because of this. It prevents multiplatform clashes when moving files around.

For this, I am 100% OK with you.

Unfortunately, when you deal with files from unknown locations, it is difficult.

I discovered what I wrote above when I wanted to resize French magazine covers that have an “ô” in their name; the images were embedded in an HTML document… and I wanted to keep the magazine name in the image file name (instead of a variable number like mag_0001.png).

BTW: depending on the use, a ListBox header string gets us into trouble if it holds non-ASCII characters… even “#” is rejected by SQLite… so I convert them automatically…

Sorry Boudewijn, I digress.

Yes.


The problem is indeed that I don’t control the name the user gives to the file, and in languages like German, many non standard characters are used. The only solution I see is to scan the binary for the possible incorrect codes, and create a new binary with the correct code, and convert that back to a string.
It seems like a lot of work that in my opinion should be handled by some function.
If anyone has a better suggestion, I’d love to hear it.
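One alternative to scanning the bytes by hand is Unicode normalization, which converts a string to the composed (NFC) or decomposed (NFD) form in one call. A minimal sketch in Python, since I can’t vouch for a built-in equivalent in Xojo; name_variants is a hypothetical helper, not an existing API:

```python
import unicodedata

def name_variants(name):
    """Return the distinct NFC and NFD spellings of a file name.

    Plain ASCII names yield a single entry; names with accented
    characters typically yield two.
    """
    variants = []
    for form in ("NFC", "NFD"):
        candidate = unicodedata.normalize(form, name)
        if candidate not in variants:
            variants.append(candidate)
    return variants

# Try each spelling in turn when building the path to the user's file.
for candidate in name_variants("f\u00fcrmich.txt"):
    print(candidate.encode("utf-8").hex().upper())
```

The idea would be to try each variant as the child name and keep the one that exists, instead of maintaining a lookup table of byte sequences.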

They may be different but both are equally valid. Problem is caused by the fact that one platform chooses one way to make a ü, the other chooses the other way. This is an issue with Unicode.

And how do you deal with that?

It’s an obvious bug in Xojo, since the way file system methods work (and really any String function) is that they take a String in a valid encoding and convert it to whatever is needed for the given OS API. In this case that is a Windows multibyte wide string (that’s what it should convert to internally anyhow).

Given this, if everything were all right, then as long as you have a string with a valid encoding and content that matches that encoding, whichever encoding it is, it should end up correct.

There are a number of things that could be wrong in the Child function. For example, they could be converting it to UTF16, which will sometimes get you the right result but sometimes not (since UTF16 is not the correct candidate).

When dealing with Windows API calls inside my plugins, I take in the Xojo string, convert it to UTF8, and then convert from UTF8 to Windows multibyte using Windows API calls (which gives you correct results).

So I would say you should submit a bug report on the Child function of the FolderItem class.
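A side note on encoding conversion, as a small Python sketch (Python only because it is easy to check; the variable names are illustrative). Converting between UTF-8 and UTF-16 preserves the code points, so a composed and a decomposed ü stay different after conversion as well:

```python
# Composed and decomposed spellings of "für".
composed = "f\u00fcr"       # f, U+00FC, r
decomposed = "fu\u0308r"    # f, u, U+0308, r

# Windows wide strings are UTF-16LE; convert both and dump the bytes.
composed_utf16 = composed.encode("utf-16-le").hex().upper()
decomposed_utf16 = decomposed.encode("utf-16-le").hex().upper()

print(composed_utf16)     # three 16-bit units
print(decomposed_utf16)   # four 16-bit units
```

So whichever encoding the string is converted to on its way to the OS, the two spellings still arrive as two different names unless something normalizes them explicitly.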

I’m not really sure what you’re expecting; they’re both valid and separate file names. I used a hex editor and made two files with the names using the hex you provided above.

Dim f1 As FolderItem = Specialfolder.desktop.child("boudewijn").child("für mich.cfg") '6675CC8872206D6963682E636667
Dim f2 As FolderItem = Specialfolder.desktop.child("boudewijn").child("für mich.cfg") '66C3BC72206D6963682E636667
Dim f3 As New FolderItem("C:\Users\Julian\Desktop\boudewijn\für mich.cfg") '6675CC8872206D6963682E636667
Dim f4 As New FolderItem("C:\Users\Julian\Desktop\boudewijn\für mich.cfg") '66C3BC72206D6963682E636667

system.DebugLog(f1.Name + " " + EncodeHex(f1.name))
system.DebugLog(f2.Name + " " + EncodeHex(f2.name))
system.DebugLog(f3.Name + " " + EncodeHex(f3.name))
system.DebugLog(f4.Name + " " + EncodeHex(f4.name))

Dim d As FolderItem = SpecialFolder.Desktop.child("boudewijn")
For Each file As Folderitem In d.Children
  If file <> Nil Then
    system.DebugLog(file.Name + " " + EncodeHex(file.name))
  End If
Next

Output from the code above:

für mich.cfg 6675CC8872206D6963682E636667
für mich.cfg 66C3BC72206D6963682E636667
für mich.cfg 6675CC8872206D6963682E636667
für mich.cfg 66C3BC72206D6963682E636667
für mich.cfg 6675CC8872206D6963682E636667
für mich.cfg 66C3BC72206D6963682E636667

If you type ü into a TextField and return its EncodeHex on KeyDown, what value do you see on Windows and Mac?

I see ü C3BC in both, so the only way for someone to get the 75CC88 version into the filename is to paste it from another source?

It may look the same to a reader but it’s not the same to a computer. Do you want it to detect this somehow? Because it won’t matter if you are iterating over the list of files in a folder or picking the file with a file chooser. I’m not really sure how you get to the problem you’re describing of Child not working, when it clearly is working as demonstrated above. Do you want the user to type ü into a filename and have it find both versions?

Even though they are both valid file paths, the one derived from Child does not match the actual file’s path, which prevents code from opening the file. What I’d like is a way to get to the correct path using the Child command, preferably without going through a hex lookup and replace for each possible character.

Child is reading the files separately and correctly as seen in the output above.

I’m not sure where you’re getting that Child is somehow incorrect; if you type C3BC into the full path it won’t find 75CC88 either.

You want child to have a “fuzzy” selection of both C3BC and 75CC88 ?
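For what such a “fuzzy” lookup could look like, here is a rough sketch in Python (not Xojo; find_child is a hypothetical helper, not an existing API). It lists the folder and compares names after normalizing both sides to NFC, so either spelling of ü matches:

```python
import os
import unicodedata

def find_child(folder, wanted):
    """Return the path of the entry in `folder` whose name equals `wanted`
    after Unicode normalization, or None if there is no such entry."""
    target = unicodedata.normalize("NFC", wanted)
    for entry in os.listdir(folder):
        # Normalize the on-disk name too, so C3BC and 75CC88 compare equal.
        if unicodedata.normalize("NFC", entry) == target:
            return os.path.join(folder, entry)
    return None
```

The returned path uses the name exactly as stored on disk, so it can be opened directly. If a folder holds both spellings, as in the test above with two files, this sketch simply returns whichever one is listed first.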