Special Character folder item child encoding

Boudewijn_Krijger · February 27, 2022, 5:34pm

The filename in fact comes from a file that is downloaded from an S3 bucket. Its name was typed in by a user, and I have no control over what they type. The only other possible approach would be to remove special characters from the name and use that for further processing, but that’s not very elegant.
Is it strange that I expect a cross-platform tool like Xojo to have a solution for problems that arise from copying files across platforms? I would not even mind converting the path to another type and then converting it back, as long as I don’t have to tell the customer not to use special characters.

anon20074439 · February 27, 2022, 5:44pm

So they’ve uploaded a file to S3 with 75CC88 somehow typed in, but when they type the same key into your downloader app it’s C3BC, so you’re looking for a file with the wrong name? Even if you added this to a full path that is sent to a new FolderItem, I still don’t see how it’s an issue with Child; New FolderItem("… would also miss the file.

Boudewijn_Krijger · February 27, 2022, 5:49pm

Yes, that’s what’s happening. Child itself isn’t the problem; the filename Child is using has a different encoding than expected, resulting in a nonexistent file path.

TimStreater · February 27, 2022, 6:41pm

OK, I made a file abcü.txt on my Mac. I then found a bash script on Ask Ubuntu which can take a line or lines of text and output their hex. So, I then piped the output of ls into that, and got:

abcü.txt
61626375cc882e747874

That is, the filename on macOS has the u-umlaut made up using a u followed by the combining diaeresis, thus:

U+0075 u 75 LATIN SMALL LETTER U
U+0308 ̈ cc 88 COMBINING DIAERESIS

Now, copy that file to a stick formatted with a Windows filesystem and I imagine it leaves the filename alone. So now you have a file on the Windows host with a filename that it doesn’t know how to access.

What do you expect Windows to do, or Xojo, for that matter? If you separately create a file abcü.txt under Windows, it will presumably use the non-combining u-umlaut, thus:

U+00FC ü c3 bc LATIN SMALL LETTER U WITH DIAERESIS

How is a piece of software running on the Windows machine to know which file is required?
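To see the difference from inside Xojo, a minimal sketch along these lines could be used. It builds both spellings of the name explicitly and logs their UTF-8 bytes as hex. The variable names are only illustrative, and it assumes the strings are UTF-8, as Xojo string literals normally are.

```xojo
' Decomposed form: "u" followed by U+0308 COMBINING DIAERESIS (UTF-8 bytes 75 CC 88)
Dim decomposed As String = "abcu" + Encodings.UTF8.Chr(&h0308) + ".txt"

' Composed form: U+00FC LATIN SMALL LETTER U WITH DIAERESIS (UTF-8 bytes C3 BC)
Dim composed As String = "abc" + Encodings.UTF8.Chr(&h00FC) + ".txt"

' Both names look identical on screen, but the byte sequences differ
System.DebugLog(EncodeHex(decomposed))
System.DebugLog(EncodeHex(composed))

' The byte counts differ as well: 10 bytes versus 9 bytes
System.DebugLog(decomposed.Bytes.ToString + " vs " + composed.Bytes.ToString)
```

Logging the hex of a FolderItem's Name in the same way should show which of the two forms a given file on disk actually uses.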

Thomas_ROBISSON · February 27, 2022, 6:49pm

I had a similar problem (see the thread accentuated-characters-in-name-of-folderitem-catalina).
Someone gave me a method to normalize the encoding.

Public Function NrmStgEnc(Extends CeTexte as String, CeForm as UInt32 = 2) As String
  
  ' This method normalizes the encoding of a String. An "é" (1 character) can also be encoded as "´e" (2 characters)
  
  #IF TargetMacOS Then ' Since Catalina there have been problems with accent encodings. A file named "Téiök" may fail the test: (Fich.Name = "Téiök")
    If not(CeTexte.Encoding = Nil) Then ' However, there is no problem when I do  .Child(Fich.Name)  so in that case I do not normalize the encoding
      Declare Function CFStringCreateMutableCopy Lib "Foundation" (alloc as Ptr, maxLength as UInt32, TheString as CFStringRef) as CFStringRef
      Declare Sub CFStringNormalize Lib "Foundation" (TheString as CFStringRef, TheForm as UInt32)
      
      Dim mutableStringRef as CFStringRef = CFStringCreateMutableCopy(Nil, 0, CeTexte) ' Not needed: CeTexte.ConvertEncoding(Encodings.UTF8))
      
      CFStringNormalize(mutableStringRef, CeForm)
      ' CFStringNormalize mutableStringRef, 2 ' One example used 2, the other passed  CeForm  as a parameter
      CeTexte = mutableStringRef
      
      ' Previously I used one of the encodings below and that fixed the problem
      ' CeTexte = CeTexte.ConvertEncoding(Encodings.UTF32LE) ' Encodings.UTF32BE ' Encodings.UTF16BE ' Encodings.UTF16LE ' Encodings.MacRoman
      ' Using one of the encodings below did not fix the problem
      ' CeTexte = CeTexte.ConvertEncoding(Encodings.UTF32) ' Encodings.UTF8 ' Encodings.UTF16
      CeTexte = CeTexte.ConvertEncoding(DefAppEncod) ' DefAppEncod is not defined in this snippet
      
    End If
    
  #EndIf
  
  Return CeTexte
  
End Function

But it’s for Mac only.

Boudewijn_Krijger · February 27, 2022, 7:04pm

Maybe someone is able to create something similar for Windows?

Rick_Araujo · February 27, 2022, 7:04pm

He needs to write a fuzzy function like FolderItem.LocateChild(filename) that returns the child if Child(filename) exists. Otherwise it should try the opposite counterpart (the composed form if the passed filename contains clusters, or the clustered form if the string contains composed accented characters) and return that if it is found. If neither exists, it should keep the original filename as passed, with Exists = False.

Boudewijn_Krijger · February 27, 2022, 7:05pm

That would work, but how do I convert the nonexistent filename to the existing one?

TimStreater · February 27, 2022, 7:05pm

How do you know you need to? That is the question.

Boudewijn_Krijger · February 27, 2022, 7:07pm

If it doesn’t need conversion, the path will exist and it can simply be used. If it does not exist, the filename needs to be converted, after which the path should exist. But the question remains how to convert?

TimStreater · February 27, 2022, 7:10pm

You would have to have a large list of characters which have more than one form. If you limit yourself to Latin characters, that may not be so bad, but what about other scripts: Greek, Cyrillic, plenty of Asian scripts too, all catered for in Unicode?

Rick_Araujo · February 27, 2022, 7:10pm

By analyzing the string. You need to write two normalizers: one takes the string and makes a composed version of it, and the other makes a decomposed (clustered) one. You also need an analyzer that decides which composition is in use, so that you can try the other one next.

Boudewijn_Krijger · February 27, 2022, 7:18pm

I’d be OK with supporting Latin only for starters. But as this issue is not that unique, I feel like I’d be trying to reinvent the wheel. I’m certain there are more apps that suffer from these encoding differences. I can’t imagine it was never solved by anyone. Maybe @Christian_Schmitz has a plugin that could help. The German language is full of non-standard characters like ä, ü, ö and ẞ.

Rick_Araujo · February 27, 2022, 7:39pm

To decide whether a string is composed or decomposed, for most Latin characters you just need to look for the presence of a selection of combining code points in the U+03xx range.

These probably suffice:

Codepoint	char	Description
U+0300	̀	Combining Grave Accent
U+0301	́	Combining Acute Accent
U+0302	̂	Combining Circumflex Accent
U+0303	̃	Combining Tilde
U+0308	̈	Combining Diaeresis
U+030A	̊	Combining Ring Above
U+0327	̧	Combining Cedilla
U+0340	̀	Combining Grave Tone Mark
U+0341	́	Combining Acute Tone Mark
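A check along these lines could be sketched as follows. HasCombiningMarks is a hypothetical helper name, not an existing Xojo function. Rather than testing only the code points listed above, it tests the whole Combining Diacritical Marks block (U+0300 to U+036F), which contains all of them. It assumes the string has a defined encoding, and it reads the string as UTF-32 so that each code point occupies exactly four bytes.

```xojo
Public Function HasCombiningMarks(s As String) As Boolean
  ' Hypothetical helper: True if s contains any code point in U+0300..U+036F
  If s = "" Then Return False
  
  ' In UTF-32 every code point is one 4-byte value, so no decoding logic is needed
  Dim mb As MemoryBlock = s.ConvertEncoding(Encodings.UTF32LE)
  mb.LittleEndian = True
  
  Dim offset As Integer = 0
  While offset + 4 <= mb.Size
    Dim cp As UInt32 = mb.UInt32Value(offset)
    If cp >= &h0300 And cp <= &h036F Then Return True
    offset = offset + 4
  Wend
  
  Return False
End Function
```

A True result suggests the name is in the decomposed form (as macOS produced in the earlier example), so the composed form would be the one to try next, and vice versa.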

anon20074439 · February 27, 2022, 7:40pm

Public Function NormalizeString(s As String, form As Int32) As String
  If s = "" Then Return ""
  
  'see https://docs.microsoft.com/en-us/windows/win32/api/winnls/ne-winnls-norm_form for values for NormForm
  Declare Function NormalizeString Lib "Normaliz.dll" Alias "NormalizeString" ( _
  NormForm As Int32, _
  lpSrcString As WString, _
  cwSrcLength As Int32, _
  pDstString As Ptr, _
  cwDstLength As Int32 _
  ) As Int32
  
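  'the first call, with a Nil buffer, returns an estimate of the required length in UTF-16 characters
  Dim stringLength As Int32 = NormalizeString(form, s, s.Length, Nil, 0)
  If stringLength <= 0 Then Return CType("", String).ConvertEncoding(s.Encoding) 'return "" if there was an error
  'a WString character takes 2 bytes; the extra character leaves room for a terminating null
  Dim normalisedString As New MemoryBlock((stringLength + 1) * 2)
  Dim ok As Int32 = NormalizeString(form, s, s.Length, normalisedString, stringLength)
  If ok <= 0 Then Return CType("", String).ConvertEncoding(s.Encoding) 'return "" if there was an error
  Return normalisedString.WString(0).ConvertEncoding(s.Encoding) 'return with the encoding the same as we received it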
  
End Function

Using a form of 1 (NormalizationC in the NORM_FORM enumeration) will convert both of the types you mentioned previously into a single type, so you can then check whether they match.

You will still need some kind of normalized lookup storage when you download the file, so you can link the typed-in name to the downloaded one. Alternatively, iterate over the whole folder, converting the names as you check, because there is no function that returns every form a string could possibly take.
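The "iterate over the whole folder" option could look roughly like this. It is only a sketch: LocateChild is the hypothetical name suggested earlier in the thread, it relies on the NormalizeString function posted above (so it is Windows-only as written), and it assumes API 2.0 (FolderItem.Children and String.Compare).

```xojo
Public Function LocateChild(Extends parent As FolderItem, wanted As String) As FolderItem
  ' Fast path: the name exists exactly as given
  Dim direct As FolderItem = parent.Child(wanted)
  If direct <> Nil And direct.Exists Then Return direct
  
  ' Slow path: compare every entry in the folder after normalizing both sides to form 1
  Dim target As String = NormalizeString(wanted, 1)
  For Each item As FolderItem In parent.Children
    Dim candidate As String = NormalizeString(item.Name, 1)
    If candidate.Compare(target, ComparisonOptions.CaseSensitive) = 0 Then
      Return item
    End If
  Next
  
  ' Nothing matched: hand back the original item, whose Exists is False
  Return direct
End Function
```

It would be called as `Dim f As FolderItem = folder.LocateChild(typedName)`. The comparison is case-sensitive here to keep the match strict; on a case-insensitive file system a case-insensitive comparison may be the better choice. For large folders, the normalized lookup table mentioned above would avoid rescanning the folder on every call.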

Boudewijn_Krijger · February 27, 2022, 7:49pm

@anon20074439 and @Rick_Araujo Thanks for your help. With this I have something to work with. Much appreciated.

anon20074439 · February 27, 2022, 7:52pm

This is also all provided by icudt65.dll, which is included with every app built with Xojo, but no one has written a wrapper for it yet.

https://icu4c-demos.unicode.org/icu-bin/scompare

Rick_Araujo · February 27, 2022, 8:28pm

Xojo needs to expose such functionalities as part of its standard lib.

Michel_Bujardet · February 27, 2022, 9:45pm

In principle, Xojo FolderItem should find the file irrespective of whether it contains an accented character or not.

However, while Xojo is internally UTF-8, Windows file system is CP-1252 (encodings Windows ANSI).

That may play a role.

What do you get with SpecialFolder.Desktop.Child("fürmich.txt").ShellPath ?

TimStreater · February 27, 2022, 10:19pm

It probably can do, if the file was created on a Windows system. But if you put onto a Windows system a file that was created elsewhere, then it won’t find it even if the filename looks OK.