É, û and… é conversions (!)

Emile_Schwarz · November 14, 2024, 3:16pm

I have data stored inside the file names of certain files. In these data, I have dates in the French format:
janvier, février, mars, avril, mai, juin, juillet, août, septembre, octobre, novembre and décembre.

Have you noticed the diacritics? I cannot convert the date to an object when the month name is février, août or décembre, whatever test I try: if month = “février” fails.

Any ideas are welcome.

AlbertoD · November 14, 2024, 3:22pm

I don’t remember the correct terms, but sometimes one é is not the same as another é. One is a single character and the other is a combination of e and a combining accent (´).

My guess is that this is happening to you. But you need to give more information than “fails”.

Also, it may be a string encoding problem in your data. Can you check what the encoding is, and the hex for “février”, just before you do the if? Then compare that with a test string = “février”.

Eric_Williams · November 14, 2024, 3:26pm

The terms are “composed” and “decomposed” - essentially, a composed character in Unicode means that a single character represents the è, while in a decomposed situation the è is stored as an e plus the `.

They should compare to equal when using the standard string functions, though. Don’t compare the hex values - they will indeed differ between composed and decomposed strings.

Have you tried StrComp instead of = ?
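
A minimal sketch of that test, assuming month holds the string extracted from the file name (the variable name comes from the original question). It only logs which comparisons succeed; which of them treat composed and decomposed text as equal is exactly what it is meant to find out.

```xojo
' Sketch only: month is assumed to hold the text taken from the file name.
Dim literal As String = "février"

' Plain equality, as in the original question.
If month = literal Then System.DebugLog("= matches")

' StrComp returns 0 when the two strings are considered equal.
If StrComp(month, literal, 0) = 0 Then System.DebugLog("StrComp mode 0 (binary) matches")
If StrComp(month, literal, 1) = 0 Then System.DebugLog("StrComp mode 1 (lexicographic) matches")
```

If none of the three lines is logged, the two strings probably differ in more than letter case.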

Emile_Schwarz · November 14, 2024, 3:32pm

These month names are typed from the Finder (macOS).

From a user’s point of view there is nothing fancy about them, and the rest of the text is taken as is, so it is displayed correctly.

What do I mean by “fails”?

When I use:
If month = "février" then mn = 2

mn is never set. The comparison does not work, just as if I had written:
If month = "f|vrier" then mn = 2

Sorry. I do not know how to be more precise on the matter.

In short, I cannot convert a date string “24 février 2024” to a date object (2024-02-24) because of the diacritic; there is no trouble with the other 9 month names.

AlbertoD · November 14, 2024, 3:43pm

You need to check whether the "février" in your code is the same as the "février" in month; if it is not, then using = will not work. You can try StrComp as mentioned by Eric.

You can also check DateTime.FromString; it may help with your conversion (but that depends on the encoding of "février" and on whether it is composed or decomposed).

Rick_Araujo · November 14, 2024, 4:01pm

Get such a string from the system in a sample of code, store the name you get in a String variable like fileStr, break, and inspect fileStr in the debugger (Binary tab). What encoding is it showing? UTF, Nil, other? What are the hex codes there?
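
The same check can also be done in code instead of the debugger. This is a rough sketch, assuming month holds the text taken from the file name; fromFile and literal are names made up for the illustration.

```xojo
' Sketch only: month is assumed to hold the text taken from the file name.
Dim fromFile As String = month
Dim literal As String = "février"

' Log the encoding (it is Nil when the string has no encoding defined).
If fromFile.Encoding Is Nil Then
  System.DebugLog("fromFile encoding: Nil")
Else
  System.DebugLog("fromFile encoding: " + fromFile.Encoding.InternetName)
End If

' Log the raw bytes of both strings, separated by spaces.
System.DebugLog("fromFile: " + EncodeHex(fromFile, True))
System.DebugLog("literal:  " + EncodeHex(literal, True))
```

In UTF-8, a composed é shows up as the bytes C3 A9, while a decomposed one shows up as 65 CC 81 (a plain e followed by the combining acute accent).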

Thomas_ROBISSON · November 14, 2024, 4:41pm

Someone (here on the forum) gave me the solution some time ago, and I wrote a method:

Public Function NrmStgEnc(Extends CeTexte as String, CeForm as UInt32 = -128) As String
  
  Dim TpEncod as TextEncoding ' This method normalizes the Unicode form of a String. An "é" can be stored as 1 character, or as 2 characters ("e" followed by a combining "´")
  
  If not(CeTexte = "") Then ' Since Catalina there are problems with the encoding of accents. A file named "Téiök" may return False for the test: (Fich.Name = "Téiök")
    TpEncod = CeTexte.Encoding
    If not(TpEncod = Nil) Then ' However, there is no problem when I call  .Child(Fich.Name)  so in that case I do not normalize
      #IF TargetMacOS Then ' https://forum.xojo.com/t/accentuated-characters-in-name-of-folderitem-catalina/53892/18
        If CeForm = -128 Then CeForm = 2 ' On macOS, use 2 (kCFStringNormalizationFormC, the composed form)
        Declare Function CFStringCreateMutableCopy Lib "Foundation" (alloc as Ptr, maxLength as UInt32, TheString as CFStringRef) as CFStringRef
        Declare Sub CFStringNormalize Lib "Foundation" (TheString as CFStringRef, TheForm as UInt32)
        
        Dim mutableStringRef as CFStringRef = CFStringCreateMutableCopy(Nil, 0, CeTexte) ' Not needed: CeTexte.ConvertEncoding(Encodings.UTF8))
        
        CFStringNormalize(mutableStringRef, CeForm)
        ' CFStringNormalize mutableStringRef, 2 ' In one example it was 2, in the other CeForm was passed as a parameter
        CeTexte = mutableStringRef
        
        ' Previously I used one of the encodings below and it fixed the problem (I had re-enabled the line for testing)
        ' CeTexte = CeTexte.ConvertEncoding(Encodings.UTF32LE) ' Encodings.UTF32BE ' Encodings.UTF16BE ' Encodings.UTF16LE ' Encodings.MacRoman
        ' Using one of the encodings below did not fix the problem
        ' CeTexte = CeTexte.ConvertEncoding(Encodings.UTF32) ' Encodings.UTF8 ' Encodings.UTF16
        CeTexte = CeTexte.ConvertEncoding(TpEncod) ' return with the same encoding as we received it; I could use DefAppEncod instead
        
      #ElseIf TargetWindows Then ' https://forum.xojo.com/t/special-character-folder-item-child-encoding/68751/35
        #If False Then ' Disabled: this did not work for me, it broke GetFitemAbsPath
          If CeForm = -128 Then CeForm = 1 ' On Windows, use 1 (NormalizationC, the composed form)
          Declare Function NormalizeString Lib "Normaliz.dll" Alias "NormalizeString" ( NormForm as Int32, lpSrcString as WString, cwSrcLength as Int32, pDstString as Ptr, cwDstLength as Int32 ) as Int32
          
          Dim StringLength as Int32 = NormalizeString(CeForm, CeTexte, CeTexte.Length, Nil, 0)
          ' NormalizeString counts UTF-16 characters, so allow 2 bytes per character plus a terminator
          Dim NormalisedString as New MemoryBlock(StringLength * 2 + 2) ' see https://docs.microsoft.com/en-us/windows/win32/api/winnls/ne-winnls-norm_form for values for NormForm
          Dim TampNbre as Int32 = NormalizeString(CeForm, CeTexte, CeTexte.Length, NormalisedString, StringLength)
          If TampNbre > 0 Then
            CeTexte = NormalisedString.WString(0).ConvertEncoding(TpEncod) ' return with the same encoding as we received it
          Else
            CeTexte = CType("", String).ConvertEncoding(TpEncod) ' return "" if there was an error; I could use DefAppEncod instead
          End If
        #EndIf
        
      #EndIf
    End If
  End If
  
  Return CeTexte
  
End Function

Then you do:
MyStringWithNormEncod = MyStringWithoutNormEncod.NrmStgEnc
I didn’t test whether MyStringWithNormEncod ends up with é stored as 1 character or as 2, but if you do that to all your strings, they will all encode é, è, à and ç the same way.
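
Applied to the original question, it could look like the sketch below. FrenchDateToDateTime is a hypothetical helper, not something from this thread or from Xojo; it assumes the date text always has the shape “24 février 2024” with single spaces, and it relies on the NrmStgEnc method above.

```xojo
' Hypothetical helper: converts "24 février 2024" to a DateTime, or returns Nil.
Public Function FrenchDateToDateTime(dateText As String) As DateTime
  Dim monthNames() As String = Array("janvier", "février", "mars", "avril", "mai", "juin", _
    "juillet", "août", "septembre", "octobre", "novembre", "décembre")
  
  ' Expect exactly three parts: day, month name, year.
  Dim parts() As String = dateText.Trim.Split(" ")
  If parts.Count <> 3 Then Return Nil
  
  ' Normalize both sides the same way before comparing.
  Dim wanted As String = parts(1).NrmStgEnc
  For i As Integer = 0 To monthNames.LastIndex
    If wanted = monthNames(i).NrmStgEnc Then
      Return New DateTime(parts(2).ToInteger, i + 1, parts(0).ToInteger)
    End If
  Next
  
  Return Nil ' Unknown month name
End Function
```

Normalizing both the file name part and the literal means it does not matter which form the source code editor or the Finder used. Note that on Windows the normalization branch of NrmStgEnc is disabled, so there the strings are compared unchanged.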

Christian_Wheel · November 15, 2024, 1:02am

If you have MBS there’s a great method called RemoveAccentsMBS that will normalize a string to the non-accented version. You can then compare to that.