Stripping accents from my string

Rick_Araujo · March 8, 2021, 2:43pm

Not the best approach. Better to have String.UnicodeComposed() and String.UnicodeDecomposed() and let the developer pre-compose or decompose a Unicode string and store it as they wish, without adding extra overhead to every comparison at run time.

Emile_Schwarz · March 8, 2021, 4:47pm

What can the developer do with native Xojo in the meantime?

Rick_Araujo · March 8, 2021, 4:52pm

Use 3rd party solutions.

Emile_Schwarz · March 8, 2021, 5:57pm

It’s a mantra; repeat after me:
“Native Xojo solution”
ad infinitum
end repeat

;-)

Tim_Parnell · March 8, 2021, 6:05pm

You could build a dictionary of things to swap and replace like Garry/Thom’s HTMLEncode module does.
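
A minimal sketch of the swap-and-replace idea, written in Python rather than Xojo purely for illustration. The table contents and the names REPLACEMENTS and swap_accents are hypothetical; this is not the HTMLEncode module's code. Note that a table like this only matches precomposed characters, so a decomposed "e" + combining accent passes through untouched.

```python
# Hypothetical replacement table: precomposed accented letter -> plain ASCII letter.
# Extend it with whatever characters your own data contains.
REPLACEMENTS = {
    "\u00e9": "e",  # é
    "\u00e8": "e",  # è
    "\u00ea": "e",  # ê
    "\u00e0": "a",  # à
    "\u00f4": "o",  # ô
    "\u00fb": "u",  # û
    "\u00e7": "c",  # ç
}

def swap_accents(text: str) -> str:
    """Replace each character found in REPLACEMENTS; leave everything else as-is."""
    return "".join(REPLACEMENTS.get(ch, ch) for ch in text)
```

With this table, swap_accents applied to a precomposed "Spécial le Fantôme" gives "Special le Fantome", but the same text in decomposed form keeps its combining marks.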

Jim_Meyer · March 8, 2021, 8:02pm

The declare code I posted above (2 days ago) will convert a UTF8 hex 65CC81 “é” (decomposed/unnormalized) into a UTF8 hex C3A9 “é” (composed/normalized).

Call the function with the “form” set to kCFStringNormalizationFormC = 2

65CC81 is a 65 (plain “e”) followed by CC81, the UTF-8 encoding of the “combining acute accent” character. See: U+0301 COMBINING ACUTE ACCENT – Codepoints
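
The byte values above can be checked with a short Python sketch (Python is used here only as an illustration; it is not the Xojo declare). It relies on the standard unicodedata module, whose "NFC" form corresponds to the composed form discussed here.

```python
import unicodedata

decomposed = "e\u0301"  # plain "e" followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # compose into a single code point

print(decomposed.encode("utf-8").hex())  # 65cc81
print(composed.encode("utf-8").hex())    # c3a9
print(decomposed == composed)            # False: same glyph, different code points
```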

For Windows see: String.Normalize Method (System) | Microsoft Docs

Isn’t this what you are looking for?

Kem_Tekinay · March 8, 2021, 8:15pm

Are we sure the results will be consistent in all cases across platforms? (I need to implement this here soon, but need something that will return the same result on Mac, Windows, and Linux.)

Jim_Meyer · March 8, 2021, 8:35pm

The only place I have run into this problem is with Mac file/folder names… but that does not mean it does not happen in other situations.

On a Mac, if you create a folder using the Finder and then rename it using the keystrokes option-e e to get an é in the file name, it will be decomposed (65CC81)… which is somewhat understandable as I used 2 keystrokes… but I did see the same problem under Windows… and I am not sure what would happen if you used a non-English keyboard that has an “é” key.

Kem_Tekinay · March 8, 2021, 8:37pm

Sorry, I meant, are we sure the normalization methods do the same thing in all cases across platforms?

Rick_Araujo · March 8, 2021, 10:46pm

Unicode composition/decomposition are String methods that Xojo lacks. Create a feature request.
Xojo uses libicu AFAIK, and libicu has all the tools for it.

https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unorm2_8h.html

Thomas_ROBISSON · March 9, 2021, 8:39am

I didn’t search, but in another topic someone gave the same method as Jim Meyer, except that he uses Lib “Foundation” instead of Lib “Carbon.framework”:

Soft Declare Function CFStringCreateMutableCopy Lib "Foundation" (alloc as Ptr, maxLength as UInt32, TheString as CFStringRef) as CFStringRef
Soft Declare Sub CFStringNormalize Lib "Foundation" (TheString as CFStringRef, TheForm as UInt32)

Rick_Araujo · March 9, 2021, 1:46pm

That doesn’t work for Windows, Linux or mobile… libicu is the standard cross-platform way (so maybe Foundation uses a port of it behind the scenes). Android, for instance, already contains it. Xojo has it in its lib folder too. Most Linux distributions already include it as well, and it is one install command away on the rare ones that don’t.

Emile_Schwarz · March 9, 2021, 5:32pm

A crazy idea came to my mind a few minutes ago: what if I set the Encoding to Nil?

Dim Img_Prefix As String

Img_Prefix = ConvertEncoding(Char_Name, Nil)

Select Case Img_Prefix
  
Case ConvertEncoding("Spécial le Fantôme", Nil)

The ConvertEncoding part compiles and apparently works, but the Case above is skipped instead of fired…

Nota: this is bad code; I created it only to check the result.

Nota #2: using the file name in an html document works fine *; I just tested that yesterday. But I prefer using clear ASCII file names instead and keep human viewable strings for displaying purposes.
That is the “what will a user think if he looks at that” syndrome…

If a good solution is not found when that project ends, I will use brute force comparison/strip/replace characters.

I use that string as an image file name (I want it here as pure ASCII) AND it is displayed (unchanged) to the screen in a table row (html).

Emile_Schwarz · March 10, 2021, 7:25am

I checked what NativePath returns and…

e%CC%81 for ‘é’ (cc81 stands for acute)
o%CC%82 for ‘ô’ (CC82 stands for circ)
(and CC80 for grave)

In a Select block, with a few Case lines, I can build something that converts Spécial to special, and so on, as I need, and later add any cases I may have forgotten…
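
As an alternative to listing each case by hand, here is a hedged Python sketch (not Xojo) of a general approach: decompose the text, then drop the combining marks such as CC81, CC82 and CC80. The function name strip_accents is hypothetical, and the approach assumes that only Latin-style accents need removing; letters with no decomposition, such as ø or œ, are left unchanged.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Return text with combining marks removed (é -> e, ô -> o)."""
    # NFD splits a precomposed letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters and non-zero for marks like U+0301.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Because the text is decomposed first, the function gives the same result whether the input arrived composed or decomposed.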

Kem_Tekinay · March 11, 2021, 12:27am

I’ve added ToNormalized to my M_String module (not posted yet) and it’s working for Mac and Windows. Anyone have the Linux code for this?

Kem_Tekinay · March 11, 2021, 6:52am

I solved it for Linux, and my solution is either impressive or horrifying.

I used JavaScript to create a JSON object that linked every composed character to its decomposed version, and another that linked every compatible character to its equal. I then pasted that into a constant and used the resulting Dictionaries to do the transformations.

Seems to work, and I only have to include that constant for Linux.
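
For readers curious what such a table could look like, here is a rough sketch in Python. It is not the JavaScript described above and not the M_String code; the function name build_decomposition_table is hypothetical, and it covers only the canonical (composed to decomposed) mapping, not the compatibility one.

```python
import json
import sys
import unicodedata

def build_decomposition_table() -> dict:
    """Map every character that has a canonical decomposition to its NFD form."""
    table = {}
    for code_point in range(sys.maxunicode + 1):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # skip surrogates, which are not real characters
        ch = chr(code_point)
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:
            table[ch] = decomposed
    return table

table = build_decomposition_table()
as_json = json.dumps(table)  # text that could be pasted into a constant
```

The resulting table depends on the Unicode version bundled with the Python build that generates it, so it would need regenerating to pick up newer characters.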

Thomas_ROBISSON · March 11, 2021, 7:54am

A U is missing in UTF8, UTF16, etc. The U is for Universal and I’m dreaming of a UTFU8 where the second U would be Uniform.

How can we find your module, Kem?

Kem_Tekinay · March 11, 2021, 8:54am

I’ll post it to my site when ready, then post a notice here on the forum. Stay tuned…

Emile_Schwarz · March 11, 2021, 9:32am

I find it incredible that no one can code:

If f.Name = "émile schwarz" Then
   MsgBox "It’s me !"
End If
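
To illustrate why such a comparison can fail and how normalizing both sides helps, here is a small Python sketch (not Xojo). The helper name same_name is hypothetical; it assumes the only difference between the two strings is composed versus decomposed form.

```python
import unicodedata

def same_name(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the composed (NFC) form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

from_disk = "e\u0301mile schwarz"   # decomposed, as a Mac file name may be
in_source = "\u00e9mile schwarz"    # composed, as typically typed in source code

print(from_disk == in_source)           # False
print(same_name(from_disk, in_source))  # True
```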

What I’ve done so far is to display the relevant part of the file name in a TextField; the user has to modify it, then click a Process PushButton.

Works fine most of the time (I still sometimes forget to modify the TextField contents before pressing Process)…

I am still adding code to the process Method (I added a Table of Contents once the html file was fully generated); next: add a button to go back to the start of the document.

BTW: yes, I have two methods that need to strip the diacritics from the image file names:
a. the one that batch-resizes the images (magazine covers),
b. the one that organizes these resized images in a long <table> html document.

The generated html document displays the magazine cover and the Table of Contents for each magazine (thus link forward and backward).

That is my need; many others may fall into that trap with different needs ;-)

I was watching TV about COVID 19 and my brain was thinking about…:

With the above, why must I set an encoding when I read a Text File with TextInputStream ?
(Yes, this is different, here I read the name of the file, not the file)…

Tell a 6 y/o kid who is learning how to develop an application:
You cannot use a file name in your program for comparisons, but you can do whatever you want with its text contents. *

I started this program using a Text file where I pasted the cover images file names until I realized that step was useless (and a loss of time; bad thinking):
“Émile, take that information directly from the images file names !”

This is more efficient: you do not have to modify the text file contents each time you add a cover…not much time, but error prone.

Emile_Schwarz · March 13, 2021, 11:45am

I fell into the same trap, with different code.

Because I use API1, I have to parse a date string by myself. In French, month names hold é, û and é…

In the current project, my dates were… wrong (crazy, I must say): 0002-11-30 (nothing related to what’s in the file names).

After a long quest, I realized the presence of these two characters in three month names. I changed février to fevrier and that date came out fine… 27 fevrier 1961 → 1961-02-27.

But I cannot change all involved dates because they are used as is elsewhere.

So I tried to be creative and found:

  If mName = "janvier" Then
    aMonth = 1
    
  ElseIf Left(mName,2) = "fe" Then
    aMonth = 2
    
  ElseIf mName = "mars" Then
    aMonth = 3
    
  ElseIf mName = "avril" Then
    aMonth = 4
    
  ElseIf mName = "mai" Then
    aMonth = 5
    
  ElseIf mName = "juin" Then
    aMonth = 6
    
  ElseIf mName = "juillet" Then
    aMonth = 7
    
  ElseIf Left(mName,2) = "ao" Then
    aMonth = 8
    
  ElseIf mName = "septembre" Then
    aMonth = 9
    
  ElseIf mName = "octobre" Then
    aMonth = 10
    
  ElseIf mName = "novembre" Then
    aMonth = 11
    
  ElseIf Left(mName,2) = "de" Then
    aMonth = 12
    
  Else
    aMonth = 0 // Will generate a wrong date !
    
  End If

Apparently, that code works fine.
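
For comparison, here is a hedged Python sketch (not Xojo) of the same lookup done by stripping the accents first and then matching the whole month name. The names MONTHS and month_number are hypothetical. Like the code above, it returns 0 for an unknown name.

```python
import unicodedata

# French month names with accents removed, in calendar order.
MONTHS = ["janvier", "fevrier", "mars", "avril", "mai", "juin",
          "juillet", "aout", "septembre", "octobre", "novembre", "decembre"]

def month_number(name: str) -> int:
    """Return 1-12 for a French month name, accented or not; 0 if unknown."""
    decomposed = unicodedata.normalize("NFD", name.strip().lower())
    # Drop combining marks so "février" and "fevrier" both become "fevrier".
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    try:
        return MONTHS.index(plain) + 1
    except ValueError:
        return 0  # caller must treat 0 as "could not parse"
```

Matching the whole name avoids relying on a two-letter prefix, and the result does not depend on whether the input was composed or decomposed.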

PS: I used a bunch of If because in my Xojo, Left(mName,2) is rejected in a Case line…