In a project, I batch process (resize) images and save them.
So far, so good.
But there are accented characters in their file names and I want to remove them (the image files are used in an HTML document) in a DESKTOP application.
As an example, the code below returns Spe?cial when I have Spécial:
ConvertEncoding("Spécial", Encodings.ASCII)
What I want to get is Special.
A search on the Internet led nowhere… (except to a waste of time).
You’ll probably get different results depending on whether the ‘é’ is the single character U+00E9 or a compound sequence consisting of a regular ‘e’ (U+0065) followed by the combining acute accent (U+0301).
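To make the difference concrete, here is a small sketch in Python (not Xojo, purely to illustrate the Unicode behaviour). The helper name to_ascii_lossy is made up for this example, and Python's "replace" error handling is only assumed to resemble what ConvertEncoding does with characters it cannot map:

```python
# Two spellings of the same visible text "Spécial".
composed = "Sp\u00e9cial"      # 'é' as the single code point U+00E9
decomposed = "Spe\u0301cial"   # 'e' (U+0065) followed by combining acute U+0301


def to_ascii_lossy(s: str) -> str:
    """Hypothetical helper: replace every non-ASCII code point with '?'."""
    return s.encode("ascii", errors="replace").decode("ascii")


print(len(composed), len(decomposed))   # 7 8 (code points, not bytes)
print(to_ascii_lossy(composed))         # Sp?cial
print(to_ascii_lossy(decomposed))       # Spe?cial
```

The second result has the same shape as the Spe?cial reported in the question, which is consistent with the file name being stored in decomposed form.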
It seems that ConvertEncoding is not taking decomposed characters into account. Unicode has two ways to represent the same character: the composite “é”, or the decomposed “e” followed by a special code that adds “´” to the preceding character, which seems to be the case Emile is describing above. So Xojo would need something like:
s = s.ConvertComposite().ConvertEncoding( Encodings.ASCII ).ConvertEncoding( Encodings.UTF8 )
The hypothetical ConvertComposite() would find decomposed sequences and convert them to the composite characters to avoid such problems.
I had this code lying around for “normalizing” a string on Mac… It might help:
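One caveat: composing alone would still leave an “é” that ASCII cannot represent. To actually end up with Special, the usual trick goes the other way: decompose first, then drop the combining marks. A minimal sketch in Python (not Xojo; strip_accents is a name invented for this example):

```python
import unicodedata


def strip_accents(s: str) -> str:
    """Return s with accents removed and any remaining non-ASCII dropped."""
    # NFD splits 'é' into 'e' + U+0301, whichever form the input used.
    decomposed = unicodedata.normalize("NFD", s)
    # combining() returns a non-zero class for combining marks such as U+0301.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII (e.g. 'ø', which has no decomposition)
    # is silently dropped here; that policy is an assumption of this sketch.
    return without_marks.encode("ascii", errors="ignore").decode("ascii")


print(strip_accents("Sp\u00e9cial"))    # Special (composed input)
print(strip_accents("Spe\u0301cial"))   # Special (decomposed input)
```

The same idea should carry over to any environment that offers a normalize-to-NFD call, such as the CFStringNormalize declare posted in this thread.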
#if targetMacOS
// s (the input String) and form (a UInt32) are presumably parameters of the enclosing method.
// form is a CFStringNormalizationForm: 0 = NFD, 1 = NFKD, 2 = NFC, 3 = NFKC.
soft declare function CFStringCreateMutableCopy lib "Carbon.framework" (alloc as Ptr, maxLength as UInt32, theString as CFStringRef) as CFStringRef
dim mutableStringRef as CFStringRef = CFStringCreateMutableCopy(nil, 0, s)
soft declare sub CFStringNormalize lib "Carbon.framework" (theString as CFStringRef, theForm as UInt32)
CFStringNormalize mutableStringRef, form
return mutableStringRef
#else
return s
#endif
As I said, the same accented character can have two different “codepoint compositions”: one is a single codepoint for the character with its accent (composite), and the other is the base character followed by an “add this accent to the previous character” codepoint (decomposed), so two codepoints. The second UTF-8 sequence seems to be the problem Emile is seeing.
You are correct: é is stored in 2 bytes… as seen in the debugger.
This may explain a lot of strange things I have seen in recent years!
é (MacRoman) <> é (UTF8) !
Comparing a string typed with the keyboard to a string from f.Name never yields equality, even if they look the same (I typed the file name from… the keyboard).
There is an old Feedback request to allow comparing strings with “canonical equivalence” (that is, normalizing the UTF-8 before comparison), precisely to avoid cases where two strings with the same encoding that are identical to the human eye are considered to be different.
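For illustration, this is roughly what a canonical-equivalence comparison looks like, sketched in Python (not Xojo; canonically_equal is a hypothetical name, and NFC is just one reasonable choice of common form):

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


typed = "Sp\u00e9cial"        # composed, as a keyboard typically produces
from_disk = "Spe\u0301cial"   # decomposed, as a file name may be stored

print(typed == from_disk)                    # False
print(canonically_equal(typed, from_disk))   # True
```

Normalizing both sides to the same form (NFC or NFD, it does not matter which as long as it is the same) is what makes the comparison match what the eye sees.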