Stripping accents from my string

In a project, I batch process (resize) images and save them.

So far, so good.

But there are accented characters in their file names and I want to remove them (the image files are used in an HTML document) in a DESKTOP application.

As an example, the code below returns Spe?cial when I have Spécial:

ConvertEncoding("Spécial", Encodings.ASCII)

What I want to get is Special.

A search on the Internet led nowhere… (except to a waste of time).

I get “Special”, so I can’t reproduce your results.

var s as string = "Spécial"
s = s.ConvertEncoding( Encodings.ASCII ).ConvertEncoding( Encodings.UTF8 )
// Special

You’ll probably get different results depending on whether the ‘é’ is a single precomposed character (U+00E9) or a sequence of two characters: a regular ‘e’ (U+0065) followed by the combining acute accent (U+0301).

I do not know, I typed it with the keyboard :wink:

Thank you Kem,

Probably a bug. I need to rest, I will explore this in the morning.

I hate it when my strings speak in a strange accent. :rofl: (Sorry, couldn’t help it.)

Emile, you should go live in the USA; you would be more in phase with the sun!

No sun today in Strasbourg (Europe).

Yes, it was late in the night; in fact I slept from 6 to 13…

… and I totally forgot about that problem :frowning:

See also RemoveAccentsMBS function in MBS Xojo Plugins.

For: Spécial Le Fantôme

I get:
Spe?cial_Le_Fanto^me

I replaced spaces with underscores.
Used code:

TF_Character.Text = Char_Name.ConvertEncoding( Encodings.ASCII ).ConvertEncoding( Encodings.UTF8 )

Used Xojo: 2015r1 AND 2020r2 (I have not downloaded Xojo 2.1 yet).
Also tested with High Sierra.

I may modify my code to ask the user for the data I need (it's late, I do not recall what other modifications I need to make nor how I will make them).

Last information: the project takes Char_Name from a file (or folder) name (I created it).

It seems that ConvertEncoding does not take decomposed characters into account. Unicode has two ways to represent the same character: the composite “é”, or the decomposed “e” followed by a special code that adds “´” to the preceding character. The latter seems to be the case Emile describes above. So Xojo would need something like

s = s.ConvertComposite().ConvertEncoding( Encodings.ASCII ).ConvertEncoding( Encodings.UTF8 )

The hypothetical ConvertComposite() would find decomposed sequences and convert them to the composite characters to avoid such problems.
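
To illustrate the idea outside Xojo, here is a minimal sketch in Python, using only its standard unicodedata module. Note that it goes the other way from the ConvertComposite() proposed above: it decomposes everything first (NFD) and then drops the combining marks, so both forms of the name end up the same. The helper name strip_accents and the sample strings are my own assumptions for the example, not anything from Xojo.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: remove combining marks from text."""
    # NFD splits precomposed characters such as U+00E9 into a base letter
    # followed by combining marks, so both input forms become identical.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every nonspacing combining mark (category "Mn"), keep the rest.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The same name in precomposed form and in decomposed form.
composed_name = "Sp\u00e9cial Le Fant\u00f4me"
decomposed_name = "Spe\u0301cial Le Fanto\u0302me"

print(strip_accents(composed_name).replace(" ", "_"))
print(strip_accents(decomposed_name).replace(" ", "_"))
```

Both calls should give the same plain name, which is the behaviour wanted for the file names. Characters that have no decomposition (for example ø or ß) pass through unchanged, so this is not a full transliteration.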

I had this code lying around for “normalizing” a string on Mac… It might help:

#if TargetMacOS
  soft declare function CFStringCreateMutableCopy lib "Carbon.framework" (alloc as Ptr, maxLength as UInt32, theString as CFStringRef) as CFStringRef
  soft declare sub CFStringNormalize lib "Carbon.framework" (theString as CFStringRef, theForm as UInt32)

  // CFStringNormalize works in place, so make a mutable copy first
  dim mutableStringRef as CFStringRef = CFStringCreateMutableCopy(nil, 0, s)

  CFStringNormalize mutableStringRef, form
  return mutableStringRef
#else
  // No normalization available here: return the string unchanged
  return s
#endif

'enum CFStringNormalizationForm {
'kCFStringNormalizationFormD = 0,
'kCFStringNormalizationFormKD = 1,
'kCFStringNormalizationFormC = 2,
'kCFStringNormalizationFormKC = 3
'}

where “s” is the string to normalize and “form” is a UInt32 holding one of the values above.

It works if it is a single é typed from the keyboard…

Sorry Kem.

Taking a nap helps, sometimes.

Is the encoding defined? Otherwise ConvertEncoding is bound to fail.

Good question. The answer is: I do not know. I get the data using Char_Name = f.Name.

Is there an encoding in that case ?
UTF8 ?

RemoveAccentsMBS works great. It can remove accents from all Slovak and Czech characters correctly.

As I said, the same accented character can have two different code point sequences: a single code point that already carries the accent (composite), or a base character followed by a code point meaning “add this accent to the previous character” (decomposed, two code points). The second UTF-8 sequence seems to be the problem Emile is seeing.

“é” can be &u00e9 or
“e+´” → é → &u0065+&u0301

The accent comes from the Combining Diacritical Marks block, which starts at U+0300.

You are correct: é is stored in 2 bytes… as seen in the debugger.

This may explain a lot of strange things I have seen over the last few years!

é (MacRoman) <> é (UTF8) !

Comparing a string typed with the keyboard to a string from f.Name never reports them as equal, even if they are the same (I typed the file name from… the keyboard).

Careful. If what you see is C3 A9, that is ONE UTF-8 character, é (2 bytes total).

If what you see is 65 followed by CC 81, that is TWO UTF-8 characters, the ‘e’ followed by the ‘combining acute accent’ (3 bytes total).
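
For anyone who wants to check those byte values outside the debugger, here is a small Python sketch (not Xojo). The helper name utf8_hex is made up for this example.

```python
def utf8_hex(text: str) -> str:
    """Hypothetical helper: UTF-8 bytes of text as spaced uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


composed = "\u00e9"     # precomposed e with acute accent
decomposed = "e\u0301"  # 'e' followed by the combining acute accent

print(utf8_hex(composed), "code points:", len(composed))
# prints: C3 A9 code points: 1
print(utf8_hex(decomposed), "code points:", len(decomposed))
# prints: 65 CC 81 code points: 2
```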


There is an old Feedback request to allow comparing strings with “canonical equivalence” (that is, normalizing the UTF-8 before comparison), precisely to avoid cases where two strings with the same encoding that look identical to the human eye are considered different.

<https://xojo.com/issue/58838>
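
As a rough picture of what such a comparison would do, here is a Python sketch (not Xojo, and not what the Feedback request specifies). The function name canonically_equal is an assumption for the example, and NFC is just one reasonable choice; normalizing both sides to NFD would work equally well.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


typed = "Sp\u00e9cial"       # precomposed, as a keyboard often produces it
from_file = "Spe\u0301cial"  # decomposed, as a file name may arrive

print(typed == from_file)                   # plain comparison: False
print(canonically_equal(typed, from_file))  # normalized comparison: True
```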
