Stripping accents from my string

Emile_Schwarz · March 5, 2021, 12:24am

In a project, I batch process (resize) images and save them.

So far, so good.

But there are accented characters in their file names and I want to remove them (the image files are used in an HTML document) in a DESKTOP application.

As an example, the code below returns Spe?cial when I have Spécial:

ConvertEncoding("Spécial", Encodings.ASCII)

What I want to get is Special.

A search on the Internet led nowhere… (except to a waste of time).

Kem_Tekinay · March 5, 2021, 1:58am

I get “Special”, so I can’t reproduce your results.

var s as string = "Spécial"
s = s.ConvertEncoding( Encodings.ASCII ).ConvertEncoding( Encodings.UTF8 )
// Special

Robert_Weaver · March 5, 2021, 2:44am

You’ll probably get different results depending on whether the ‘é’ is a single precomposed character U+00E9 or a decomposed sequence consisting of a regular ‘e’ U+0065 followed by the combining acute accent U+0301.

Emile_Schwarz · March 5, 2021, 4:03am

I do not know; I typed it with the keyboard.

Emile_Schwarz · March 5, 2021, 4:07am

Thank you Kem,

Probably a bug. I need to rest; I will explore this in the morning.

Tim_Hare · March 5, 2021, 6:43am

I hate it when my strings speak in a strange accent. (Sorry, couldn’t help it.)

Jean-Yves_Pochez · March 5, 2021, 9:11am

Emile, you should go live in the USA; you would be more in phase with the sun!

Emile_Schwarz · March 5, 2021, 4:49pm

No sun today in Strasbourg (Europe).

Yes, it was late in the night; in fact I slept from 6 to 13…

… and I totally forgot about that problem

Christian_Schmitz · March 5, 2021, 5:18pm

See also RemoveAccentsMBS function in MBS Xojo Plugins.

Emile_Schwarz · March 6, 2021, 9:51pm

For: Spécial Le Fantôme

I get:
Spe?cial_Le_Fanto^me

I replaced spaces with underscores.
Used code:

TF_Character.Text = Char_Name.ConvertEncoding( Encodings.ASCII ).ConvertEncoding( Encodings.UTF8 )

Used Xojo: 2015r1 AND 2020r2 (I have not downloaded Xojo 2.1 yet).
Also tested with High Sierra.

I may modify my code to ask the user for the data I need (it's late; I do not recall what other modifications I need to make nor how I will do them).

Last information: the project takes Char_Name from a file (or folder) name (I created it).

Rick_Araujo · March 6, 2021, 10:36pm

It seems that ConvertEncoding does not take decomposed characters into account. Unicode has two ways to represent the same character: the precomposed “é”, or the decomposed “e” followed by a combining code point that adds “´” to the preceding character, which seems to be the case Emile describes above. So Xojo would need something like

s = s.ConvertComposite().ConvertEncoding( Encodings.ASCII ).ConvertEncoding( Encodings.UTF8 )

The hypothetical ConvertComposite() would find decomposed sequences and convert them to the precomposed characters to avoid such problems.
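For readers who want to see the idea in action, here is a minimal sketch in Python rather than Xojo, since Python's standard `unicodedata` module has normalization built in. The helper names `compose` and `strip_accents` are made up for illustration; `compose` plays the role of the suggested ConvertComposite(), and `strip_accents` goes the other way (decompose, then drop the combining marks), which gives the accent-free result the original question asks for.

```python
import unicodedata


def compose(text: str) -> str:
    """Hypothetical stand-in for the ConvertComposite() idea: NFC composition."""
    return unicodedata.normalize("NFC", text)


def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose with NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


typed = "Sp\u00e9cial Le Fant\u00f4me"        # precomposed é and ô
from_file = "Spe\u0301cial Le Fanto\u0302me"  # e and o followed by combining marks

print(compose(from_file) == typed)   # True
print(strip_accents(typed))          # Special Le Fantome
print(strip_accents(from_file))      # Special Le Fantome
```

Stripping after decomposition handles both forms of the input, so it does not matter whether the name was typed or read from a file.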

Jim_Meyer · March 6, 2021, 11:33pm

I had this code lying around for “normalizing” a string on Mac… It might help:

#if targetMacOS
' Make a mutable copy, because CFStringNormalize modifies the string in place
soft declare function CFStringCreateMutableCopy lib "Carbon.framework" (alloc as Ptr, maxLength as UInt32, theString as CFStringRef) as CFStringRef

dim mutableStringRef as CFStringRef = CFStringCreateMutableCopy(nil, 0, s)

soft declare sub CFStringNormalize lib "Carbon.framework" (theString as CFStringRef, theForm as UInt32)

' "form" is one of the CFStringNormalizationForm values listed below
CFStringNormalize mutableStringRef, form
return mutableStringRef
#else
' No normalization on other platforms: return the string unchanged
return s
#endif

'enum CFStringNormalizationForm {
'kCFStringNormalizationFormD = 0,
'kCFStringNormalizationFormKD = 1,
'kCFStringNormalizationFormC = 2,
'kCFStringNormalizationFormKC = 3
'}

where “s” is the string and “form” is UInt32

Emile_Schwarz · March 7, 2021, 12:58am

It works if it is a single é typed from the keyboard…

Sorry Kem.

Taking a nap helps, sometimes.

Tim_Hare · March 7, 2021, 3:14am

Is the encoding defined? Otherwise ConvertEncoding is bound to fail.

Emile_Schwarz · March 7, 2021, 6:27am

Good question. The answer is: I do not know. I get the data using Char_Name = f.Name.

Is there an encoding in that case ?
UTF8 ?

Peter_Koren · March 7, 2021, 12:20pm

RemoveAccentsMBS works great. It can remove accents from all Slovak and Czech characters correctly.

Rick_Araujo · March 7, 2021, 3:15pm

As I said, the same accented character can have two different code point sequences: a single code point for the accented character (composite), or the base character followed by an “add this accent to the previous character” code point (decomposed), two code points in total. The second UTF-8 sequence seems to be the problem Emile is seeing.

“é” can be &u00e9 or
“e+´” → é → &u0065+&u0301

See Unicode block U+0300–U+036F, Combining Diacritical Marks.

Emile_Schwarz · March 8, 2021, 10:37am

You are correct: é is stored in 2 bytes… as seen in the debugger.

This may explain a lot of strange things I have seen over the last few years!

é (MacRoman) <> é (UTF8) !

Comparing a string typed with the keyboard to a string from f.Name never reports them as equal, even if they look the same (I typed the file name from… the keyboard).

TimStreater · March 8, 2021, 12:27pm

Careful. If what you see is C3 A9, that is ONE UTF-8 character, é (2 bytes total).

If what you see is 65 followed by CC 81, that is TWO UTF-8 characters, the ‘e’ followed by the ‘combining acute accent’ (3 bytes total).
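The two byte patterns can be checked outside the debugger. A small sketch, again in Python for illustration only (the helper name `utf8_hex` is made up here), prints the UTF-8 bytes of each form:

```python
def utf8_hex(text: str) -> str:
    """Hypothetical helper: UTF-8 bytes of text as space-separated uppercase hex."""
    return text.encode("utf-8").hex(" ").upper()


precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by the combining acute accent

print(utf8_hex(precomposed))  # C3 A9
print(utf8_hex(decomposed))   # 65 CC 81
print(len(precomposed), len(decomposed))  # 1 2 (code points, not bytes)
```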

Jonathan_Ashwell · March 8, 2021, 12:51pm

There is an old Feedback request to allow comparing strings with “canonical equivalence” (that is, normalizing the UTF-8 before comparison) to avoid exactly those cases where two strings with the same encoding that look identical to the human eye are considered different.

<https://xojo.com/issue/58838>
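Until something like that exists in Xojo, the comparison amounts to normalizing both sides to the same form first. A minimal sketch of the idea in Python (not Xojo; `canonically_equal` is a made-up name, and NFC is just one reasonable choice of common form):

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


typed = "Sp\u00e9cial"        # precomposed é, as typed on the keyboard
from_file = "Spe\u0301cial"   # decomposed é, as a file name may store it

print(typed == from_file)                   # False: different code points
print(canonically_equal(typed, from_file))  # True: same text once normalized
```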