More about String with diacritics

Because good things are a never-ending story, here’s another bug with diacritics.

Our friends in Germany will understand:

System.DebugLog ReplaceAll("Aldi Süed", "ü", "u")

Does nothing. The returned value is the source value.

Contrary to popular belief, the two ü above are different: the first one comes from the Finder, where I created a folder, typed the name on my laptops (yes, two laptops), pressed Return, and then (and only then) copied it and pasted it into the code above.
The second ü was typed in the Xojo IDE, on the same two laptops with the same keyboard.

ReplaceAll changed nothing and returned the string unchanged.

I do not want an explanation about why it is behaving like it is. I only want to demonstrate a bug.

What you can tell me is whether ReplaceAll works with two-byte languages (a character coded with two bytes) such as Arabic, Hebrew, Chinese, Japanese, all the Indian languages, Thai, and so on.

Lastly, one of the two laptops runs El Capitan / Xojo 2015r1; the second runs Big Sur 11.3 / Xojo 2021r1.1.

Now I have to look for a different workaround to do the (kostenlos, i.e. free) job.

Normalize the string as Composed before replacing it. You will find Unicode Normalization in my M_String project.
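
To illustrate the advice above, here is a minimal sketch in Python rather than Xojo. It does not use M_String, whose API is not shown in this thread; the helper name `replace_all_normalized` is hypothetical, and Python's standard `unicodedata.normalize` stands in for whatever normalization routine you use in Xojo.

```python
import unicodedata

def replace_all_normalized(source: str, old: str, new: str) -> str:
    """Normalize both strings to composed form (NFC), then replace."""
    composed_source = unicodedata.normalize("NFC", source)
    composed_old = unicodedata.normalize("NFC", old)
    return composed_source.replace(composed_old, new)

# Name as copied from the Finder: "u" followed by U+0308 (combining diaeresis).
finder_name = "Aldi Su\u0308ed"
# The ü as typed in an editor: the single code point U+00FC.
typed_umlaut = "\u00fc"

# A plain replace finds nothing, because the code point sequences differ.
plain_result = finder_name.replace(typed_umlaut, "u")

# After normalization both sides use U+00FC, so the replace succeeds.
fixed_result = replace_all_normalized(finder_name, typed_umlaut, "u")

print(plain_result == finder_name)  # True: string returned unchanged
print(fixed_result)                 # Aldi Sued
```

Normalizing the search string as well as the source means the call works no matter which form either one arrives in.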

As Kem says … It is a well-known issue that Apple, in their perversity, chose to use decomposed Unicode in the file system, so the letter ‘ü’ is represented as two code points, one for the ‘u’ and one for the umlaut, rather than one. Normalization restores sanity … Note that the issue is that there are two code points; the number of bytes has nothing to do with it.

Thank you Kem.

You mean that Aldi Süed is in reality Aldi Su¨ed?

AppleScript command I ran in the Script Editor application with “Aldi Süed” copied from a folder in the Finder:
return clipboard info

Returned values:
{{«class utf8», 11}, {«class ut16», 22}, {string, 9}, {Unicode text, 20}}
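
These sizes are consistent with a decomposed ü on the clipboard. The Python sketch below reproduces the arithmetic. Which encoding each AppleScript class corresponds to, and reading the 22 bytes as UTF-16 plus a 2-byte byte order mark, are assumptions made for illustration, not something confirmed in this thread.

```python
# "Aldi Süed" with a decomposed ü: 10 code points ("u" + U+0308).
decomposed = "Aldi Su\u0308ed"
# The same name with a precomposed ü: 9 code points (U+00FC).
composed = "Aldi S\u00fced"

# 9 ASCII bytes + 2 bytes for U+0308.
utf8_size = len(decomposed.encode("utf-8"))

# 10 code units of 2 bytes each, without a byte order mark.
utf16_size = len(decomposed.encode("utf-16-be"))

# Python's "utf-16" codec prepends a 2-byte byte order mark.
utf16_bom_size = len(decomposed.encode("utf-16"))

# A legacy one-byte-per-character encoding can only hold the composed form.
mac_roman_size = len(composed.encode("mac_roman"))

print(utf8_size, utf16_bom_size, mac_roman_size, utf16_size)  # 11 22 9 20
```

With a precomposed ü the UTF-8 size would be 10 bytes (8 ASCII bytes plus 2 for U+00FC), not 11.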

With the JetBrains Mono font and good eyes (zoomed in, for me), you can see that the two ü are not the same:


The one from Finder uses a “u” followed by a “combining diaeresis” character (U+0308); the one typed in Xojo (or in another Apple app such as Notes) is “latin small letter u with diaeresis” (U+00FC).
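
If your eyes or your font are not up to it, you can list the code points instead. A small Python sketch (the helper name `describe` is made up for this example):

```python
import unicodedata

def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

finder_u = "u\u0308"  # as copied from a Finder folder name
typed_u = "\u00fc"    # as typed in an editor

print(describe(finder_u))
# ['U+0075 LATIN SMALL LETTER U', 'U+0308 COMBINING DIAERESIS']
print(describe(typed_u))
# ['U+00FC LATIN SMALL LETTER U WITH DIAERESIS']
```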

That’s Apple’s take, yes. While nearly everyone else thinks that treating ‘ü’ as a single code point makes more sense.

Not quite. The umlaut is combining, meaning that when rendered it is joined to the previous character. There must be a reason that combining characters are allowed, but perhaps it has to do with how some non-European written languages work. Anyway, the effect is that in Unicode, and thus UTF-8 (since one maps directly onto the other), there are two ways to represent ü.

So this is not Apple’s take, it’s Unicode’s take. It was Apple’s choice of which to use; I’d have used the other.

From what I tested, if you copy the ü from Finder it uses the combining character (which puts the diaeresis back onto the u), but if you type the same ü in Notepad it uses the u with diaeresis. So Apple decided to use one way for the file system and another for the apps.

A while ago, I verified that the Text type takes either combined or native accented characters. So it can be used to normalize a string containing either form of diacritics.

I remember that decades ago, some linguists were clamoring for diacritics getting their own code points in Unicode so you could freely combine base characters and diacritics. Unicode allows this but there is rarely a need to use this feature – certainly not when long established umlauted characters like ä, ü, or ö are involved. And nobody was forcing Apple to go for a decomposed representation for the file system and use a composed alternative elsewhere.

There’s a Feedback that would go a long way to making this issue moot in Xojo.

<https://xojo.com/issue/58838>

Indeed, String does not treat composed diacritics the same as accented characters. But as I posted above, Text does treat both the same way, effectively making the issue moot.

I’ve deprecated Text. Taken it out of my app entirely.

Ditto.

This simple test was rejected:

Dim Item_Name As Text // Was: String

Item_Name = f.Name
foo = Left(Item_Name,5)

?

Also:

https://documentation.xojo.com/api/deprecated/text.html says:
This item was deprecated in version 2021r1.
Please use String as a replacement.

It’s not that it bothers me, but…

Text is indeed deprecated. However, it is the only way in Xojo to convert combined accented characters into single precomposed characters. Until String truly matches Text (if ever), using Text in a method to convert combined characters to single ones is the best solution.

The other way would be to use a replacement table, where u¨ (a u followed by the combining diaeresis) is replaced by ü, and so on. A series of ReplaceAll…
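
A rough sketch of such a replacement table, in Python rather than Xojo. The table name, its contents and the function name are illustrative assumptions; it covers only the German umlauts, so any other combining sequence passes through untouched.

```python
# Hypothetical table: decomposed sequence -> precomposed character.
COMPOSE_TABLE = {
    "a\u0308": "\u00e4",  # ä
    "o\u0308": "\u00f6",  # ö
    "u\u0308": "\u00fc",  # ü
    "A\u0308": "\u00c4",  # Ä
    "O\u0308": "\u00d6",  # Ö
    "U\u0308": "\u00dc",  # Ü
}

def compose_with_table(text: str) -> str:
    """Apply one replace per table entry, like a series of ReplaceAll calls."""
    for decomposed, composed in COMPOSE_TABLE.items():
        text = text.replace(decomposed, composed)
    return text

finder_name = "Aldi Su\u0308ed"           # decomposed, as copied from Finder
composed_name = compose_with_table(finder_name)
result = composed_name.replace("\u00fc", "u")

print(len(finder_name), len(composed_name))  # 10 9
print(result)                                # Aldi Sued
```

The table has to list every pair you expect to meet, which is why full Unicode normalization is usually the more complete option.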

Hence, https://xojo.com/issue/58838

Until (if) that is addressed there are other solutions. Both MacOSLib and MBS have methods to normalize UTF-8, and above Kem has posted a link to his own free native Xojo module M_String to do the same.
