Because good things are a never-ending story, here’s another bug with diacritics.
Our friends in Germany will understand:
System.DebugLog ReplaceAll("Aldi Süed", "ü", "u")
Does nothing. The returned value is the source value.
Contrary to popular belief, the two ü above are different: the first one comes from the Finder, where I created a folder, typed the sentence on my laptops (yes, 2 laptops), pressed Return, and then (and only then) copied it to paste it into the above code.
The second ü was typed in the Xojo IDE, on the same two laptops / same keyboard.
ReplaceAll changed nothing and returned the string unchanged.
I do not want an explanation about why it is behaving like it is. I only want to demonstrate a bug.
What you can tell me is whether ReplaceAll works with two-byte languages (a character coded with two bytes) such as Arabic, Hebrew, Chinese, Japanese, all the Indian languages, Thai, and so on.
Lastly, one of the two laptops runs El Capitan / Xojo 2015r1; the second runs Big Sur 11.3 / Xojo 2021r1.1.
Now, I have to search for a different workaround to do the (kostenlos) job.
Like Kem says … It is a well-known issue that Apple in their perversity have chosen to use decomposed unicode in the file system so the letter ‘ü’ is represented as two code points, one for the ‘u’ and one for the umlaut, rather than one. Normalization restores sanity … Note that the issue is that there are two code points; the number of bytes has nothing to do with it.
Not quite. The umlaut is a combining character, meaning that when rendered it is joined to the previous character. There must be a reason that combining characters are allowed; perhaps it has to do with how some non-European written languages work. Anyway, the effect is that in Unicode, and thus UTF-8 (since one maps directly onto the other), there are two ways to represent ü: as one precomposed code point, or as a plain u followed by the combining diaeresis.
So this is not Apple’s take, it’s Unicode’s take. It was Apple’s choice of which to use; I’d have used the other.
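To make the two representations concrete, here is a minimal sketch in Python (not Xojo; it is only meant to illustrate the Unicode side of things, and the helper name describe is made up for this example). It builds both forms of ü by hand and shows why a plain search-and-replace for the precomposed form leaves a Finder-style string untouched.

```python
import unicodedata

# Two ways to write the same visible character:
composed = "\u00fc"      # one code point: U+00FC
decomposed = "u\u0308"   # two code points: u + COMBINING DIAERESIS (U+0308)

def describe(s):
    """List the code points in s with their Unicode names (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]

# A name as the Finder would hand it over (decomposed form, assumed here).
finder_name = "Aldi S" + decomposed + "ed"

# Searching for the precomposed ü finds nothing, so the string is unchanged.
result = finder_name.replace(composed, "u")

print(describe(composed))
print(describe(decomposed))
print(result == finder_name)
```

Note that the precomposed ü already takes two bytes in UTF-8 and the decomposed one takes three, so the mismatch really is about code points and not about byte counts.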
From what I tested, if you copy the ü from Finder it uses the combining character (which puts the diaeresis back on the u), but if you type the same ü in Notepad it uses the single u-with-diaeresis character. So Apple decided to use one way for the file system and another for the apps.
I remember that decades ago, some linguists were clamoring for diacritics getting their own code points in Unicode so you could freely combine base characters and diacritics. Unicode allows this but there is rarely a need to use this feature – certainly not when long established umlauted characters like ä, ü, or ö are involved. And nobody was forcing Apple to go for a decomposed representation for the file system and use a composed alternative elsewhere.
Text indeed is deprecated. However, it is the only way in Xojo to convert combined accented characters into unique diacritic characters. Until String truly equates to Text, if ever, using Text in a method to convert combined characters to unique characters is the best solution.
The other way would be to use a replacement table, where u followed by the combining ¨ is replaced by ü and so on. A series of ReplaceAll…
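As a rough sketch of that replacement-table idea, here it is in Python rather than Xojo (the table and the function name compose_umlauts are made up for illustration, and the table only covers the German umlauts). In Xojo the loop body would be the series of ReplaceAll calls.

```python
# Hypothetical table: decomposed sequence -> precomposed character.
# Only the German umlauts are listed; a real table would need more entries.
COMPOSE_TABLE = {
    "a\u0308": "\u00e4",  # ä
    "o\u0308": "\u00f6",  # ö
    "u\u0308": "\u00fc",  # ü
    "A\u0308": "\u00c4",  # Ä
    "O\u0308": "\u00d6",  # Ö
    "U\u0308": "\u00dc",  # Ü
}

def compose_umlauts(s):
    """Replace each decomposed umlaut in s with its single-code-point form."""
    for sequence, single in COMPOSE_TABLE.items():
        s = s.replace(sequence, single)
    return s

# After composing, a replace on the precomposed ü finally matches.
fixed = compose_umlauts("Aldi Su\u0308ed").replace("\u00fc", "u")
print(fixed)
```

The obvious drawback is that every accented character you care about needs its own entry in the table.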
Until (if) that is addressed there are other solutions. Both MacOSLib and MBS have methods to normalize UTF-8, and above Kem has posted a link to his own free native Xojo module M_String to do the same.
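For anyone wondering what normalizing actually does, here is a minimal sketch using Python's standard unicodedata module. This is not the MacOSLib, MBS or M_String API (their method names are not shown here), and the function names normalize_nfc and strip_diacritics are made up for this example. NFC composes u + combining diaeresis into the single ü; NFD does the opposite, which also gives a simple way to drop diacritics altogether.

```python
import unicodedata

def normalize_nfc(s):
    """Return s in composed form (NFC): u + U+0308 becomes U+00FC."""
    return unicodedata.normalize("NFC", s)

def strip_diacritics(s):
    """Decompose s (NFD), then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

finder_name = "Aldi Su\u0308ed"   # decomposed, as copied from the Finder (assumed)
typed_name = "Aldi S\u00fced"     # precomposed, as typed in an editor (assumed)

# After NFC both strings are identical, so a replace on ü works for either.
print(normalize_nfc(finder_name) == normalize_nfc(typed_name))
print(normalize_nfc(finder_name).replace("\u00fc", "u"))

# Or skip the replace and strip the marks directly, whatever the input form.
print(strip_diacritics(finder_name), strip_diacritics(typed_name))
```

Normalizing both the source string and the search string to the same form before calling a replace is the general fix; stripping combining marks is only appropriate when you really want plain ASCII-like letters.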