I have a String that contains either precomposed or decomposed UTF-8.
So characters such as umlauts can be represented in two different ways:
- U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
- U+0061 LATIN SMALL LETTER A, U+0308 COMBINING DIAERESIS
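To make the two forms concrete, here is a small sketch in Python (used only for illustration; the actual project is Xojo). It shows that both forms render as the same character but differ at the code point and byte level, which is why a plain byte-oriented replace only catches one of them. The helper name `describe` is made up for this example.

```python
import unicodedata

precomposed = "\u00E4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # U+0061 followed by U+0308 COMBINING DIAERESIS

def describe(s):
    """List the code points of s in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(c):04X}" for c in s]

print(describe(precomposed), precomposed.encode("utf-8"))  # ['U+00E4'] b'\xc3\xa4'
print(describe(decomposed), decomposed.encode("utf-8"))    # ['U+0061', 'U+0308'] b'a\xcc\x88'

# A plain comparison treats the two forms as different strings...
print(precomposed == decomposed)                                # False
# ...but they compare equal once both are normalized to the same form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```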
I’m trying to find a way to do quite a few ReplaceAll calls (ä -> ae, ü -> ue, ...) on such a String. I can’t know whether it’s precomposed or decomposed, so it has to work for both.
There are quite a few forum posts here about this. What I understand is that String doesn’t really cope with it. The suggestion is to use Text instead. But…
Here’s an example project.
It has a module that contains the replacements I need. It deliberately tries to replace much more than what’s in the example String. To see how it behaves when called often, the example String is replaced 10,000 times. There are also two more replacements, one each for a precomposed and a decomposed UTF-8 String value.
The expected result in the example is that Jürg Räss is replaced with Juerg Raess for all 3 Strings.
I’ve tried to do that with both String and Text:
- String: obviously doesn’t work as expected
- Text: expected result, but much slower (macOS: 0.09s vs. 1.25s; Windows: 0.14s vs. 18.6s)
While ReplaceAll with Text on macOS is still acceptable (even though it’s more than 10x slower), it’s unbearably slow on Windows, where it is over 100x slower…
I’ve tried a couple of things (you’ll notice the Regex button), but haven’t found a way that:
- works with both pre/decomposed strings
- and is fast (especially on TargetWindows -> 18s is far too much for this example)
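One direction that would meet both requirements, sketched in Python purely to illustrate the idea (not Xojo, and not what the example project does): normalize the input once to the precomposed form (NFC), then apply all replacements in a single pass over the string. The table `REPLACEMENTS` and the function `replace_umlauts` are hypothetical names, and the table is only a small stand-in for the real list. Whether an equivalent normalization step is available in Xojo without plugins is exactly the open question.

```python
import unicodedata

# Hypothetical, shortened replacement table; keys are precomposed code points.
REPLACEMENTS = str.maketrans({
    "\u00E4": "ae",  # ä
    "\u00F6": "oe",  # ö
    "\u00FC": "ue",  # ü
    "\u00C4": "Ae",  # Ä
    "\u00D6": "Oe",  # Ö
    "\u00DC": "Ue",  # Ü
})

def replace_umlauts(s: str) -> str:
    """Normalize to NFC so only one form must be matched, then replace in one pass."""
    composed = unicodedata.normalize("NFC", s)
    return composed.translate(REPLACEMENTS)

# Precomposed and decomposed input give the same result.
print(replace_umlauts("J\u00FCrg R\u00E4ss"))    # Juerg Raess
print(replace_umlauts("Ju\u0308rg Ra\u0308ss"))  # Juerg Raess
```

The point of the design is that the cost of handling both forms is paid once per string (the normalization), instead of once per replacement.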
So… any ideas? What other approaches should be tried (no Plugins, please)?
Is anyone curious enough to try improving this example (so that it works on all of macOS, Windows, and Linux)?