I have a String that contains either precomposed or decomposed UTF-8.
So characters such as umlauts can be represented in two different ways:
- U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
- U+0061 LATIN SMALL LETTER A, U+0308 COMBINING DIAERESIS
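To make the difference concrete, here is a minimal sketch. It is written in Python rather than Xojo, only because Python's standard library exposes Unicode normalization directly; the variable names are purely illustrative.

```python
import unicodedata

precomposed = "\u00e4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # U+0061 followed by U+0308 COMBINING DIAERESIS

# Both look identical on screen, but they are different code point sequences...
same_codepoints = precomposed == decomposed

# ...and therefore different UTF-8 byte sequences.
precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# Normalizing to NFC composes the pair into the single precomposed code point.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

A plain byte-wise or code-point-wise search for one form will therefore never match the other form unless the text is normalized first.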
I’m trying to find a way to do quite a few ReplaceAll calls (ä -> ae, ü -> ue, ...) on such a String. I can’t know whether it’s precomposed or decomposed, so it has to work for both.
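As an illustration of the requirement (not a Xojo solution), the following Python sketch normalizes the input to NFC once and then applies plain replacements. The `REPLACEMENTS` table and the `transliterate` function are hypothetical names, and the table only contains the two mappings mentioned above.

```python
import unicodedata

# Hypothetical replacement table; keys are precomposed (NFC) characters.
REPLACEMENTS = {
    "\u00e4": "ae",  # a with diaeresis
    "\u00fc": "ue",  # u with diaeresis
}

def transliterate(s: str) -> str:
    """Return s with umlauts replaced, for precomposed or decomposed input."""
    # Compose base letter + combining mark into single code points first,
    # so one table of precomposed keys covers both input forms.
    result = unicodedata.normalize("NFC", s)
    for needle, replacement in REPLACEMENTS.items():
        result = result.replace(needle, replacement)
    return result

precomposed_result = transliterate("J\u00fcrg R\u00e4ss")
decomposed_result = transliterate("Ju\u0308rg Ra\u0308ss")
```

Both calls produce the same output, because the decomposed input is composed before any replacement runs.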
There are quite a few forum posts here about that. What I understand is that String doesn’t really cope with this, and the suggestion is to use Text instead. But…
Here’s an example project.
It has a module that contains the replacements I need/want. It obviously tries to replace much more than what’s in the example String. To see how it behaves when called often, the example String is replaced 10’000 times. And there are two more replacements for both pre/decomposed UTF-8 String values.
The expected result in the example is to get Jürg Räss replaced with Juerg Raess for all 3 Strings.
I’ve tried to do that with both:
- String: obviously doesn’t work as expected
- Text: gives the expected result, but is much slower (macOS: 0.09s with String vs. 1.25s with Text; Windows: 0.14s vs. 18.6s)
While ReplaceAll with Text on macOS is still acceptable (even though it’s more than 10x slower), it’s unbearably slow on Windows…
I’ve tried a couple of things (you’ll notice the Regex button), but haven’t found a way that:
- works with both pre/decomposed strings
- and is fast (especially on TargetWindows -> 18s is far too much for this example)
So… any ideas? What other approaches should be tried (no Plugins, please)?
Anyone curious to try to improve this example (so that it works on all of macOS, Windows, and Linux)?