How to optimize ReplaceAll of pre/decomposed UTF8 Strings? Text is (too) slow.

I have a String that contains either precomposed or decomposed UTF8.
So characters such as Umlaute can be represented in two different ways:

  • U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
  • U+0061 LATIN SMALL LETTER A, U+0308 COMBINING DIAERESIS
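To make the difference concrete, here is a minimal classic-framework sketch that builds both forms and dumps their bytes. It assumes `Encodings.UTF8.Chr` is used to create the characters from code points. The variable names are only illustrative and are not part of the example project.

```xojo
// Build both forms of the same visible character (illustrative names).
Dim precomposed As String = Encodings.UTF8.Chr(&h00E4)       // U+00E4
Dim decomposed As String = "a" + Encodings.UTF8.Chr(&h0308)  // U+0061 + U+0308

// In UTF-8 the two forms are different byte sequences:
// C3 A4 for the precomposed form, 61 CC 88 for the decomposed form.
System.DebugLog(EncodeHex(precomposed, True))
System.DebugLog(EncodeHex(decomposed, True))

// A byte-wise search for one form therefore does not find the other.
System.DebugLog(Str(InStrB(decomposed, precomposed)))
```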

I’m trying to find a way to do quite a few ReplaceAll calls (ä -> ae, ü -> ue, ...) on such a String. I can’t know whether it’s precomposed or decomposed, so it has to work for both.

There are quite a few forum posts here about that. What I understand is that String doesn’t really cope with that. The suggestions are to use Text instead. But… :wink:

Here’s an example project.
It has a mode that contains the Replacements I need/want. It obviously tries to replace much more than what’s in the example String. To see how it behaves when being called often, the example string is being replaced 10’000x. And there are two more replacements for both pre/decomposed UTF8 String values.
The expected result in the example is to get Jürg Räss replaced with Juerg Raess for all 3 Strings.

I’ve tried to do that with both String and Text.

  • String: obviously doesn’t work as expected
  • Text: expected result, but much slower (macOS: 0.09s <> 1.25s) (Windows: 0.14s <> 18.6s)
    While ReplaceAll with Text on macOS is still acceptable (even though 10x slower), it’s unbearably slow on Windows…

I’ve tried a couple of things (you’ll notice the Regex button), but haven’t found a way that:

  • works with both pre/decomposed strings
  • and is fast (especially on TargetWindows -> 18s is far too much for this example)

So… any ideas? What other approaches should be tried (no Plugins, please)?
Anyone curious to try to improve this example (so that it works on all macOS, Windows, Linux)?

Have you tried using ReplaceAllB on Strings, twice, with both binary representations of the Umlauts? After the Replace calls, you’d use DefineEncoding to set it back to UTF8.

You are just replacing one type with String, you need to replace both types.

I added the other ü to your code, and I get this:

Why use ReplaceAllB and not ReplaceAll? (I see that Jürg’s code uses ReplaceAllB.)
I ask this question after reading Joe’s comment.

Did more tests, added the other ä to the “String” test and “Regex” test; now all 3 reports show the same final value, “Juerg Raess”.

Tested on Windows 10 (running in Parallels on a 5.5-year-old MacBook Pro, as remote debug):
String Time: 0.412s
Regex Time: 5.100s
Text Time: 43.515s

The code looks like this:

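    'Telefonbuch (Standard): ä -> ae
    psText = ReplaceAll(psText, "ä", "ae") // precomposed ä (U+00E4)
    psText = ReplaceAll(psText, "ä", "ae") // decomposed ä (a + U+0308)
    psText = ReplaceAll(psText, "ö", "oe")
    psText = ReplaceAll(psText, "ü", "ue") // precomposed ü (U+00FC)
    psText = ReplaceAll(psText, "ü", "ue") // decomposed ü (u + U+0308)

The two ä look the same on screen, but in fact they are different, and the two ü are too.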

Because it needs to work with Case Sensitive, too.
Or have I missed something, and ReplaceAll (in the classic framework) can be used with Case Sensitivity?

That’s why I thought that maybe Regex can handle this (and cope with both type of pre/decomposed Strings in one go). I guess not?

Not yet. But that’s what I’m probably ending up with… I was hoping that there is some way/workaround to “normalize” the String, so that this is not necessary. After all, it will result in 2x the amount of ReplaceAllB’s, which makes the code 2x slower (as in most cases the Strings are precomposed).

But can we agree that using Text and ReplaceAll is not a good idea (even though the result is exactly what’s needed), as this is far too slow? I guess that’s even worth a Feedback for it being so slow on Windows…

Does anyone know how to generate the list? Or is there some reference available somewhere where I can look it up?

I think I didn’t make myself clear. I was able to make it work with String and it was fast, maybe this is more clear (this works too):

    'Telefonbuch (Standard): ä -> ae
    myString = "a" + Chr(&h0308)
    psText = ReplaceAllB(psText, myString, "ae")
    psText = ReplaceAllB(psText, "ä", "ae")
    psText = ReplaceAllB(psText, "ö", "oe")
    myString = "u" + Chr(&h0308)
    psText = ReplaceAllB(psText, myString, "ue")
    psText = ReplaceAllB(psText, "ü", "ue")

I guess it will be slower than your original code, but it does compare with both different cases (it needs to do it with String).

Until the Text option is sped up, your best bet is to use String and double the ReplaceAll calls.

Note: where did I get the “other” ä and ü?
I put a ‘Break’ in the PushButton1 Action after Dim s2, checked the value of s2 at that point, copied the text values for ä and ü, and created the new ReplaceAll calls with those values. Xojo kept the different values: even though you see the same character on screen, it is in fact a different value. With those values the code works with String (s2 is correctly changed to ae, just like with Text).

Edit: Oh, case sensitive, that makes sense. I changed ä to Ä and ü to Ü, and String (with ReplaceAll) changes Ä to ae; if I use ReplaceAllB, then Ä is kept with String.

Edit2: Now I see your case sensitive code, so I changed it to make it work:

    If (pbApplyUpperCases) Then
      myString = "A" + Chr(&h0308)
      psText = ReplaceAllB(psText, myString, "Ae")
      psText = ReplaceAllB(psText, "Ä", "Ae")
      psText = ReplaceAllB(psText, "Ö", "Oe")
      myString = "U" + Chr(&h0308)
      psText = ReplaceAllB(psText, myString, "Ue")
      psText = ReplaceAllB(psText, "Ü", "Ue")
    End If

Now it changes Ä to Ae and Ü to Ue.

Did some tests with newer MacBook Pro, for String, added code for ö and Ö (missing from above):

Windows 10 running in virtualbox:

  • only replacing the precomposed characters (Chr(228), for example) = 0.15s
  • also replacing the decomposed forms (“a” + Chr(&h0308), for a, o, u and A, O, U) = 0.19s

This Mac:

  • original code: 0.09s
  • with extra ReplaceAllB for the other character options: 0.11s

I think this will work for you. No need for Text.

I was thinking about this:

I guess U+0308 is a special code that makes the diaeresis show over the previous character.

So how about using InStr(s2, Chr(&h0308)) (or InStrB) to avoid extra ReplaceAllB? If InStr result is 0 then there are no “combining diaeresis” in the string, so no need to use:

    myString = "A" + Chr(&h0308)
    psText = ReplaceAllB(psText, myString, "Ae")
    // and for O, U,...

Will that help a little or am I still not getting it?
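Put together, the idea could look roughly like the sketch below. It is only an illustration under a few assumptions: the function name is made up, `Encodings.UTF8.Chr` is used instead of plain `Chr` to build the combining character, and the input is assumed to be UTF-8. The `DefineEncoding` call at the end follows the earlier suggestion in this thread.

```xojo
// Hypothetical helper: handle decomposed umlauts only when U+0308 is present.
Function ReplaceDecomposedUmlauts(psText As String) As String
  Dim combining As String = Encodings.UTF8.Chr(&h0308) // COMBINING DIAERESIS
  
  // One cheap byte-wise search first. If the combining diaeresis is absent,
  // none of the decomposed forms can occur, so all replacements are skipped.
  If InStrB(psText, combining) = 0 Then Return psText
  
  psText = ReplaceAllB(psText, "a" + combining, "ae")
  psText = ReplaceAllB(psText, "o" + combining, "oe")
  psText = ReplaceAllB(psText, "u" + combining, "ue")
  psText = ReplaceAllB(psText, "A" + combining, "Ae")
  psText = ReplaceAllB(psText, "O" + combining, "Oe")
  psText = ReplaceAllB(psText, "U" + combining, "Ue")
  
  // ReplaceAllB works on bytes, so mark the result as UTF-8 again.
  Return DefineEncoding(psText, Encodings.UTF8)
End Function
```

For mostly precomposed input this costs one extra search per string instead of six extra replacements. Whether that is measurably faster would need to be timed in the example project.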

Be careful if you use Text and Windows (not only for speed): <https://xojo.com/issue/52144>

Could this bug be the reason that Text is much slower on Windows than on Mac?

You are.
The thing is that there are a ton of these “combining characters”. And I need to look them up first, and combine them all.
But I guess that’s the only option - for now…
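One way to keep such a growing list manageable is a small table of search/replacement pairs that is walked in a loop. A rough sketch, assuming the classic framework and UTF-8 input. The function name and both arrays are hypothetical, and only the diaeresis rows come from this thread; rows for other combining characters would have to be looked up and added.

```xojo
// Hypothetical helper: table-driven replacement of decomposed sequences.
Function ReplaceCombiningSequences(psText As String) As String
  Dim diaeresis As String = Encodings.UTF8.Chr(&h0308) // COMBINING DIAERESIS
  
  // Parallel arrays: findValues(i) is replaced by newValues(i).
  Dim findValues() As String = Array("a" + diaeresis, "o" + diaeresis, "u" + diaeresis)
  Dim newValues() As String = Array("ae", "oe", "ue")
  
  For i As Integer = 0 To findValues.Ubound
    psText = ReplaceAllB(psText, findValues(i), newValues(i))
  Next
  
  // ReplaceAllB works on bytes, so mark the result as UTF-8 again.
  Return DefineEncoding(psText, Encodings.UTF8)
End Function
```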

Text datatype is really not an option. Because of both Speed and Bug(s) :wink:

That’s what I was trying to say, you can “make it work” with String but you need to add a lot more code to test all the “combining characters” that you may have/receive.