Fast character access of a string

Rainer_Hofmann · September 3, 2021, 7:59pm

Thank you very much! However, how do I use code tags in the forum?

Kem_Tekinay · September 3, 2021, 8:04pm

There is an icon in the toolbar, but the easiest way is to use three backticks on their own lines before and after the code.

Rainer_Hofmann · September 3, 2021, 8:05pm

Ok, thanks!

Christian_Schmitz · September 3, 2021, 8:59pm

Using a Ptr instead of a MemoryBlock will be even faster,
e.g. a Ptr pointing to the bytes in a MemoryBlock.

But using string + array is safer, as a wrong index with a Ptr can easily crash the app.

MarkusR · September 4, 2021, 5:17am

Can you tell us what the task is with this 5,000,000-row CSV file?
Maybe you can import the file into an SQLite database first.
I remember people writing that it is very fast.

https://www.sqlite.org/download.html
https://www.sqlite.org/cli.html
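
A minimal sketch of the import idea, in Python rather than Xojo and only to show the shape of it. The table name `rows`, the helper `import_csv` and the sample data are all made up for illustration; the real file layout is not known from this thread.

```python
import csv
import io
import sqlite3

def import_csv(conn, csv_file, table="rows"):
    """Create a table from the CSV header row and bulk-insert the remaining rows."""
    reader = csv.reader(csv_file)
    header = next(reader)  # assumption: the first row holds the column names
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    # executemany consumes the reader lazily, so the file is never fully in memory
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()

# Tiny in-memory stand-in for the real CSV file
sample = io.StringIO("id,name\n1,Zoë\n2,Müller\n")
conn = sqlite3.connect(":memory:")
import_csv(conn, sample)
count = conn.execute('SELECT COUNT(*) FROM "rows"').fetchone()[0]
```

Once the data is in a table, lookups and filters can be left to SQL instead of being done with string handling on every line.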

Arnaud_N · September 4, 2021, 11:53am

It depends. This character isn’t available on my layout, so the button is preferable. And who knows how many layouts don’t have it?

Emile_Schwarz · September 4, 2021, 1:32pm

backticks = alt-shift-1 in my fr macOS…

Arnaud_N · September 4, 2021, 2:08pm

Well, I don’t think it’s worth enumerating all the existing layouts, as I rather think this character isn’t part of the “standard set”. But I can be wrong, of course.

Jean-Yves_Pochez · September 4, 2021, 2:57pm

Emile, you have it directly under the £ key on your French keyboard!

Rainer_Hofmann · September 4, 2021, 7:27pm

Sorry, I was not allowed to write more messages on my first day in the forum. By the way, performance is better when using String.FromArray, but unfortunately only by about 4.1%. I was hoping for more, because your explanation makes a lot of sense to me. However, I guess the gain should be much larger if the strings are much longer.
Thank you! There have been a lot of good suggestions. I will play with it a little and see what I can do.

Rainer_Hofmann · September 5, 2021, 1:20pm

Now I have had time to look into more performance optimisation. My goal was to avoid string concatenation as much as possible. By using String.IndexOf to find the right positions and only afterwards building new strings where needed, I got code that is 6 times faster than my first solution.

Thanks a lot to everybody who was interested in helping me improve my first steps in Xojo.
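
The actual code is not shown in this thread, so the following is only a guess at the general shape of such an approach, written in Python instead of Xojo. `str.find` stands in for String.IndexOf, and the helper `nth_field` is hypothetical.

```python
def nth_field(line: str, n: int, sep: str = ",") -> str:
    """Return the n-th (0-based) field of line, creating only one new string."""
    start = 0
    # Skip over the first n separators by searching, not by building substrings
    for _ in range(n):
        pos = line.find(sep, start)
        if pos == -1:
            raise IndexError("line has fewer fields than requested")
        start = pos + len(sep)
    end = line.find(sep, start)
    if end == -1:
        end = len(line)  # last field runs to the end of the line
    return line[start:end]

second = nth_field("a,bb,ccc", 1)
```

The point of the pattern is that only positions are tracked while scanning, and a new string is created once, at the very end.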

Christian_Schmitz · September 5, 2021, 1:33pm

I wrote you a blog post:

Optimizing a Xojo function

Rainer_Hofmann · September 5, 2021, 2:08pm

Nice work, Christian, and very helpful!
As I already wrote, I have looked into a different approach that simply avoids as many string concatenations as possible. That way, I do not have to take care of UTF-8 strings either.

My first attempt used String.FromArray for concatenation, but the timings improved by only a small percentage.

Mike_D · September 5, 2021, 2:11pm

Of note, Christian’s tests showed a potential 24000x (yes, twenty-four-thousand fold) speedup when using Ptr compared to using the Unicode-aware Xojo string functions such as Middle().

I wonder how much of this is overhead in the libraries that Xojo is using, vs. overhead in the Xojo framework itself?

Rainer_Hofmann · September 5, 2021, 2:19pm

Some kind of pointer access was the first thing I looked for, because I know similar operations from other languages, like string[ptr]. That is really fast, but it needs additional effort to handle Unicode characters of different byte lengths.

Mike_D · September 5, 2021, 2:28pm

Not necessarily. If you are splitting strings, and the separator is always ASCII, then the beauty of UTF-8 is that you can just search for the separators as UInt8 (byte) values, and not worry about whether the strings between the separators are Unicode or not. It “just works”.
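
A small sketch of that idea, in Python rather than Xojo, with a made-up helper `split_bytes` and sample data. It compares raw bytes against the comma byte (0x2C) and never decodes the text while scanning.

```python
def split_bytes(data: bytes, sep: int = 0x2C) -> list:
    """Split UTF-8 encoded bytes on a single ASCII separator byte (default: comma)."""
    fields = []
    start = 0
    for i, b in enumerate(data):
        if b == sep:  # plain byte comparison, no Unicode handling needed
            fields.append(data[start:i])
            start = i + 1
    fields.append(data[start:])
    return fields

line = "Zoë,Müller,東京".encode("utf-8")
fields = [f.decode("utf-8") for f in split_bytes(line)]
```

Each field is decoded only after the split, so the multi-byte characters inside the fields come through untouched.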

TimStreater · September 5, 2021, 2:41pm

Very interesting results. I moved to using split() to make an array of characters when looking through a text with a view to replacing some items. That ended up being fast enough for me in practice.

Your complete set of tests is a useful reference - thanks for that.

Rainer_Hofmann · September 5, 2021, 2:44pm

However, I guess that only works if the second or third byte of a UTF-8 sequence does not happen to be the same as the delimiter you are using.

Mike_D · September 5, 2021, 2:48pm

UTF8 guarantees it is not. (The short answer: all extra bytes have the high bit set, so are in the range 128…255, thus outside the ASCII character set)
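
For anyone who wants to see it rather than take it on trust, here is a brute-force check in Python (not Xojo). The helper name `multibyte_bytes_below_128` is made up; it encodes every Unicode code point and counts bytes below 128 that appear inside a multi-byte sequence.

```python
def multibyte_bytes_below_128(code_points) -> int:
    """Count bytes < 0x80 that occur inside multi-byte UTF-8 sequences."""
    count = 0
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid on their own and cannot be encoded
        encoded = chr(cp).encode("utf-8")
        if len(encoded) > 1:
            count += sum(1 for b in encoded if b < 0x80)
    return count

# Check every code point from U+0000 to U+10FFFF
offending = multibyte_bytes_below_128(range(0x110000))

# Example: the euro sign is three bytes, all of them >= 128
euro = list("€".encode("utf-8"))  # [226, 130, 172]
```

`offending` comes out as 0, which is why an ASCII delimiter byte can never be mistaken for part of a longer character.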

TimStreater · September 5, 2021, 2:48pm

That’s not possible. See:

and look at the Encoding section.