Fast character access of a string

kevin_g · September 6, 2021, 10:33am

One thing to take into consideration is that the characters iterator handles Unicode characters differently to Middle which will naturally make it slower.

Rainer_Hofmann · September 6, 2021, 10:59am

I have added 3 more versions:

Test 1: for i = 0 to u ; source.middle(i,1)
Fields=2000
Took 0.781 seconds

Test 2: for each ch in source.characters
Fields=2000
Took 33.857 seconds

Test 3: for i = 0 to u ; ch = MiddleBytes(i,1)
Fields=2000
Took 0.019 seconds

Test 4: Pointer
Fields=2000
Took 0.017 seconds

Test 5: while IndexOf<>-1
Fields=2000
Took 0.183 seconds

Test 6: while IndexOfBytes<>-1
Fields=2000
Took 0.002 seconds

What really surprised me is that IndexOfBytes seems to be even faster than using a pointer.

Rainer_Hofmann · September 6, 2021, 12:17pm

But the much more interesting part of the story is how the timings change with using more characters (fields, which are 10 characters with a following space character). I have tested the different versions with 1000, 2000, 3000 and 4000 fields.

Test1(source.middle): 0.201 sec (1000), 0.778 sec (2000), 1.746 sec (3000), 3.080 sec (4000)

Test2(for each): 8.613 sec (1000), 33.760 sec (2000), 75.594 sec (3000), 134.049 sec (4000)

Test3(MiddleBytes): 0.016 sec (1000), 0.030 seco (2000), 0.027 sec (3000), 0.048 sec (4000)

Test4(Pointer): 0.006 sec (1000), 0.010 sec (2000), 0.015 seco (3000), 0.030 sec (4000)

Test5(IndexOf): 0.060 sec (1000), 0.182 sec (2000), 0.421 sec (3000), 0.724 sec (4000)

Test6(IndexOfBytes): 0.001 sec (1000), 0.002 sec (2000), 0.003 sec (3000), 0.004 sec (4000)

So, there is a huge difference if testing such functionality with a small test sample compared to a big one!

Mike_D · September 6, 2021, 2:36pm

Ok, there are some gigantic OS differences here:

on macOS Intel and M1 : “for each ch in string.characters” is about 50x faster than String.Middle
on Windows (x64): “for each ch in string.characters” is 34x SLOWER than String.Middle

Reported as <https://xojo.com/issue/65723>
65723 - String.Characters Iterator is 1000x slower on Windows than macOS

I don’t have a Linux system to test with, but @Rainer_Hofmann please add your comments to the feedback case if you can. Thanks!

Rainer_Hofmann · September 6, 2021, 2:49pm

Ok, I will verify this at home on my m1 mac and wil comment on the feedback case afterwards.

Rainer_Hofmann · September 6, 2021, 4:59pm

Here are my results from the M1:
Test 1: for i = 0 to u ; source.middle(i,1)

Fields=4000

Took 1.236 seconds

Test 2: for each ch in source.characters

Fields=4000

Took 0.071 seconds

Test 3: for i = 0 to u ; ch = MiddleBytes(i,1)

Fields=4000

Took 0.031 seconds

Test 4: Pointer

Fields=4000

Took 0.020 seconds

Test 5: while IndexOf<>-1

Fields=4000

Took 0.188 seconds

Test 6: while IndexOfBytes<>-1

Fields=4000

Took 0.002 seconds

You are right, there is a behavior difference compared to the Intel Linux version.

Christian_Schmitz · September 6, 2021, 5:14pm

I may make an updated blog post.
CountFields is faster for me than the loop with MiddleBytes, but slower than MemoryBlock stuff.
So I’ll change my CountOccurrences to use CountFields instead.

Rainer_Hofmann · September 6, 2021, 5:54pm

Christian, CountFields is also a very good and very fast function. But only if you only want to count the characters. The other functions give you the positions of these characters as well.

Maybe, you want look at the project file which I have attached to [https://xojo.com/issue/65723](https://xojo.com/issue/65723)

Christian_Schmitz · September 6, 2021, 6:25pm

It depends on what the goal is.

For me it’s just a bit of benchmarking.

Rainer_Hofmann · September 6, 2021, 6:25pm

Christian, I also have to say that in my example the pointer function is not as fast as the IndexOfBytes function, on both platforms, Linux and Mac M1. Maybe I did something wrong because I am new to Xojo, but I doubt it.

Thomas_Tempelmann · September 7, 2021, 8:10am

Well, that test shows exponential timing. Which means that the iterator is badly implemented.

Ideally, it should remember the last pointer it accessed and then continue from there on the next iteration. But instead, it always starts from the beginning of the string. That means that the implementation of the iterator remembers just the index, and then calls internally a function that needs to find the character by index - which takes more time the longer the string (and index) gets. Had the iterator remembered the pointer into the string it goes over, then the time for increasing the string size would remain linear.

Yep, badly implemented. Not that I’d have expected anything else.

ChristopheDV · September 7, 2021, 8:40am

Time for Xojo to tweak things in the framework, I guess.

Rainer_Hofmann · September 7, 2021, 5:25pm

Thomas, ok, you described the problem well but of course we don’t know the exact implementation. Such unfavorable behavior of fundamental functions of a programming language should of course be fixed with high priority because string functions in particular, which process large amounts of data in loops, often occur and can then lead to very disappointing performance. Fixing such problems would probably also give the Xojo IDE a noticeably better running behavior.
However, other programming languages and developer environments also have their problems. At least I don’t know of any complex ones that don’t have bugs and sometimes annoying runtime behavior.
As the Xojo Community, we can help to point these out and also work together to improve the language. Xojo offers us so many unbelievably positive qualities that we should strive to promote further development in our favor. I know it’s a hassle sometimes, but this way we can contribute something positive.
I could originally have blamed Xojo for my method not running fast enough. But thanks to the support of experienced programmers, I first learned a lot about Xojo and, secondly, got the opportunity to find out a fundamental cause of poor performance of string manipulations and was able to point this out to the Xojo developers. So if the Xojo IDE and our own Xojo developments run a little faster in the future, we can all say that we have contributed to something very useful in Xojo.

Carsten_Belling · September 7, 2021, 8:00pm

Rainer, especially if you need speed you should always know hiow to handle your data. I agree with Thomas, that not all is implemented in the best way in the Xojo framework and there‘s room for optimizations in many places. But, it‘s not impossible to create fast data processing wirh Xojo; we have memory blocks and pointer, very fast array processing etc. Again, you need to know your data and your tools. And Xojo has these.

Rainer_Hofmann · September 8, 2021, 6:50am

Carsten, of course you are absolutely right. But if fundamental functions show unexpectedly exponential runtime behavior, then it can happen that, even as a good software developer, you test with samples that are too small and afterwards the problems arise for the customer, because perhaps only the functionality was tested, but with too little data. I think it is essential that a programming language should be efficient and fast at its core. Of course, you can always do the rest yourself. The overall efficiency of a programming language naturally also results from how quickly and easily can be developed. If you have to continually write workarounds for standard functionalities or components, I don’t take that for granted.

Carsten_Belling · September 8, 2021, 7:34am

I’m not talking about workarounds but about functionality that Xojo already has. And here’s my point: As a “good software developer” you have to know about your data and you have stress tests too. If you know that your software should work with “big data”, you have to test this too, of course. I see no reason to blame the tool you use for bad implementation of functionality that’s not intended for a solution you need.

Rainer_Hofmann · September 8, 2021, 8:37am

I was talking about core functionality everybody is using of a programming language. Of couse, if you know all different kind of data your customers are using you will test against it. Often you don’t know exactly all these data informations and the amount of data might change with the time as well. Then you should be able to trust core functionalities of your programing language. For example, I expect that the sqrt function is working with smaller double type numbers comparable fast as with bigger double type numbers.

Beatrix_Willius · September 8, 2021, 9:33am

Often Xojo is like a car: one size fits everyone. If you want a race car you have to implement it mostly yourself.

In my app you can select a number of files. No big deal you would think. However, when a user selected a large number of files the beachball was visible. FileListMBS was quite a bit faster. But even there I got the beachball. Now there is a parameter to give other apps a bit of processing time.

Rainer_Hofmann · September 10, 2021, 3:54pm

Unfortunately, I have seen that adding integer values to arrays has that weired exponential timing issue as well:

Test: integer array add

10.000.000, took 1.361 seconds
20.000.000, took 3.363 seconds
30.000.000, took 9.975 seconds
40.000.000, took 22.796 seconds

…and doing the same with double values is faster but still exponential:

Test 2: double array add

10.000.000, took 1.351 seconds
20.000.000, took 3.341 seconds
30.000.000, took 5.979 seconds
40.000.000, took 9.500 seconds

There is a lot of performance improvement potential in Xojo, I guess.

Christian_Schmitz · September 10, 2021, 5:13pm

Resize the array at the beginning if you don’t like it.

And read here:

Fast character access of a string

Array size allocation in Xojo