Fast character access of a string

Christian_Schmitz · September 5, 2021, 5:04pm

The problem is that the Middle() function gets called many times in the loop, and each time it has to start the search again. There is not much to optimize, unless you want to build a lookup table.

Beatrix_Willius · September 5, 2021, 5:20pm

Interesting discussion. The blog article on the speed was nice. I have a recursive regex that isn’t very fast. Unfortunately, not even the regex from MBS made a difference.

Rainer_Hofmann · September 5, 2021, 6:31pm

Oh, I didn’t know that, because I haven’t had to deal with those details until now.
That makes these things a lot easier, though.

Mike_D · September 5, 2021, 6:32pm

I’m mostly still using API1 - in API2, does String have a ForEach iterator? That would be a nice addition and much faster when you are iterating through the entire set of characters.

Rainer_Hofmann · September 5, 2021, 6:33pm

Yes, a lookup table would be a good solution to prevent the call overhead of the Middle() function as well.

Martin_T · September 5, 2021, 6:33pm

Look at String.Characters

Mike_D · September 5, 2021, 6:36pm

Cool! In theory, that should be a lot faster than repeated calls to String.Middle()

@Christian_Schmitz - can you add this to your benchmark suite and update the blog post?

Rainer_Hofmann · September 5, 2021, 6:37pm

Yes, there is. If it’s a fast implementation, it could do the job as well.

Rainer_Hofmann · September 5, 2021, 6:56pm

Strange, in my case it’s much slower if I am using a ‘for each’ loop (6275ms vs 3173ms).

var cnt,len as integer
var arr() as String
var str,ch as string
var instr as Boolean
instr=false

len=source.Length-1
//for cnt=0 to len
for each ch in source.Characters
  //ch=source.middle(cnt,1)
  if ch<>delimiter and ch<>"""" then
    str=str+ch
    Continue
  end if
  if ch=delimiter and instr=false then
    arr.Add(str)
    str=""
    Continue
  end if
  if ch="""" and instr=false then
    instr=true
  elseif ch="""" and instr=true then
    instr=false
  end if
next
arr.Add(str) 
return arr()

DerkJ · September 5, 2021, 6:58pm

how about:

var cnt,len as integer
var arr() as String
var str,ch as string
var instr as Boolean
instr=false

len=source.Length-1
Var sourceChars() As String = Source.characters
for each ch in sourceChars // <- to see if the function recalling is slowing down
  //ch=source.middle(cnt,1)
  if ch<>delimiter and ch<>"""" then
    str=str+ch
    Continue
  end if
  if ch=delimiter and instr=false then
    arr.Add(str)
    str=""
    Continue
  end if
  if ch="""" and instr=false then
    instr=true
  elseif ch="""" and instr=true then
    instr=false
  end if
next

Mike_D · September 5, 2021, 7:00pm

Odd, it should be faster.
A few general suggestions:

add #Pragma DisableBackgroundTasks and #Pragma NilObjectChecking False
run tests in a built app (not in the IDE) to get the fastest (and most consistent) measurements
the equality operator (=) is probably doing a slow, Unicode-savvy, case-insensitive string comparison.
It should be much faster to use String.Compare with a case-sensitive comparison, or even just use the old StrComp in binary mode?
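
A rough sketch of how these suggestions could look together. The helper CountDelimiters and its parameters are made up for illustration and do not come from the thread. It assumes the API 2 String.Compare method with ComparisonOptions.CaseSensitive. Whether it is really faster than = would still have to be measured in a built app.

```xojo
// Hypothetical helper, not from the thread:
// counts how often delimiter occurs in source.
Function CountDelimiters(source As String, delimiter As String) As Integer
  // Pragmas suggested above; they apply to this method only.
  #Pragma DisableBackgroundTasks
  #Pragma NilObjectChecking False
  
  Var count As Integer
  For Each ch As String In source.Characters
    // Compare returns 0 when both strings are equal.
    // CaseSensitive avoids the case-insensitive comparison that = performs.
    If ch.Compare(delimiter, ComparisonOptions.CaseSensitive) = 0 Then
      count = count + 1
    End If
  Next
  Return count
End Function
```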

Rainer_Hofmann · September 5, 2021, 7:03pm

with this code I get a Type Mismatch error…

DerkJ · September 5, 2021, 7:05pm

I see it returns an iterable, my mistake…

how about String.SplitBytes("")
http://documentation.xojo.com/api/data_types/string.html#string-splitBytes

Perhaps that one could be faster if you read it into a property first?
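
A minimal sketch of the "read it into an array first" idea. It uses String.Split with an empty delimiter instead of SplitBytes, on the assumption that Split("") returns one array element per character. Here source stands for the String being parsed, as in the code earlier in the thread. Unlike String.Characters, this yields a real String array, so the assignment should not raise the Type Mismatch error.

```xojo
// Assumption: Split with an empty delimiter returns one element per character.
Var chars() As String = source.Split("")

Var ch As String
For Each ch In chars
  // ... same per-character logic as in the loops above ...
Next
```

The array is a kind of lookup table: it costs extra memory, and building it takes one pass over the string, so it only pays off if that pass is cheaper than the repeated Middle() calls.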

Rainer_Hofmann · September 5, 2021, 7:09pm

We are just curious why the ‘for each’ version is slower. I already have a much faster solution for my original problem. But thank you for your suggestion.

DerkJ · September 5, 2021, 7:11pm

Perhaps because it’s a class interface that may be re-creating everything on every iteration.
We are aware you have a fast solution. It may still be interesting to get more results, since @Christian_Schmitz may add them to his blog post.

Rainer_Hofmann · September 5, 2021, 7:21pm

I can only say that if I replace these lines of my code:

for cnt=0 to len
    ch=source.middle(cnt,1)

with

for each ch in source.Characters

…it is much slower than before.

Mike_D · September 5, 2021, 10:23pm

@Rainer_Hofmann I’m getting very different results. What version of Xojo are you using? What OS?

I created 3 tests.

“Middle()” was by far the slowest
for each ch in source.characters was about 40x as fast
ch = MiddleBytes() was about 118x as fast as Middle()

Test 1: for i = 0 to u ; ch = source.middle(i,1)
Took 6.952 seconds

Test 2: for each ch in source.characters
Took 0.180 seconds

Test 3: for i = 0 to u ; ch = MiddleBytes(i,1)
Took 0.058 seconds

Using Xojo 2021 R 2.1 on macOS 11.5.2 Big Sur (Intel)

Project file: https://xochi.com/xojo/unicode/characters1.xojo_binary_project
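
The three tests could be timed roughly as follows. This is a sketch and not the code from the linked project: source is assumed to be a String holding the test data, and the variable names are illustrative.

```xojo
// Sketch only: source is assumed to be a String holding the test data.
#Pragma DisableBackgroundTasks

Var ch As String
Var u As Integer = source.Length - 1

// Test 1: character access by index
Var t0 As Double = System.Microseconds
For i As Integer = 0 To u
  ch = source.Middle(i, 1)
Next
Var middleSeconds As Double = (System.Microseconds - t0) / 1000000

// Test 2: iterator over the characters
t0 = System.Microseconds
For Each c As String In source.Characters
  ch = c
Next
Var iteratorSeconds As Double = (System.Microseconds - t0) / 1000000

// Test 3: byte access by index. This matches character access only for
// single-byte text (for example plain ASCII). For multi-byte UTF-8 text,
// MiddleBytes returns raw bytes and the upper bound would be source.Bytes - 1.
t0 = System.Microseconds
For i As Integer = 0 To u
  ch = source.MiddleBytes(i, 1)
Next
Var bytesSeconds As Double = (System.Microseconds - t0) / 1000000
```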

Rainer_Hofmann · September 6, 2021, 3:49am

Interesting results! I am using Xojo 2021 R 2.1 on macOS 11.5.2 Big Sur (M1).
I will have more time to look into it in the evening (European time).

Rainer_Hofmann · September 6, 2021, 6:11am

I was able to run your project, but this time with Xojo 2021 R 2.1 on Linux (Pop!_OS, Intel).
I had to reduce the number of fields because it’s running on a slow notebook. By the way, you swapped the labels Test1 and Test2 when writing to TextArea1 (not important, though).

Interestingly enough, I got very different results again:
Test 1: for i = 0 to u ; ch = source.middle(i,1)
Fields=1000
Took 0.219 seconds

Test 2: for each ch in source.characters
Fields=1000
Took 8.619 seconds

Test 3: for i = 0 to u ; ch = MiddleBytes(i,1)
Fields=1000
Took 0.021 seconds

So, here again, the solution with for each is much slower!

Rainer_Hofmann · September 6, 2021, 10:12am

Further analysis of the ‘for each’ solution shows faster-than-linear (roughly quadratic) timing behavior as the number of fields increases:

Test 2: for each ch in source.characters
Fields=100
Took 0.119 seconds

Test 2: for each ch in source.characters
Fields=200
Took 0.386 seconds

Test 2: for each ch in source.characters
Fields=300
Took 0.828 seconds

Test 2: for each ch in source.characters
Fields=400
Took 1.437 seconds

Test 2: for each ch in source.characters
Fields=500
Took 2.225 seconds
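
A quick check of the numbers above, assuming the time grows as a power of the number of fields (time ≈ c · Fields^k). The exponent k can be estimated from two of the measurements; the variable names are illustrative.

```xojo
// Timings copied from the measurements above (Fields=100 and Fields=500).
Var t100 As Double = 0.119
Var t500 As Double = 2.225

// If time grows like Fields^k, then k = Log(t500 / t100) / Log(500 / 100).
// Log is the natural logarithm.
Var k As Double = Log(t500 / t100) / Log(500 / 100)
// k comes out at about 1.8
```

An exponent close to 2 means that doubling the input roughly quadruples the time, which is closer to quadratic than to exponential growth. That would fit an iterator that does work proportional to the current position at each step.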