Iterating character speed

Sometimes you have to loop over the characters of a string in Xojo. Whether you want to count something, search for patterns, or replace some characters, the reasons are diverse, and performance may matter.

Let’s try four different ways and report how much time is needed for each function.

My test file is about 300,000 characters long, stored as UTF-8, and contains various German umlauts, so we have a couple of two-byte characters. All tests are made with the DisableBackgroundTasks pragma set to reduce background activity. For the timing we run each block 10 times to get an average duration.
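The timing setup described above could look roughly like the following sketch. It is meant to be pasted into a method; the names kRuns, total and averageMs are made up for illustration and are not taken from the original test project.

```xojo
#Pragma DisableBackgroundTasks

Const kRuns = 10 // number of repetitions, as described above
Dim total As Double = 0

For run As Integer = 1 To kRuns
	Dim startTime As Double = System.Microseconds
	// ... code under test goes here ...
	total = total + (System.Microseconds - startTime)
Next

// System.Microseconds counts microseconds, so divide by 1000 to get milliseconds
Dim averageMs As Double = total / kRuns / 1000.0
System.DebugLog("Average: " + Format(averageMs, "0.0") + " ms")
```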

String.Characters

The first way uses String.Characters, which creates an iterator over the characters. Basically it creates an Iterable object and converts the String to Text internally. It then creates an iterator object, and the For Each loop internally calls its MoveNext and Value functions, which includes wrapping the string for each character in a Variant. Here is the loop:

Dim m1 As Double = System.Microseconds
// using String.Characters (t holds the test text, n1 is an Integer counter declared earlier)
For Each c As String In t.Characters
	If Asc(c) = 13 Then
		n1 = n1 + 1
	End If
Next
Dim m2 As Double = System.Microseconds

In my test this takes about 550 ms to run over 300,000 characters of text. Let’s see if we can do better.

String.Split

We call String.Split with an empty string as the delimiter to split by characters. The function walks over the text, finds where each character begins and ends, copies each one into a new string, and adds it to an array. Then we traverse that array with a For Each loop:

Dim chars() As String = t.Split("")
// using Split
For Each c As String In chars
	If Asc(c) = 13 Then
		n2 = n2 + 1
	End If
Next
chars = Nil // free memory
Dim m3 As Double = System.Microseconds

In our test this takes about 110 ms on the same text.

StringCodePointsMBS

We added StringCodePointsMBS in version 21.3 of the MBS Xojo DataTypes Plugin. This function returns an array of UInt32 values representing the code points. We skip creating the string objects to save some time, and because the values are 32 bits wide we still correctly handle Unicode characters above 65535, which would not fit in 16-bit integers.

// using new StringCodePointsMBS function in 21.3
Dim values() As UInt32 = StringCodePointsMBS(t)
For Each codePoint As UInt32 In values
	If codePoint = 13 Then
		n3 = n3 + 1
	End If
Next
values = Nil // free memory
Dim m4 As Double = System.Microseconds

In our test this takes about 51 ms per run.

MemoryBlock

The fastest way is to not bother about Unicode characters and just look at the bytes. When you convert the string to a MemoryBlock, the bytes are copied and you can traverse the new memory block like this:

// using Memoryblock
Dim mem As MemoryBlock = t
Dim u As Integer = mem.Size - 1
For i As Integer = 0 To u
	If mem.UInt8Value(i) = 13 Then
		n4 = n4 + 1
	End If
Next
mem = Nil // free memory
Dim m5 As Double = System.Microseconds

This takes about 50 ms, just a bit faster than our plugin function. But please try it with an emoji such as 😀, where you would get a 4-byte memory block for a single character.
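The byte loop still works for this particular count because every byte of a multi-byte UTF-8 character has a value of 128 or above, so a byte equal to 13 is always a real carriage return. If you also want to know how many code points the bytes represent, a minimal sketch is to skip the UTF-8 continuation bytes, which all have the bit pattern 10xxxxxx. This assumes t is encoded as UTF-8; the counter name codePoints is made up for illustration.

```xojo
// Sketch: count code points in a UTF-8 string by looking only at the bytes.
// Assumes t is a String whose encoding is UTF-8.
Dim mem As MemoryBlock = t
Dim codePoints As Integer = 0 // hypothetical counter name
Dim u As Integer = mem.Size - 1
For i As Integer = 0 To u
	// continuation bytes look like 10xxxxxx; every other byte starts a code point
	If (mem.UInt8Value(i) And &hC0) <> &h80 Then
		codePoints = codePoints + 1
	End If
Next
```

For a four-byte emoji, only the first byte passes the check, so the four bytes are counted as one code point.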

Feel free to try our StringCodePointsMBS in the next pre-release of MBS Xojo Plugins 21.3.


I just had a look at the new function. But I can’t do much with the result because the inverse function FromUnicodeCodepoint is only available for Text and not for String.

Encodings.UTF8.Chr(x) maybe instead?
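A rough sketch of that idea, assuming the goal is to filter code points and then get a String back. The variable names parts and result and the filter condition are only examples. Collecting the pieces in an array and joining once at the end avoids repeated string concatenation, but it still creates one String per code point.

```xojo
// Sketch: filter code points and build a String again.
// StringCodePointsMBS needs MBS Xojo DataTypes Plugin 21.3 or later.
Dim values() As UInt32 = StringCodePointsMBS(t)
Dim parts() As String
For Each codePoint As UInt32 In values
	If codePoint <> 13 Then // example filter: drop carriage returns
		parts.Add(Encodings.UTF8.Chr(codePoint))
	End If
Next
Dim result As String = String.FromArray(parts, "")
```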

One note: .Characters will return decomposed characters as multiple code points in a single pass whereas Split and, I think, the MBS function will split them.


In my preliminary tests, using StringCodePointsMBS together with Encodings.UTF8.Chr(x) is significantly slower than a simple Split with my algorithm.

Well, the speed differences of course have a reason.
If you build strings from the code points, you are better off using Split().

That's what I saw when I tested Christian's code from his blog. The MemoryBlock and Split methods don't take care of composed characters. That's why Xojo introduced String.Characters.
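To see the difference yourself, a small test along these lines may help. It builds an "e" followed by the combining acute accent U+0301, which is displayed as a single character, and counts the pieces each approach reports. The variable names are made up, and the results for Characters and Split are left for you to check in your Xojo version.

```xojo
// Sketch: compare how many pieces each approach reports for a decomposed character
Dim s As String = "e" + Encodings.UTF8.Chr(&h0301) // e + combining acute accent

Dim viaCharacters As Integer = 0
For Each c As String In s.Characters
	viaCharacters = viaCharacters + 1
Next

Dim pieces() As String = s.Split("")
Dim viaSplit As Integer = pieces.Count

// In UTF-8 this is 1 byte for "e" plus 2 bytes for U+0301
Dim viaBytes As Integer = s.Bytes

System.DebugLog("Characters: " + viaCharacters.ToString + ", Split: " + viaSplit.ToString + ", Bytes: " + viaBytes.ToString)
```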

If there is interest, I may just add another function to get you composed characters.
But for most of the processing where I would use this function, I wouldn't need that detail.
It is just for walking over characters and filtering a bit or counting something.


I didn’t mean to imply this was a drawback of your function, merely a difference.