Fast character access of a string

Rainer_Hofmann · September 3, 2021, 5:03pm

Hello,

is there any way to access a single character of a string like mystring[3] for accessing the fourth character?
I know that I can access it with mystring.middle(3,1) but that function is too slow for fast string manipulations.

Thanks in advance!

MarkusR · September 3, 2021, 5:07pm

memoryblock?
https://documentation.xojo.com/api/language/memoryblock.html

DerkJ · September 3, 2021, 5:10pm

There is an example project: Advanced/Memoryblock/Fast string append

Christian_Schmitz · September 3, 2021, 5:16pm

The MiddleBytes function is faster than the normal Mid one, as it doesn’t need to count UTF-8 characters.

Rainer_Hofmann · September 3, 2021, 5:17pm

Thank you, I will try that one!

Kem_Tekinay · September 3, 2021, 5:17pm

It depends. Will any of the potential characters be multi-byte, like © or ?

Rainer_Hofmann · September 3, 2021, 5:18pm

Yes, I am going to use UTF8…

Kem_Tekinay · September 3, 2021, 5:20pm

Then you should use Middle, or you will only get a single byte of a multi-byte character.
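
To illustrate the difference described above, here is a minimal sketch (not from the original posts). It assumes the string literal is UTF-8, where "©" takes two bytes; the variable names are made up for the example.

```xojo
// Sketch only: character-based vs. byte-based access on a UTF-8 string.
Var s As String = "a©b" // "©" occupies two bytes in UTF-8

// Middle counts characters (0-based), so this yields the whole "©".
Var ch As String = s.Middle(1, 1)

// MiddleBytes counts bytes (0-based), so this yields only the first
// byte of "©", which is not a valid character on its own.
Var firstByte As String = s.MiddleBytes(1, 1)

// Length reports characters, Bytes reports bytes; they differ here.
System.DebugLog("Length: " + s.Length.ToString + ", Bytes: " + s.Bytes.ToString)
```

Whenever Length and Bytes differ, byte offsets and character offsets no longer line up, which is why the byte functions are only safe on pure single-byte text.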

Kem_Tekinay · September 3, 2021, 5:21pm

But if you need to do this repeatedly, use Split to get all the characters into an array, then use that for processing.
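
A rough sketch of the Split approach, assuming the whole string fits comfortably in memory as an array of one-character strings (the sample text and variable names are placeholders, not from the thread):

```xojo
// Sketch: split once, then index the array instead of calling Middle repeatedly.
Var source As String = "héllo wörld"

// An empty delimiter yields one array element per character,
// so multi-byte characters such as "é" stay intact.
Var chars() As String = source.Split("")

// Array access is by index, much like mystring[3] in other languages.
Var fourthChar As String = chars(3)

For Each ch As String In chars
  // process ch here
Next
```

The split itself costs one pass over the string, so this mainly pays off when each character is read more than once or in random order.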

Rainer_Hofmann · September 3, 2021, 5:25pm

I guess this will not really speed up my procedure, because I have to parse a table with about 5,000,000 rows…

Andrew_Lambert · September 3, 2021, 5:35pm

If you are reading them sequentially then perhaps wrapping it in a BinaryStream will help:

Dim mb As MemoryBlock = GetTheRawData()
Dim stream As New BinaryStream(mb)
Do Until stream.EOF
   Dim nextchar As String = stream.Read(1)
   [...]
Loop

Or

Dim mb As MemoryBlock = GetTheRawData()
Dim stream As New BinaryStream(mb)
stream.Position = 3
Dim fourthchar As String = stream.Read(1)
[...]

Rainer_Hofmann · September 3, 2021, 5:46pm

Thank you, I can try this.
But would that work for UTF8 characters as well?

Rainer_Hofmann · September 3, 2021, 6:05pm

OK, I have used stream.Read(1, Encodings.UTF8) now.
But unfortunately, after profiling I can see that my method is about 12.7% slower when using the MemoryBlock.
It seems I have to stick to the Middle() function…

Kem_Tekinay · September 3, 2021, 6:05pm

Can you code the sql to do what you need and return the results?

Rainer_Hofmann · September 3, 2021, 6:06pm

No, I am reading a CSV file.

Robert_Weaver · September 3, 2021, 6:51pm

You may find that a combination of techniques is best for speeding up the data processing. For example, if you need to find the end of line characters in order to separate the rows of data, then using the split function, with the end of line character as delimiter, is probably the fastest way to do that. Once you have your separate rows, you may find that another technique is faster for the remaining processing.

Rainer_Hofmann · September 3, 2021, 7:22pm

That’s exactly what I am doing. I just thought there had to be a function to access a single character at a given position of a string. The Middle function always has to count characters to find that position, which has to be slower. However, I don’t know the compilation or optimisation details.

Mike_D · September 3, 2021, 7:27pm

I would do this:

Read the entire file into memory.
Split the lines into an array using SplitB (since UTF-8 guarantees that all ASCII chars are one byte).
For each line, compare the length using Len() and LenB().
If they are the same, use a fast mode where all your code uses the byte versions, e.g. MidB(), LeftB() etc.
If the two lengths differ, then you do have actual Unicode characters, and you should process that line using the Unicode-aware versions (Len(), Mid() etc.), which are slower.
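
The per-line check could be sketched roughly as follows. This uses the classic function names from the steps above (Len, LenB, Mid, MidB, all 1-based); the sample rows stand in for lines already split from the file and are purely illustrative.

```xojo
// Sketch of the fast/slow path per line. The rows are hypothetical sample data.
Dim lines() As String = Array("plain,ascii,row", "row,with,© sign")

For Each line As String In lines
  Dim charCount As Integer = Len(line)
  // Equal character and byte counts mean every character is one byte.
  Dim singleByteOnly As Boolean = (charCount = LenB(line))

  For pos As Integer = 1 To charCount
    Dim ch As String
    If singleByteOnly Then
      ch = MidB(line, pos, 1) // byte access, no character counting
    Else
      ch = Mid(line, pos, 1)  // Unicode-aware, slower
    End If
    // process ch here
  Next
Next
```

The check runs once per line, so its cost is small compared with the per-character calls it lets you replace.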

Post some code and we’d be happy to critique

Rainer_Hofmann · September 3, 2021, 7:47pm

Not a bad idea.

I am new to Xojo and therefore my code will not look perfect.
It’s my own version of Split() because I have to take care about the double quotes as well.

However, this is my fastest version until now:

Function MySplit(Extends source As String, delimiter As String = " ") As String()
  Var cnt, len As Integer
  Var arr() As String
  Var str, ch As String
  Var instr As Boolean
  instr = False

  len = source.Length - 1
  For cnt = 0 To len
    ch = source.Middle(cnt, 1)
    If ch <> delimiter And ch <> """" Then
      str = str + ch
      Continue
    End If
    If ch = delimiter And instr = False Then
      arr.Add(str)
      str = ""
      Continue
    End If
    If ch = """" And instr = False Then
      instr = True
    ElseIf ch = """" And instr = True Then
      instr = False
    End If
  Next
  arr.Add(str)
  Return arr
End Function

I am sorry for the bad formatting from copy/paste.

Kem_Tekinay · September 3, 2021, 7:56pm

First, there are already CSV parsers available and you might want to use, or at least look at, one of those.

Second, your code won’t handle escaped quotes like \".

Third, what’s probably the slowest part of the code is str=str+ch. Concatenating strings like this tends to be slow because, each time, a new string is created and the old one destroyed. There are techniques you can use to speed that up such as using an array to “build” the string, then using String.FromArray to join the characters. (If you didn’t have to worry about escaped characters, you could just track the start and end of each segment and use Middle to extract it.)
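
As a hedged illustration of the array technique (a sketch, not a drop-in replacement for the code above), characters are collected in an array and joined once at the end. The input and names are placeholders.

```xojo
// Sketch: build a string from pieces without repeated concatenation.
Var source As String = "some field text"
Var pieces() As String

For Each ch As String In source.Split("")
  // Appending to an array avoids creating a new string on every step.
  pieces.Add(ch)
Next

// Join everything once; the empty delimiter puts nothing between the pieces.
Var result As String = String.FromArray(pieces, "")
```

In a splitter, the array would be cleared after each field is joined and stored, so one array can be reused for the whole line.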

Next, if the delimiter is going to be a single byte (as is usually the case when splitting CSV), you can still use a MemoryBlock to scan the string for the delimiter, then use StringValue to extract the segments. This is a more advanced technique though and some are not comfortable with MemoryBlocks.
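
A minimal sketch of that MemoryBlock idea, assuming a UTF-8 source and a comma as the single-byte delimiter. Quote handling is deliberately left out, and all names are hypothetical.

```xojo
// Sketch: scan raw bytes for a single-byte delimiter and cut out the segments.
Var source As String = "alpha,béta,gamma"
Var mb As MemoryBlock = source // the string's bytes
Var delimByte As Integer = 44  // byte value of ","
Var fields() As String
Var segStart As Integer = 0

For pos As Integer = 0 To mb.Size - 1
  If mb.Byte(pos) = delimByte Then
    // StringValue returns raw bytes, so mark them as UTF-8 again.
    fields.Add(mb.StringValue(segStart, pos - segStart).DefineEncoding(Encodings.UTF8))
    segStart = pos + 1
  End If
Next

// The last segment runs from the final delimiter to the end of the data.
fields.Add(mb.StringValue(segStart, mb.Size - segStart).DefineEncoding(Encodings.UTF8))
```

This works with multi-byte characters because no byte of a multi-byte UTF-8 sequence can have the value of an ASCII character such as the comma.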

Lastly, please use code tags when posting code to make it easier on us.