Looking for a faster, alternative MemoryBlock algorithm

Martin_T · November 20, 2019, 2:42pm

If you’re interested, here’s the highly optimised algorithm that works exactly as desired with all UFT-8 character sets. Of course you can add other whitespace characters than 9, 10, 13 and 32, but the current ones are the most common.

Further optimization suggestions are welcome.

[code]Private Sub Split(myText As String) As String()
#Pragma BackgroundTasks False
#Pragma BoundsChecking False
#Pragma NilObjectChecking False
#Pragma StackOverflowChecking False

Dim matches() As String

If myText = “” Then
Return matches
End If

Dim mb As MemoryBlock = myText
Dim p As Ptr = mb

Dim matchLength, matchEnd, limit As Integer
Dim testchar As UInt8
Dim isWhitespace As Boolean
Dim encoding as TextEncoding = Encodings.UTF8

limit = mb.Size - 1

For matchEnd = 0 To limit

testchar = p.UInt8(matchEnd)
isWhitespace = (testchar = 32) Or (testchar = 13) Or (testchar = 10) Or (testchar = 9)

If isWhitespace Then

  If matchLength > 0 Then      
    matches.AddRow(mb.StringValue(matchEnd - matchLength, matchLength).DefineEncoding(encoding))
    matchLength = 0 ' Reset matchLength      
  End If

  matches.AddRow(" ")

Else    
  matchLength = matchLength + 1    
End If

Return matches

End Sub[/code]

Kem_Tekinay · November 20, 2019, 3:08pm

Suggestion: rename matchStart to matchLength.

If you’re working with long strings, it would be faster to pre-dimension the array, then trim it back at the end. I’d Redim it to limit since there can’t be more matches than there are bytes.

If you’re not expecting control characters, you can probably consider any byte <= 32 as white space, no?

I know you’ve tested it and it works, but I would still change the reference to matchEnd at the end, outside the loop, to mb.Size or limit + 1.

At the top of the method, make sure the string isn’t empty.

In the end though, you’d probably be shaving off milliseconds at best.

Mike_D · November 20, 2019, 3:15pm

Looks pretty good.

More optimization ideas: do you always use every string?

If not, then instead of processing the strings in the loop, just save an array of Pairs that rememer the start:Length values - then wrap this in a class with a getter so the string extracdtion and the call to .DefineEncoding() are only done on the fly as needed.

I don’t know if Encodings.UTF8 is a function call that is expensive, but you could cache it outside the loop just in case:

  dim enc as TextEncoding = Encodings.UTF8
  For matchEnd = 0 To limit
  ...

Martin_T · November 20, 2019, 3:43pm

Good catch Michael, this increases the speed ones more. Thanks.

Nice, I’ve updated this in my code posting.

The strings are variable and I don’t have a chance to predict beforehand that the user will type in just two words or millions of them at the top. Would you have an example of how this could be taken into account in your proposal?

Why? What’s the thought behind it?

Done.

Martin_T · November 20, 2019, 4:03pm

I just compared the speed of the RegEx pattern code?(\S+)(\s*)?[/code] with the speed of the MemoryBlock. It’s really amazing, a very long string with a total of 151628 words and spaces (mixed) with the RexEx pattern takes 18.626 seconds and the MemoryBlock method 0.174 seconds to generate the array with these 151628 entries. Really impressive.

Norman_Palardy · November 20, 2019, 4:05pm

oh that sounds liek a challenge kem will take up

Kem_Tekinay · November 20, 2019, 4:11pm

Not even a little. I doubt the regex engine could beat this, but it matters not because Xojo’s RegEx implementation is so slow. Try the test in RegExRX to get a timing of the PCRE engine via the MBS plugin.

Kem_Tekinay · November 20, 2019, 4:15pm

Function ...
  ...
  limit = mb.Size - 1
  redim matches( limit )
  dim arrIndex as integer = -1

 ...
  arrIndex = arrIndex + 1
  matches( arrIndex ) = mb.StringValue( ...

...
  redim matches( arrIndex )
  return matches

Kem_Tekinay · November 20, 2019, 4:19pm

The way you’ve updated your code, I think an empty string will return nil, not an empty array. (I haven’t tested this though.) If so, is that what you wanted?

I was thinking this:

Function ...
  dim matches() as string
  if myText = "" then
    return matches
  end if

...

Tim_Parnell · November 20, 2019, 4:21pm

That also has the benefit of removing an unnecessary nesting level.

Martin_T · November 20, 2019, 4:30pm

RegExRX Search Time for the same long String: 66,17 ms. Much faster! Only the filling of the Listbox takes a very long time as expected.

Kem_Tekinay · November 20, 2019, 4:35pm

You’re testing in a compiled version, yes?

BTW, it’s always a good idea to do your pragmas this way:

#if not DebugBuild
  #Pragma BackgroundTasks False
  #Pragma BoundsChecking False
  #Pragma NilObjectChecking False
  #Pragma StackOverflowChecking False
#endif

Martin_T · November 20, 2019, 4:39pm

No, I use the Code Profiler. So I test the debugged version.[quote=463947:@Kem Tekinay]BTW, it’s always a good idea to do your pragmas this way:

#if not DebugBuild
#Pragma BackgroundTasks False
#Pragma BoundsChecking False
#Pragma NilObjectChecking False
#Pragma StackOverflowChecking False
#endif[/quote]
Advantages/disadvantages?

Kem_Tekinay · November 20, 2019, 4:41pm

I don’t know if it still happens, but that used to kill the ability to debug.

Compile and test again as the debug build is always considerably slower for a lot of things.

Kem_Tekinay · November 20, 2019, 4:42pm

(I’d bet that, once complied, you’ll wonder if it actually did anything. )

Martin_T · November 20, 2019, 4:44pm

You’re right. The compiled version the MemoryBlock method takes 0,04537seconds. Amazing!

[quote=463947:@Kem Tekinay]BTW, it’s always a good idea to do your pragmas this way:

#if not DebugBuild
#Pragma BackgroundTasks False
#Pragma BoundsChecking False
#Pragma NilObjectChecking False
#Pragma StackOverflowChecking False
#endif[/quote]
This slows down the method in the debugger again.

Kem_Tekinay · November 20, 2019, 4:48pm

Unless you’re planning to only use this through the debugger, I wouldn’t worry about that.