Looking for a faster, alternative MemoryBlock algorithm

If you’re interested, here’s the highly optimised algorithm that works exactly as desired with all UFT-8 character sets. Of course you can add other whitespace characters than 9, 10, 13 and 32, but the current ones are the most common.

Further optimization suggestions are welcome.

[code]Private Sub Split(myText As String) As String()
#Pragma BackgroundTasks False
#Pragma BoundsChecking False
#Pragma NilObjectChecking False
#Pragma StackOverflowChecking False

Dim matches() As String

If myText = “” Then
Return matches
End If

Dim mb As MemoryBlock = myText
Dim p As Ptr = mb

Dim matchLength, matchEnd, limit As Integer
Dim testchar As UInt8
Dim isWhitespace As Boolean
Dim encoding as TextEncoding = Encodings.UTF8

limit = mb.Size - 1

For matchEnd = 0 To limit

testchar = p.UInt8(matchEnd)
isWhitespace = (testchar = 32) Or (testchar = 13) Or (testchar = 10) Or (testchar = 9)

If isWhitespace Then

  If matchLength > 0 Then      
    matches.AddRow(mb.StringValue(matchEnd - matchLength, matchLength).DefineEncoding(encoding))
    matchLength = 0 ' Reset matchLength      
  End If

  matches.AddRow(" ")

Else    
  matchLength = matchLength + 1    
End If

Next

If matchLength > 0 Then
matches.AddRow(mb.StringValue(matchEnd - matchLength, matchLength).DefineEncoding(encoding))
End If

Return matches

End Sub[/code]

Suggestion: rename matchStart to matchLength.

If you’re working with long strings, it would be faster to pre-dimension the array, then trim it back at the end. I’d Redim it to limit since there can’t be more matches than there are bytes.

If you’re not expecting control characters, you can probably consider any byte <= 32 as white space, no?

I know you’ve tested it and it works, but I would still change the reference to matchEnd at the end, outside the loop, to mb.Size or limit + 1.

At the top of the method, make sure the string isn’t empty.

In the end though, you’d probably be shaving off milliseconds at best.

Looks pretty good.

More optimization ideas: do you always use every string?

If not, then instead of processing the strings in the loop, just save an array of Pairs that rememer the start:Length values - then wrap this in a class with a getter so the string extracdtion and the call to .DefineEncoding() are only done on the fly as needed.

I don’t know if Encodings.UTF8 is a function call that is expensive, but you could cache it outside the loop just in case:

  dim enc as TextEncoding = Encodings.UTF8
  For matchEnd = 0 To limit
  ...

Good catch Michael, this increases the speed ones more. Thanks.

Nice, I’ve updated this in my code posting.

The strings are variable and I don’t have a chance to predict beforehand that the user will type in just two words or millions of them at the top. Would you have an example of how this could be taken into account in your proposal?

Why? What’s the thought behind it?

Done.

I just compared the speed of the RegEx pattern code?(\S+)(\s*)?[/code] with the speed of the MemoryBlock. It’s really amazing, a very long string with a total of 151628 words and spaces (mixed) with the RexEx pattern takes 18.626 seconds and the MemoryBlock method 0.174 seconds to generate the array with these 151628 entries. Really impressive.

oh that sounds liek a challenge kem will take up :stuck_out_tongue:

Not even a little. I doubt the regex engine could beat this, but it matters not because Xojo’s RegEx implementation is so slow. Try the test in RegExRX to get a timing of the PCRE engine via the MBS plugin.

Function ...
  ...
  limit = mb.Size - 1
  redim matches( limit )
  dim arrIndex as integer = -1

 ...
  arrIndex = arrIndex + 1
  matches( arrIndex ) = mb.StringValue( ...

...
  redim matches( arrIndex )
  return matches

The way you’ve updated your code, I think an empty string will return nil, not an empty array. (I haven’t tested this though.) If so, is that what you wanted?

I was thinking this:

Function ...
  dim matches() as string
  if myText = "" then
    return matches
  end if

...

That also has the benefit of removing an unnecessary nesting level.

RegExRX Search Time for the same long String: 66,17 ms. Much faster! Only the filling of the Listbox takes a very long time as expected.

You’re testing in a compiled version, yes?

BTW, it’s always a good idea to do your pragmas this way:

#if not DebugBuild
  #Pragma BackgroundTasks False
  #Pragma BoundsChecking False
  #Pragma NilObjectChecking False
  #Pragma StackOverflowChecking False
#endif

No, I use the Code Profiler. So I test the debugged version.[quote=463947:@Kem Tekinay]BTW, it’s always a good idea to do your pragmas this way:

#if not DebugBuild
#Pragma BackgroundTasks False
#Pragma BoundsChecking False
#Pragma NilObjectChecking False
#Pragma StackOverflowChecking False
#endif[/quote]
Advantages/disadvantages?

I don’t know if it still happens, but that used to kill the ability to debug.

Compile and test again as the debug build is always considerably slower for a lot of things.

(I’d bet that, once complied, you’ll wonder if it actually did anything. :slight_smile: )

You’re right. The compiled version the MemoryBlock method takes 0,04537seconds. Amazing!

[quote=463947:@Kem Tekinay]BTW, it’s always a good idea to do your pragmas this way:

#if not DebugBuild
#Pragma BackgroundTasks False
#Pragma BoundsChecking False
#Pragma NilObjectChecking False
#Pragma StackOverflowChecking False
#endif[/quote]
This slows down the method in the debugger again.

Unless you’re planning to only use this through the debugger, I wouldn’t worry about that.