If you’re interested, here’s the highly optimised algorithm that works exactly as desired with all UFT-8 character sets. Of course you can add other whitespace characters than 9, 10, 13 and 32, but the current ones are the most common.
Further optimization suggestions are welcome.
[code]Private Sub Split(myText As String) As String() #Pragma BackgroundTasks False #Pragma BoundsChecking False #Pragma NilObjectChecking False #Pragma StackOverflowChecking False
Dim matches() As String
If myText = “” Then
Return matches
End If
Dim mb As MemoryBlock = myText
Dim p As Ptr = mb
Dim matchLength, matchEnd, limit As Integer
Dim testchar As UInt8
Dim isWhitespace As Boolean
Dim encoding as TextEncoding = Encodings.UTF8
limit = mb.Size - 1
For matchEnd = 0 To limit
testchar = p.UInt8(matchEnd)
isWhitespace = (testchar = 32) Or (testchar = 13) Or (testchar = 10) Or (testchar = 9)
If isWhitespace Then
If matchLength > 0 Then
matches.AddRow(mb.StringValue(matchEnd - matchLength, matchLength).DefineEncoding(encoding))
matchLength = 0 ' Reset matchLength
End If
matches.AddRow(" ")
Else
matchLength = matchLength + 1
End If
Next
If matchLength > 0 Then
matches.AddRow(mb.StringValue(matchEnd - matchLength, matchLength).DefineEncoding(encoding))
End If
If you’re working with long strings, it would be faster to pre-dimension the array, then trim it back at the end. I’d Redim it to limit since there can’t be more matches than there are bytes.
If you’re not expecting control characters, you can probably consider any byte <= 32 as white space, no?
I know you’ve tested it and it works, but I would still change the reference to matchEnd at the end, outside the loop, to mb.Size or limit + 1.
At the top of the method, make sure the string isn’t empty.
In the end though, you’d probably be shaving off milliseconds at best.
More optimization ideas: do you always use every string?
If not, then instead of processing the strings in the loop, just save an array of Pairs that rememer the start:Length values - then wrap this in a class with a getter so the string extracdtion and the call to .DefineEncoding() are only done on the fly as needed.
I don’t know if Encodings.UTF8 is a function call that is expensive, but you could cache it outside the loop just in case:
dim enc as TextEncoding = Encodings.UTF8
For matchEnd = 0 To limit
...
Good catch Michael, this increases the speed ones more. Thanks.
Nice, I’ve updated this in my code posting.
The strings are variable and I don’t have a chance to predict beforehand that the user will type in just two words or millions of them at the top. Would you have an example of how this could be taken into account in your proposal?
I just compared the speed of the RegEx pattern code?(\S+)(\s*)?[/code] with the speed of the MemoryBlock. It’s really amazing, a very long string with a total of 151628 words and spaces (mixed) with the RexEx pattern takes 18.626 seconds and the MemoryBlock method 0.174 seconds to generate the array with these 151628 entries. Really impressive.
Not even a little. I doubt the regex engine could beat this, but it matters not because Xojo’s RegEx implementation is so slow. Try the test in RegExRX to get a timing of the PCRE engine via the MBS plugin.
The way you’ve updated your code, I think an empty string will return nil, not an empty array. (I haven’t tested this though.) If so, is that what you wanted?
I was thinking this:
Function ...
dim matches() as string
if myText = "" then
return matches
end if
...
You’re right. The compiled version the MemoryBlock method takes 0,04537seconds. Amazing!
[quote=463947:@Kem Tekinay]BTW, it’s always a good idea to do your pragmas this way:
#if not DebugBuild #Pragma BackgroundTasks False #Pragma BoundsChecking False #Pragma NilObjectChecking False #Pragma StackOverflowChecking False #endif[/quote]
This slows down the method in the debugger again.