I need to quickly (as fast as possible), split a string… but not by a common delimiter. but rather by segments of repeating characters.
The string can be “n” characters long
It will consist of runs of repeating characters from 1 to “n” characters each
I need to be able to reiterate over the string and return the starting character of the pattern and the length of the pattern
something like this
s="11112233333344444111132"
start = 1
do
index=getPattern(s,index, length) // length would be a byref, and would return the start of the pattern
if index<0 then exit do
loop
which would return
start=1 , length=4 // “1s”
start=5 , length=2 // “2s”
start=7, length=6 // “3s”
start=13, length=5 // “4s”
start=18, length=4 // “1s”
start=22, length = 1 // 3
start=23, length = 1 // 2
Kind of like InStr, except we don’t “know” what we are looking for, all we know is it is anything BUT the “current” character
Uh… Norman… Instr requires you know what you are looking for… If I “KNEW” the character that started the next sequence , sure that could work… but I don’t
my example shows “1111122222333344…”
but it could very well be “99997773322147”
guessing it would be like the “C” function [quote]find_first_not_of[/quote]
it is just a rle encoding which there are lots of very simple implementations on the net that can be quickly ported
just avoid mid etc which can be slow
maybe shove everything in a memoryblock and use “byte” since 0-9 are still single byte in UTF-8, ASCII etc
while I am not using it as RLE (run length encoding, for those reading this that didn’t know ), it is in fact closely related
basicly it is holding a attribute for a corresponding character (for reasons I won’t go into, using Styleruns, or RTF is not an option due to the source of the text)
right now I’m iterating over the string char by char, and updating the attribute when it changes
I’d rather break it into bigger chunks to increase (hopefully) the overall speed
dim rx as new RegEx
rx.SearchPattern = "(\\d)\\g1*"
dim match as RegExMatch = rx.Search( theText )
while match isa RegExMatch
dim start as integer = match.SubExpressionStart( 0 ) + 1
dim char as string = match.SubExpressionString( 1 )
dim length as integer = match.SubExpressionString( 0 ).LenB
match = rx.Search()
wend
Private Function RLE(encode as string) as string
// something like this
//
// s="11112233333344444111132"
// start = 1
// do
// index=getPattern(s,index, length) // length would be a byref, and would return the start of the pattern
// if index<0 then exit do
// loop
// which would return
//
// start=1 , length=4 // "1s"
// start=5 , length=2 // "2s"
// start=7, length=6 // "3s"
// start=13, length=5 // "4s"
// start=18, length=4 // "1s"
// start=22, length = 1 // 3
// start=23, length = 1 // 2
dim results() as string
dim prev as string
dim count as integer
for i as integer = 1 to encode.Len()
if i = 1 then prev = encode.mid(i,1)
if encode.mid(i,1) = prev then
count = count + 1
else
results.append "[" + prev + "|" + str(count) + "]"
prev = encode.mid(i,1)
count = 1
end if
next
results.append "[" + prev + "|" + str(count) + "]"
return join(results,"")
End Function
Private Function RLE2(theText as string) as string
dim results() as string
dim rx as new RegEx
rx.SearchPattern = "(\\d)\\g1*"
dim match as RegExMatch = rx.Search( theText )
while match isa RegExMatch
dim start as integer = match.SubExpressionStartB( 0 ) + 1
dim char as string = match.SubExpressionString( 1 )
dim length as integer = match.SubExpressionString( 0 ).LenB
results.append "]" + char + "|" + str(length) + "]"
match = rx.Search()
wend
return join(results,"")
End Function
dim s as string = "11112233333344444111132"
dim startT1 as double = microseconds
for i as integer = 1 to 1000
dim r1 as string = rle(s)
next
dim endT1 as double = microseconds
dim startT2 as double = microseconds
for i as integer = 1 to 1000
dim r as string = rle2( s )
next
dim endT2 as double = microseconds
dim diff1 as double = endt1 - startT1
dim diff2 as double = endt2 - startT2
break
For this string size mine is quite a bit quicker (nearly 50% faster)
I’d expect different performance characteristics with
longer strings
strings that repeat characters more /or less
Some quick tests show there is a combination of length & repeating that makes one quicker than the other
dim s as string = "111111ggggggbbbbbbb888888"
is just about dead even with mine the tiniest bit faster
dim s as string = "123456789012345678901234567890"
Has mine faster
dim s as string = "11111bbbbbtttt6666611111bbbbbtttt6666611111bbbbbtttt6666611111bbbbbtttt66666"
but this string has the regex one about 50% the time of mine