break string by "pattern"

DaveS · December 9, 2016, 9:54pm

I need to quickly (as fast as possible), split a string… but not by a common delimiter. but rather by segments of repeating characters.
The string can be “n” characters long
It will consist of runs of repeating characters from 1 to “n” characters each
I need to be able to reiterate over the string and return the starting character of the pattern and the length of the pattern

something like this

 s="11112233333344444111132"
start = 1
do 
index=getPattern(s,index, length) // length would be a byref, and would return the start of the pattern
if index<0 then exit do
loop

which would return

start=1 , length=4 // “1s”
start=5 , length=2 // “2s”
start=7, length=6 // “3s”
start=13, length=5 // “4s”
start=18, length=4 // “1s”
start=22, length = 1 // 3
start=23, length = 1 // 2

Kind of like InStr, except we don’t “know” what we are looking for, all we know is it is anything BUT the “current” character

and all the characters will be [“0”-“9”]

Norman_P · December 9, 2016, 9:59pm

repeated calls to instr since you can

grab the first char
find the first spot in your string AFTER the first one that is NOT this char (there’s your first pattern & count)
grab this char and repeat

just use instr to jump through the string

DaveS · December 9, 2016, 10:08pm

Uh… Norman… Instr requires you know what you are looking for… If I “KNEW” the character that started the next sequence , sure that could work… but I don’t

my example shows “1111122222333344…”
but it could very well be “99997773322147”

guessing it would be like the “C” function [quote]find_first_not_of[/quote]

Norman_P · December 9, 2016, 10:25pm

it is just a rle encoding which there are lots of very simple implementations on the net that can be quickly ported
just avoid mid etc which can be slow
maybe shove everything in a memoryblock and use “byte” since 0-9 are still single byte in UTF-8, ASCII etc

DaveS · December 9, 2016, 10:30pm

while I am not using it as RLE (run length encoding, for those reading this that didn’t know ), it is in fact closely related

basicly it is holding a attribute for a corresponding character (for reasons I won’t go into, using Styleruns, or RTF is not an option due to the source of the text)

right now I’m iterating over the string char by char, and updating the attribute when it changes
I’d rather break it into bigger chunks to increase (hopefully) the overall speed

Kem_Tekinay · December 10, 2016, 3:23am

dim rx as new RegEx
rx.SearchPattern = "(\\d)\\g1*"

dim match as RegExMatch = rx.Search( theText )
while match isa RegExMatch
  dim start as integer = match.SubExpressionStart( 0 ) + 1
  dim char as string = match.SubExpressionString( 1 )
  dim length as integer = match.SubExpressionString( 0 ).LenB

  match = rx.Search()
wend

Not tested, but that should work.

Norman_P · December 10, 2016, 4:05am

Private Function RLE(encode as string) as string
  // something like this
  // 
  //  s="11112233333344444111132"
  // start = 1
  // do 
  // index=getPattern(s,index, length) // length would be a byref, and would return the start of the pattern
  // if index<0 then exit do
  // loop
  // which would return
  // 
  // start=1 , length=4 // "1s"
  // start=5 , length=2 // "2s"
  // start=7, length=6 // "3s"
  // start=13, length=5 // "4s"
  // start=18, length=4 // "1s"
  // start=22, length = 1 // 3
  // start=23, length = 1 // 2
  
  dim results() as string
  dim prev as string
  dim count as integer
  
  for i as integer = 1 to encode.Len()
    if i = 1 then prev = encode.mid(i,1)
    
    if encode.mid(i,1) = prev then 
      count = count + 1
    else
      results.append "[" + prev + "|" + str(count) + "]"
      prev = encode.mid(i,1)
      count = 1
    end if
    
  next
  
  results.append "[" + prev + "|" + str(count) + "]"
  
  return join(results,"")
  
End Function

Private Function RLE2(theText as string) as string
  dim results() as string
  
  dim rx as new RegEx
  rx.SearchPattern = "(\\d)\\g1*"
  
  dim match as RegExMatch = rx.Search( theText )
  while match isa RegExMatch
      dim start as integer = match.SubExpressionStartB( 0 ) + 1
      dim char as string = match.SubExpressionString( 1 )
      dim length as integer = match.SubExpressionString( 0 ).LenB
      
      results.append "]" + char + "|" + str(length) + "]" 
     
      match = rx.Search()
  wend
  
  return join(results,"")
  
End Function

		dim s as string = "11112233333344444111132"
		
		dim startT1 as double = microseconds
		for i as integer = 1 to 1000
				dim r1 as string = rle(s)
		next
		dim endT1 as double = microseconds
		
		dim startT2 as double = microseconds
		for i as integer = 1 to 1000
				dim r as string =  rle2( s )
		next
		dim endT2 as double = microseconds
		
		dim diff1 as double = endt1 - startT1
		dim diff2 as double = endt2 - startT2
		
		break

For this string size mine is quite a bit quicker (nearly 50% faster)
I’d expect different performance characteristics with

longer strings
strings that repeat characters more /or less

Some quick tests show there is a combination of length & repeating that makes one quicker than the other

		dim s as string = "111111ggggggbbbbbbb888888"

is just about dead even with mine the tiniest bit faster

		dim s as string = "123456789012345678901234567890"

Has mine faster

		dim s as string = "11111bbbbbtttt6666611111bbbbbtttt6666611111bbbbbtttt6666611111bbbbbtttt66666"

but this string has the regex one about 50% the time of mine