Looking for a faster, alternative MemoryBlock algorithm

Martin_T · August 4, 2019, 12:27pm

Hi, I have the following RegEx pattern which splits a text into words.

(\\s*)?([^\\s]+)(\\s*)?

The pattern recognizes spaces before and after a word. It works well and returns exactly the results I want to get.

Since I might want to split very long texts, I wonder if using this RegEx pattern is the best option, or if another algorithm would be better. For a text consisting of 50000 words the RegEx pattern needs 37,119 ms according to @Kem Tekinay RegExRX-App. Do you have an idea/approach for another algorithm and, what could such an algorithm look like?

Thank you very much for your feedback.

Greg_O_Lone · August 4, 2019, 1:06pm

I take it Split( s, ) isnt working for you?

Martin_T · August 4, 2019, 1:13pm

No it won’t work for me, because it deletes some whitespaces at the results array. I think in this case an algorithm really has to go through every single character and follow a pattern. Like I said, I don’t know if it’s really faster.

Kem_Tekinay · August 4, 2019, 1:37pm

Keep in mind that RegExRX uses the MBS plugin as its engine. The built-in RegEx class is, unfortunately, much, much slower for this type of work.

Your pattern can be rewritten like this:

(\\s*)(\\S+)(\\s*)

Keep in mind that this will only capture leading white space in the very first occurrence. All subsequence matches will only include trailing white space because its leading white space was captured by the previous match. For consistency, you might consider this:

(\\S+)(\\s*)

Having said all that, it might be faster to use a MemoryBlock to scan for the right bytes using a Ptr.

Martin_T · August 4, 2019, 1:43pm

Thanks Kem for your reply. I forgot to add, that it will only match the whitespaces before a word for the very first word.

To your suggestion, how could i handle this with memoryblocks? I have no experiences with them. Any tips please?

Norman_Palardy · August 4, 2019, 1:49pm

and it doesnt take into account things like punctuation so “foo,bar” is a “word” because there is no white space

Kem_Tekinay · August 4, 2019, 1:50pm

Off the top of my head and untested:

dim mb as MemoryBlock = myText
dim p as Ptr = mb

dim matchStart as integer
dim matchEnd as integer

dim matches() as string
dim stage as integer

while matchEnd < m.Size
  dim isWhitespace as boolean
  select case p.Byte( matchEnd )
  case 9, 10, 13, 32
    isWhitespace = true
  end select
  
  select case stage
  case 0 // Just started
    if not isWhitespace then
      stage = 1
    end if

  case 1 // We were in some text
    if isWhitespace then
      stage = 2
    end if

  case 2 // In whitespace after the text
    if not isWhitespace then
      matches.Append mb.StringValue( matchStart, matchEnd - matchStart )
      matchStart = matchEnd
      stage = 1
    end if

  end select
  
  matchEnd = matchEnd + 1
wend

if matchStart <= ( mb.Size - 1 ) then
  matches.Append mb.StringValue( matchStart, mb.Size - matchStart )
end if

Kem_Tekinay · August 4, 2019, 1:55pm

Right, that’s how the pattern is written. Otherwise, use “\w” or, better, “\pL” instead of “\S”.

Norman_Palardy · August 4, 2019, 2:08pm

certainly Mr O’Brien would dislike those patterns

(\\w+)(\\s*)

with “If Mr O’Brien” give me 3 matches
gotta love languages

Mike_D · August 4, 2019, 4:25pm

(off topic a bit) for a really interesting deep dive on RegEx performance (and how it took out most of the internet last month) read https://new.blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/ – the RegEx stuff is towards the bottom of the page, and has some neat animations.

Martin_T · August 5, 2019, 10:23am

That’s nice and good to understand. I’ve found a little typing error. m.Size needs to be mb.Size.

Now all I have to do is compare the time that both variants take into account with each other. But my mind tells me that the direct manipulation via the MemoryBlock should be much faster. Any objections?

What strikes me about your MemoryBlock algorithm is that it always returns the words with spaces together in the matches array. In fact, I need an array that returns every single space and every word without spaces before (for the first word of a paragraph) and after.

Good catch @Kem Tekinay ! I modified the pattern to

(\\s*)(\\S+)(\\s*)

\\w and the Unicode pattern \\pL won’t match Emojis, but \\S+ does! [quote=448853:@Norman Palardy]certainly Mr O’Brien would dislike those patterns

(\\w+)(\\s*)

with “If Mr O’Brien” give me 3 matches
gotta love languages[/quote]
Thank you, Norman. Your pattern does not find the apostrophes. I stick with Kem’s pattern.

Norman_Palardy · August 5, 2019, 3:47pm

That pattern was Kem’s suggestion[quote=448852:@Kem Tekinay]Right, that’s how the pattern is written. Otherwise, use “\w” or, better, “\pL” instead of “\S”.[/quote]

Kem_Tekinay · August 5, 2019, 4:27pm

That should be a fairly simple tweak to the loop so it groups whitespace together, then non-whitespace together. I think you’ll only need two stages, and you’ll record the match before you need to switch stages.

Martin_T · September 8, 2019, 2:41pm

Hi @Kem Tekinay ,

i’ve just found in tests that the pattern code(\S+)(\s*)[/code] doesn’t match a string that only consists of a whitespace (s = " "). This is very important in my case. How do I modify the pattern?

Kem_Tekinay · September 9, 2019, 1:59pm

That pattern says, “give me some non-white space. If it’s surrounded by white space, that’s cool too.” This is incompatible with the new requirement of “only white space”, so we’ll need an OR (an alternator).

(\\s*)(\\S+)(\\s*)|(\\s+)

If something that is not white space captured, there will be three groups and that something will be in group 2. If ONLY white space is captured, there will be four groups and only group 4 will have content.

Kem_Tekinay · September 9, 2019, 2:01pm

BTW, to test this properly, I used v.1.9’s new ability to virtually split source text into multiple inputs, so my source looks like this:

this and that<<RegExRX_Split>>
   <<RegExRX_Split>>
 and this<<RegExRX_Split>>
and that   <<RegExRX_Split>>
this

The beta is at:

http://www.mactechnologies.com/downloads

Martin_T · September 10, 2019, 9:41pm

Thanks @Kem Tekinay for your suggestion. I have now helped myself differently by first checking whether a string is <> “” and then using LTrim I check whether “” remains. So I find out if the string consists of only one or more spaces.

Martin_T · November 17, 2019, 3:30pm

I take up the subject again, because I am not quite there yet. The graphic below shows which parts should be recognized. How does Kem’s MemoryBlock algorithm have to be modified to recognize these parts? So I want a processing within the algorithm for single whitespaces and for words.

JensK · November 17, 2019, 4:10pm

While I like using MemoryBlocks because of performance, you have to keep in mind that non-ASCII characters take more than one byte in a MemoryBlock, which can complicate things depending on what you intend to do with it. (I had to deal with this problem for example when processing German texts with , and and Ancient Greek is even more complicated as a MemoryBlock )

Martin_T · November 17, 2019, 4:21pm

I changed Kem’s code a bit and it works fine with Unicode strings after defining the UTF-8 encoding. I also used it with extremely large strings and it works very, very fast. That’s exactly what I want.

[code]Dim mb As MemoryBlock = myText
Dim p As Ptr = mb
Dim matchStart, matchEnd As Integer
Dim matches() As String
Dim stage, testchar As UInt8
Dim isWhitespace As Boolean

While matchEnd < mb.Size

testchar = p.UInt8(matchEnd)
isWhitespace = (testchar = 9) Or (testchar = 10) Or (testchar = 13) Or (testchar = 32)

Select Case stage
Case 0 // Just started
If Not isWhitespace Then
stage = 1
Else
matches.Append(" ")
matchStart = matchEnd + 1
End If

Case 1 // We were in some text
If isWhitespace Then
stage = 2
End If

Case 2 // In whitespace after the text
If Not isWhitespace Then
matches.Append(mb.StringValue(matchStart, matchEnd - matchStart).DefineEncoding(Encodings.UTF8))
matchStart = matchEnd
stage = 1
End If

End Select

matchEnd = matchEnd + 1
Wend

If matchStart <= (mb.Size - 1) Then
matches.Append(mb.StringValue(matchStart, mb.Size - matchStart).DefineEncoding(Encodings.UTF8))
End If[/code]
At the moment I get the following instead of the results as in my previous pattern graphic: