Looking for a faster, alternative MemoryBlock algorithm

  1. 11 months ago

    Martin T

    4 Aug 2019 Testers Germany
    Edited 11 months ago

    Hi, I have the following RegEx pattern which splits a text into words.

    (\s*)?([^\s]+)(\s*)?

    The pattern recognizes spaces before and after a word. It works well and returns exactly the results I want to get.

    Since I might want to split very long texts, I wonder if using this RegEx pattern is the best option, or if another algorithm would be better. For a text consisting of 50000 words the RegEx pattern needs 37,119 ms according to @Kem T RegExRX-App . Do you have an idea/approach for another algorithm and, what could such an algorithm look like?

    Thank you very much for your feedback.

    If you're interested, here's the highly optimised algorithm that works exactly as desired with all UFT-8 character sets. Of course you can add other whitespace characters than 9, 10, 13 and 32, but the current ones are the most common.

    Further optimization suggestions are welcome.

    Private Sub Split(myText As String) As String()
      #Pragma BackgroundTasks False
      #Pragma BoundsChecking False
      #Pragma NilObjectChecking False
      #Pragma StackOverflowChecking False
    
      Dim matches() As String
    
      If myText = "" Then
        Return matches
      End If
    
      Dim mb As MemoryBlock = myText
      Dim p As Ptr = mb
    
      Dim matchLength, matchEnd, limit As Integer
      Dim testchar As UInt8
      Dim isWhitespace As Boolean
      Dim encoding as TextEncoding = Encodings.UTF8
    
      limit = mb.Size - 1
    
      For matchEnd = 0 To limit
      
        testchar = p.UInt8(matchEnd)
        isWhitespace = (testchar = 32) Or (testchar = 13) Or (testchar = 10) Or (testchar = 9)
      
        If isWhitespace Then
        
          If matchLength > 0 Then      
            matches.AddRow(mb.StringValue(matchEnd - matchLength, matchLength).DefineEncoding(encoding))
            matchLength = 0 ' Reset matchLength      
          End If
        
          matches.AddRow(" ")
       
        Else    
          matchLength = matchLength + 1    
        End If
      
      Next
    
      If matchLength > 0 Then  
        matches.AddRow(mb.StringValue(matchEnd - matchLength, matchLength).DefineEncoding(encoding))
      End If
    
      Return matches
    
    End Sub
  2. Greg O

    4 Aug 2019 Xojo Inc scout.galaxy.barn

    I take it Split( s, “ “ ) isn’t working for you?

  3. Martin T

    4 Aug 2019 Testers Germany
    Edited 11 months ago

    No it won't work for me, because it deletes some whitespaces at the results array. I think in this case an algorithm really has to go through every single character and follow a pattern. Like I said, I don't know if it's really faster.

  4. Kem T

    4 Aug 2019 Testers, Xojo Pro, XDC Speakers, MVP Connecticut

    Keep in mind that RegExRX uses the MBS plugin as its engine. The built-in RegEx class is, unfortunately, much, much slower for this type of work.

    Your pattern can be rewritten like this:

    (\s*)(\S+)(\s*)

    Keep in mind that this will only capture leading white space in the very first occurrence. All subsequence matches will only include trailing white space because its leading white space was captured by the previous match. For consistency, you might consider this:

    (\S+)(\s*)

    Having said all that, it might be faster to use a MemoryBlock to scan for the right bytes using a Ptr.

  5. Martin T

    4 Aug 2019 Testers Germany

    Thanks Kem for your reply. I forgot to add, that it will only match the whitespaces before a word for the very first word.

    To your suggestion, how could i handle this with memoryblocks? I have no experiences with them. Any tips please?

  6. Norman P

    4 Aug 2019 Testers, Xojo Pro outside admiring the sunshine,...

    and it doesnt take into account things like punctuation so "foo,bar" is a "word" because there is no white space

  7. Kem T

    4 Aug 2019 Testers, Xojo Pro, XDC Speakers, MVP Connecticut
    Edited 11 months ago

    Off the top of my head and untested:

    dim mb as MemoryBlock = myText
    dim p as Ptr = mb
    
    dim matchStart as integer
    dim matchEnd as integer
    
    dim matches() as string
    dim stage as integer
    
    while matchEnd < m.Size
      dim isWhitespace as boolean
      select case p.Byte( matchEnd )
      case 9, 10, 13, 32
        isWhitespace = true
      end select
      
      select case stage
      case 0 // Just started
        if not isWhitespace then
          stage = 1
        end if
    
      case 1 // We were in some text
        if isWhitespace then
          stage = 2
        end if
    
      case 2 // In whitespace after the text
        if not isWhitespace then
          matches.Append mb.StringValue( matchStart, matchEnd - matchStart )
          matchStart = matchEnd
          stage = 1
        end if
    
      end select
      
      matchEnd = matchEnd + 1
    wend
    
    if matchStart <= ( mb.Size - 1 ) then
      matches.Append mb.StringValue( matchStart, mb.Size - matchStart )
    end if
  8. Kem T

    4 Aug 2019 Testers, Xojo Pro, XDC Speakers, MVP Connecticut

    @Norman P and it doesnt take into account things like punctuation so "foo,bar" is a "word" because there is no white space

    Right, that's how the pattern is written. Otherwise, use "\w" or, better, "\pL" instead of "\S".

  9. Norman P

    4 Aug 2019 Testers, Xojo Pro outside admiring the sunshine,...
    Edited 11 months ago

    certainly Mr O'Brien would dislike those patterns :)

    (\w+)(\s*)

    with "If Mr O'Brien" give me 3 matches
    gotta love languages

  10. Michael D

    4 Aug 2019 Testers, Xojo Pro
    Edited 11 months ago

    (off topic a bit) for a really interesting deep dive on RegEx performance (and how it took out most of the internet last month) read https://new.blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/ -- the RegEx stuff is towards the bottom of the page, and has some neat animations.

  11. Martin T

    5 Aug 2019 Testers Germany
    Edited 11 months ago

    @Kem T while matchEnd < m.Size

    That's nice and good to understand. I've found a little typing error. m.Size needs to be mb.Size.

    Now all I have to do is compare the time that both variants take into account with each other. But my mind tells me that the direct manipulation via the MemoryBlock should be much faster. Any objections?

    What strikes me about your MemoryBlock algorithm is that it always returns the words with spaces together in the matches array. In fact, I need an array that returns every single space and every word without spaces before (for the first word of a paragraph) and after.

    @Kem T Right, that's how the pattern is written. Otherwise, use "\w" or, better, "\pL" instead of "\S".

    Good catch @Kem T ! I modified the pattern to

    (\s*)(\S+)(\s*)

    \w and the Unicode pattern \pL won't match Emojis, but \S+ does!

    @Norman P certainly Mr O'Brien would dislike those patterns :)

    (\w+)(\s*)

    with "If Mr O'Brien" give me 3 matches
    gotta love languages

    Thank you, Norman. Your pattern does not find the apostrophes. I stick with Kem's pattern.

  12. Norman P

    5 Aug 2019 Testers, Xojo Pro outside admiring the sunshine,...

    That pattern was Kem's suggestion

    @Kem T Right, that's how the pattern is written. Otherwise, use "\w" or, better, "\pL" instead of "\S".

  13. Kem T

    5 Aug 2019 Testers, Xojo Pro, XDC Speakers, MVP Connecticut

    @MartinTrippensee What strikes me about your MemoryBlock algorithm is that it always returns the words with spaces together in the matches array. In fact, I need an array that returns every single space and every word without spaces before (for the first word of a paragraph) and after

    That should be a fairly simple tweak to the loop so it groups whitespace together, then non-whitespace together. I think you'll only need two stages, and you'll record the match before you need to switch stages.

  14. 10 months ago

    Martin T

    8 Sep 2019 Testers Germany
    Edited 10 months ago

    Hi @Kem T ,

    i've just found in tests that the pattern (\s*)(\S+)(\s*) doesn't match a string that only consists of a whitespace (s = " "). This is very important in my case. How do I modify the pattern?

  15. Kem T

    9 Sep 2019 Testers, Xojo Pro, XDC Speakers, MVP Connecticut
    Edited 10 months ago

    That pattern says, "give me some non-white space. If it's surrounded by white space, that's cool too." This is incompatible with the new requirement of "only white space", so we'll need an OR (an alternator).

    (\s*)(\S+)(\s*)|(\s+)

    If something that is not white space captured, there will be three groups and that something will be in group 2. If ONLY white space is captured, there will be four groups and only group 4 will have content.

  16. Kem T

    9 Sep 2019 Testers, Xojo Pro, XDC Speakers, MVP Connecticut

    BTW, to test this properly, I used v.1.9's new ability to virtually split source text into multiple inputs, so my source looks like this:

    this and that<<RegExRX_Split>>
       <<RegExRX_Split>>
     and this<<RegExRX_Split>>
    and that   <<RegExRX_Split>>
    this

    The beta is at:

    http://www.mactechnologies.com/downloads

  17. Martin T

    10 Sep 2019 Testers Germany

    Thanks @Kem T for your suggestion. I have now helped myself differently by first checking whether a string is <> "" and then using LTrim I check whether "" remains. So I find out if the string consists of only one or more spaces.

  18. 8 months ago

    Martin T

    17 Nov 2019 Testers Germany

    @Kem T Off the top of my head and untested:

    I take up the subject again, because I am not quite there yet. The graphic below shows which parts should be recognized. How does Kem's MemoryBlock algorithm have to be modified to recognize these parts? So I want a processing within the algorithm for single whitespaces and for words.
    -image-

  19. Jens K

    17 Nov 2019 near Cologne, Germany

    While I like using MemoryBlocks because of performance, you have to keep in mind that non-ASCII characters take more than one byte in a MemoryBlock, which can complicate things depending on what you intend to do with it. (I had to deal with this problem for example when processing German texts with ä, ö and ü – and Ancient Greek is even more complicated as a MemoryBlock …)

  20. Martin T

    17 Nov 2019 Testers Germany
    Edited 8 months ago

    I changed Kem's code a bit and it works fine with Unicode strings after defining the UTF-8 encoding. I also used it with extremely large strings and it works very, very fast. That's exactly what I want.

    Dim mb As MemoryBlock = myText
    Dim p As Ptr = mb
    Dim matchStart, matchEnd As Integer
    Dim matches() As String
    Dim stage, testchar As UInt8
    Dim isWhitespace As Boolean
    
    While matchEnd < mb.Size
      
      testchar = p.UInt8(matchEnd)
      isWhitespace = (testchar = 9) Or (testchar = 10) Or (testchar = 13) Or (testchar = 32)
      
      Select Case stage
      Case 0 // Just started
        If Not isWhitespace Then
          stage = 1
        Else
          matches.Append(" ")
          matchStart = matchEnd + 1
        End If
        
      Case 1 // We were in some text
        If isWhitespace Then
          stage = 2
        End If
        
      Case 2 // In whitespace after the text    
        If Not isWhitespace Then
          matches.Append(mb.StringValue(matchStart, matchEnd - matchStart).DefineEncoding(Encodings.UTF8))
          matchStart = matchEnd
          stage = 1
        End If
     
      End Select
      
      matchEnd = matchEnd + 1
    Wend
    
    If matchStart <= (mb.Size - 1) Then
      matches.Append(mb.StringValue(matchStart, mb.Size - matchStart).DefineEncoding(Encodings.UTF8))
    End If

    At the moment I get the following instead of the results as in my previous pattern graphic:
    -image-
    -image-

  21. Newer ›

or Sign Up to reply!