Looking for a faster, alternative algorithm

  1. 2 weeks ago

    Martin T

    Aug 4 Pre-Release Testers Germany
    Edited 2 weeks ago

    Hi, I have the following RegEx pattern which splits a text into words.

    (\s*)?([^\s]+)(\s*)?

    The pattern recognizes spaces before and after a word. It works well and returns exactly the results I want to get.

    Since I might want to split very long texts, I wonder if using this RegEx pattern is the best option, or if another algorithm would be better. For a text consisting of 50000 words the RegEx pattern needs 37,119 ms according to @Kem T RegExRX-App . Do you have an idea/approach for another algorithm and, what could such an algorithm look like?

    Thank you very much for your feedback.

  2. Greg O

    Aug 4 Xojo Inc

    I take it Split( s, “ “ ) isn’t working for you?

  3. Martin T

    Aug 4 Pre-Release Testers Germany
    Edited 2 weeks ago

    No it won't work for me, because it deletes some whitespaces at the results array. I think in this case an algorithm really has to go through every single character and follow a pattern. Like I said, I don't know if it's really faster.

  4. Kem T

    Aug 4 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut

    Keep in mind that RegExRX uses the MBS plugin as its engine. The built-in RegEx class is, unfortunately, much, much slower for this type of work.

    Your pattern can be rewritten like this:

    (\s*)(\S+)(\s*)

    Keep in mind that this will only capture leading white space in the very first occurrence. All subsequence matches will only include trailing white space because its leading white space was captured by the previous match. For consistency, you might consider this:

    (\S+)(\s*)

    Having said all that, it might be faster to use a MemoryBlock to scan for the right bytes using a Ptr.

  5. Martin T

    Aug 4 Pre-Release Testers Germany

    Thanks Kem for your reply. I forgot to add, that it will only match the whitespaces before a word for the very first word.

    To your suggestion, how could i handle this with memoryblocks? I have no experiences with them. Any tips please?

  6. Norman P

    Aug 4 Pre-Release Testers, Xojo Pro great-white-software.com/blog

    and it doesnt take into account things like punctuation so "foo,bar" is a "word" because there is no white space

  7. Kem T

    Aug 4 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut
    Edited 2 weeks ago

    Off the top of my head and untested:

    dim mb as MemoryBlock = myText
    dim p as Ptr = mb
    
    dim matchStart as integer
    dim matchEnd as integer
    
    dim matches() as string
    dim stage as integer
    
    while matchEnd < m.Size
      dim isWhitespace as boolean
      select case p.Byte( matchEnd )
      case 9, 10, 13, 32
        isWhitespace = true
      end select
      
      select case stage
      case 0 // Just started
        if not isWhitespace then
          stage = 1
        end if
    
      case 1 // We were in some text
        if isWhitespace then
          stage = 2
        end if
    
      case 2 // In whitespace after the text
        if not isWhitespace then
          matches.Append mb.StringValue( matchStart, matchEnd - matchStart )
          matchStart = matchEnd
          stage = 1
        end if
    
      end select
      
      matchEnd = matchEnd + 1
    wend
    
    if matchStart <= ( mb.Size - 1 ) then
      matches.Append mb.StringValue( matchStart, mb.Size - matchStart )
    end if
  8. Kem T

    Aug 4 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut

    @Norman P and it doesnt take into account things like punctuation so "foo,bar" is a "word" because there is no white space

    Right, that's how the pattern is written. Otherwise, use "\w" or, better, "\pL" instead of "\S".

  9. Norman P

    Aug 4 Pre-Release Testers, Xojo Pro great-white-software.com/blog
    Edited 2 weeks ago

    certainly Mr O'Brien would dislike those patterns :)

    (\w+)(\s*)

    with "If Mr O'Brien" give me 3 matches
    gotta love languages

  10. Michael D

    Aug 4 Pre-Release Testers, Xojo Pro
    Edited 2 weeks ago

    (off topic a bit) for a really interesting deep dive on RegEx performance (and how it took out most of the internet last month) read https://new.blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/ -- the RegEx stuff is towards the bottom of the page, and has some neat animations.

  11. Martin T

    Aug 5 Pre-Release Testers Germany
    Edited 2 weeks ago

    @Kem T while matchEnd < m.Size

    That's nice and good to understand. I've found a little typing error. m.Size needs to be mb.Size.

    Now all I have to do is compare the time that both variants take into account with each other. But my mind tells me that the direct manipulation via the MemoryBlock should be much faster. Any objections?

    What strikes me about your MemoryBlock algorithm is that it always returns the words with spaces together in the matches array. In fact, I need an array that returns every single space and every word without spaces before (for the first word of a paragraph) and after.

    @Kem T Right, that's how the pattern is written. Otherwise, use "\w" or, better, "\pL" instead of "\S".

    Good catch @Kem T ! I modified the pattern to

    (\s*)(\S+)(\s*)

    \w and the Unicode pattern \pL won't match Emojis, but \S+ does!

    @Norman P certainly Mr O'Brien would dislike those patterns :)

    (\w+)(\s*)

    with "If Mr O'Brien" give me 3 matches
    gotta love languages

    Thank you, Norman. Your pattern does not find the apostrophes. I stick with Kem's pattern.

  12. Norman P

    Aug 5 Pre-Release Testers, Xojo Pro great-white-software.com/blog

    That pattern was Kem's suggestion

    @Kem T Right, that's how the pattern is written. Otherwise, use "\w" or, better, "\pL" instead of "\S".

  13. Kem T

    Aug 5 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut

    @MartinTrippensee What strikes me about your MemoryBlock algorithm is that it always returns the words with spaces together in the matches array. In fact, I need an array that returns every single space and every word without spaces before (for the first word of a paragraph) and after

    That should be a fairly simple tweak to the loop so it groups whitespace together, then non-whitespace together. I think you'll only need two stages, and you'll record the match before you need to switch stages.

or Sign Up to reply!