Hi, I have the following RegEx pattern which splits a text into words.
(\\s*)?([^\\s]+)(\\s*)?
The pattern recognizes spaces before and after a word. It works well and returns exactly the results I want to get.
Since I might want to split very long texts, I wonder if using this RegEx pattern is the best option, or if another algorithm would be better. For a text consisting of 50000 words the RegEx pattern needs 37,119 ms according to @Kem Tekinay RegExRX-App. Do you have an idea/approach for another algorithm and, what could such an algorithm look like?
No it won’t work for me, because it deletes some whitespaces at the results array. I think in this case an algorithm really has to go through every single character and follow a pattern. Like I said, I don’t know if it’s really faster.
Keep in mind that RegExRX uses the MBS plugin as its engine. The built-in RegEx class is, unfortunately, much, much slower for this type of work.
Your pattern can be rewritten like this:
(\\s*)(\\S+)(\\s*)
Keep in mind that this will only capture leading white space in the very first occurrence. All subsequence matches will only include trailing white space because its leading white space was captured by the previous match. For consistency, you might consider this:
(\\S+)(\\s*)
Having said all that, it might be faster to use a MemoryBlock to scan for the right bytes using a Ptr.
dim mb as MemoryBlock = myText
dim p as Ptr = mb
dim matchStart as integer
dim matchEnd as integer
dim matches() as string
dim stage as integer
while matchEnd < m.Size
dim isWhitespace as boolean
select case p.Byte( matchEnd )
case 9, 10, 13, 32
isWhitespace = true
end select
select case stage
case 0 // Just started
if not isWhitespace then
stage = 1
end if
case 1 // We were in some text
if isWhitespace then
stage = 2
end if
case 2 // In whitespace after the text
if not isWhitespace then
matches.Append mb.StringValue( matchStart, matchEnd - matchStart )
matchStart = matchEnd
stage = 1
end if
end select
matchEnd = matchEnd + 1
wend
if matchStart <= ( mb.Size - 1 ) then
matches.Append mb.StringValue( matchStart, mb.Size - matchStart )
end if
That’s nice and good to understand. I’ve found a little typing error. m.Size needs to be mb.Size.
Now all I have to do is compare the time that both variants take into account with each other. But my mind tells me that the direct manipulation via the MemoryBlock should be much faster. Any objections?
What strikes me about your MemoryBlock algorithm is that it always returns the words with spaces together in the matches array. In fact, I need an array that returns every single space and every word without spaces before (for the first word of a paragraph) and after.
Good catch @Kem Tekinay ! I modified the pattern to
(\\s*)(\\S+)(\\s*)
\\w and the Unicode pattern \\pL won’t match Emojis, but \\S+ does! [quote=448853:@Norman Palardy]certainly Mr O’Brien would dislike those patterns
(\\w+)(\\s*)
with “If Mr O’Brien” give me 3 matches
gotta love languages[/quote]
Thank you, Norman. Your pattern does not find the apostrophes. I stick with Kem’s pattern.
That pattern was Kem’s suggestion[quote=448852:@Kem Tekinay]Right, that’s how the pattern is written. Otherwise, use “\w” or, better, “\pL” instead of “\S”.[/quote]
That should be a fairly simple tweak to the loop so it groups whitespace together, then non-whitespace together. I think you’ll only need two stages, and you’ll record the match before you need to switch stages.
i’ve just found in tests that the pattern code(\S+)(\s*)[/code] doesn’t match a string that only consists of a whitespace (s = " "). This is very important in my case. How do I modify the pattern?
That pattern says, “give me some non-white space. If it’s surrounded by white space, that’s cool too.” This is incompatible with the new requirement of “only white space”, so we’ll need an OR (an alternator).
(\\s*)(\\S+)(\\s*)|(\\s+)
If something that is not white space captured, there will be three groups and that something will be in group 2. If ONLY white space is captured, there will be four groups and only group 4 will have content.
Thanks @Kem Tekinay for your suggestion. I have now helped myself differently by first checking whether a string is <> “” and then using LTrim I check whether “” remains. So I find out if the string consists of only one or more spaces.
I take up the subject again, because I am not quite there yet. The graphic below shows which parts should be recognized. How does Kem’s MemoryBlock algorithm have to be modified to recognize these parts? So I want a processing within the algorithm for single whitespaces and for words.
While I like using MemoryBlocks because of performance, you have to keep in mind that non-ASCII characters take more than one byte in a MemoryBlock, which can complicate things depending on what you intend to do with it. (I had to deal with this problem for example when processing German texts with , and and Ancient Greek is even more complicated as a MemoryBlock )
I changed Kem’s code a bit and it works fine with Unicode strings after defining the UTF-8 encoding. I also used it with extremely large strings and it works very, very fast. That’s exactly what I want.
[code]Dim mb As MemoryBlock = myText
Dim p As Ptr = mb
Dim matchStart, matchEnd As Integer
Dim matches() As String
Dim stage, testchar As UInt8
Dim isWhitespace As Boolean
While matchEnd < mb.Size
testchar = p.UInt8(matchEnd)
isWhitespace = (testchar = 9) Or (testchar = 10) Or (testchar = 13) Or (testchar = 32)
Select Case stage
Case 0 // Just started
If Not isWhitespace Then
stage = 1
Else
matches.Append(" ")
matchStart = matchEnd + 1
End If
Case 1 // We were in some text
If isWhitespace Then
stage = 2
End If
Case 2 // In whitespace after the text
If Not isWhitespace Then
matches.Append(mb.StringValue(matchStart, matchEnd - matchStart).DefineEncoding(Encodings.UTF8))
matchStart = matchEnd
stage = 1
End If
End Select
matchEnd = matchEnd + 1
Wend
If matchStart <= (mb.Size - 1) Then
matches.Append(mb.StringValue(matchStart, mb.Size - matchStart).DefineEncoding(Encodings.UTF8))
End If[/code]
At the moment I get the following instead of the results as in my previous pattern graphic: