Trying to avoid having to write a full parser for some VB code, but I may end up having to regardless.
It seems that regex, even when set to be greedy, gives me different results depending on the order of some clauses.
my source text is like
While NextToken = True
with a regex like
((\\s|\\n|\\r)+)|(//(.*)(\\r|\\n)*)|(dim|as|integer|for|to|print|next|sub|function|end\\s*sub|end\\s*function)|([a-zA-Z]+)|(([+-]?[0-9]+(.[0-9]+)?)|[+-]?.[0-9]+)|([-+/*=]|\\<\\=|\\>\\=|\\<\\>)|("([^"]*)")
I get matches for
while
next
to
ken
true
which is NOT what I expect at all
if I switch the position of one portion - the |([a-zA-Z]+) - around so my regex is like
((\\s|\\n|\\r)+)|([a-zA-Z]+)|(//(.*)(\\r|\\n)*)|(dim|as|integer|for|to|print|next|sub|function|end\\s*sub|end\\s*function)|(([+-]?[0-9]+(.[0-9]+)?)|[+-]?.[0-9]+)|([-+/*=]|\\<\\=|\\>\\=|\\<\\>)|("([^"]*)")
I get matches for
while
nexttoken
true
the options for the regex ARE set to greedy
I would not have suspected that the order of the alternatives would matter, but it apparently does.
And I don't have a decent way to know WHEN it will matter …
suggestions ?
maybe I have to attack this whole mess differently
Let’s simplify the regex to illustrate the point.
Source text:
While NextToken = True
Consider this regex:
It will match:
- The spaces at the start of the text.
- While
- The space between While and NextToken
- NextToken
- The space between NextToken and =
- The space between = and True
- True
Whenever it matches, it moves down the text to the end of the last match.
Consider when the starting point has moved to just before NextToken.
The order in an alternative list DOES matter.
It first tries the spaces alternative. That does not match.
Then it tries the letters alternative (the [a-zA-Z]+ part in the question). It matches the entire string NextToken. IT IS DONE. It does NOT go on to evaluate every alternative in the list to find out whether there is a longer match. It does not matter whether it is greedy or not.
Now this regex:
Consider when the starting point has moved to just before NextToken.
It first tries the spaces alternative. That does not match.
It then tries the "to" alternative. That does not match.
It then tries the "next" alternative. That does match. It matches the first four letters of NextToken, i.e. Next. IT IS DONE. It does NOT go on to evaluate every alternative in the list to find out whether there is a longer match somewhere. If it went on to try the following alternative, the letters one, it would find a longer match, NextToken. But it does not do so. When it has found a match it is done.
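To make the walkthrough concrete, here is a small sketch in Python, whose re module also tries alternatives strictly left to right. The two patterns are simplified stand-ins I made up for illustration (spaces, two keywords, letters), not the full regex from the question, and the helper name words is hypothetical.

```python
import re

text = "While NextToken = True"

# Keywords listed before the generic letters alternative.
keywords_first = re.compile(r"\s+|to|next|[a-zA-Z]+", re.IGNORECASE)
# Generic letters alternative listed before the keywords.
letters_first = re.compile(r"\s+|[a-zA-Z]+|to|next", re.IGNORECASE)

def words(pattern, s):
    # Collect every non-whitespace match, in the order the engine finds them.
    return [m.group(0) for m in pattern.finditer(s) if not m.group(0).isspace()]

kw_result = words(keywords_first, text)
lf_result = words(letters_first, text)
print(kw_result)  # ['While', 'Next', 'To', 'ken', 'True']
print(lf_result)  # ['While', 'NextToken', 'True']
```

Same text, same alternatives, only the order differs, and the first ordering reproduces the while / next / to / ken / true split from the question.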
There may be regex engine variants that actually DO go out and evaluate every alternative to find the longest match, but I have not worked with them.
I do not know precisely what you are trying to accomplish, but you may have a use for word boundaries (\b).
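If the longest match is what you actually want, it can be emulated by hand: try every rule at the current position and keep the longest result. This is only a sketch of that idea in Python; the rule names, the patterns, and the function longest_match_tokens are all made up for illustration and are much smaller than the regex in the question.

```python
import re

# Hypothetical token rules; names and patterns are illustrative only.
RULES = [
    ("space", re.compile(r"\s+")),
    ("keyword", re.compile(r"to|next", re.IGNORECASE)),
    ("name", re.compile(r"[a-zA-Z]+")),
    ("op", re.compile(r"[-+/*=]")),
]

def longest_match_tokens(text):
    pos, out = 0, []
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, rx in RULES:
            m = rx.match(text, pos)  # anchored at pos, no scanning ahead
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_kind != "space":
            out.append((best_kind, text[pos:best_end]))
        pos = best_end
    return out

print(longest_match_tokens("While NextToken = True"))
print(longest_match_tokens("Next i"))
```

With this approach the order of the rules only matters for ties (Next alone is a keyword because the keyword rule comes first), so you no longer need to know in advance which pattern can produce the longer match.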
If you applied this, the matches would be
- The spaces at the start of the text.
- While
- The space between While and NextToken
- NextToken
- The space between NextToken and =
- The space between = and True
- True
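A quick way to check the effect of \b is a sketch like the following, again in Python and again with a cut-down pattern of my own (the group names space, keyword and name are assumptions, not anything from the original regex).

```python
import re

# Keywords still come first, but they must now be whole words.
bounded = re.compile(
    r"(?P<space>\s+)|\b(?P<keyword>to|next)\b|(?P<name>[a-zA-Z]+)",
    re.IGNORECASE,
)

def kinds(s):
    # Report which alternative matched each non-space token.
    return [(m.lastgroup, m.group(0))
            for m in bounded.finditer(s) if m.lastgroup != "space"]

print(kinds("While NextToken = True"))
print(kinds("Next i"))
```

Inside NextToken the "next" alternative still matches the first four letters, but the trailing \b fails between t and T, so the engine moves on to the letters alternative. A standalone Next is still reported as a keyword.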
The order in the alternation absolutely matters, and Greedy is not involved. Place the longer strings first, and if they represent entire words, surround the entire clause with word boundary anchors, like this:
\\b(word1|word|wor|wo|w|1)\\b
yeah I have no idea which patterns will match longer strings so ordering mattering sucks
back to the drawing board
Why doesn’t Kem’s suggestion work? Put all the specific words first in the alternative list, in reverse-alphabetical order and surrounded by \b.
((\s|\n|\r)+)|(//(.*)(\r|\n)*)|(dim|as|integer|for|to|print|next|sub|function|end\s*sub|end\s*function)|([a-zA-Z]+)|(([+-]?[0-9]+(.[0-9]+)?)|[+-]?.[0-9]+)|([-+/*=]|\<\=|\>\=|\<\>)|("([^"]*)")
So if you combine Kem’s word example with the words that you had in your regex search the order would be:
word1
word
wor
wo
w
to
sub
print
next
integer
function
for
end\s*sub
end\s*function
dim
as
All of these are surrounded by the word delimiter (\b). They are placed in the beginning of the alternative list.
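If keeping that order by hand is the sticking point, the keyword alternative can be generated. The sketch below is Python and makes some assumptions: the keywords are plain words, the space in a two-word keyword becomes \s* as in the original regex, and I sort by length (longest first) rather than reverse-alphabetically. The names KEYWORDS, fragment, TOKEN and tokenize are made up for illustration.

```python
import re

# Keyword list taken from the regex in the question.
KEYWORDS = ["dim", "as", "integer", "for", "to", "print", "next",
            "sub", "function", "end sub", "end function"]

def fragment(word):
    # Escape each part and allow optional whitespace between the parts.
    return r"\s*".join(re.escape(part) for part in word.split())

# Longest keywords first, so a shorter keyword can never shadow a longer one.
ordered = sorted(KEYWORDS, key=len, reverse=True)
keyword_rx = r"\b(?:" + "|".join(fragment(w) for w in ordered) + r")\b"

TOKEN = re.compile(
    r"(?P<space>\s+)|(?P<keyword>" + keyword_rx + r")|(?P<name>[a-zA-Z]+)",
    re.IGNORECASE,
)

def tokenize(s):
    # Return (kind, text) pairs, skipping whitespace.
    return [(m.lastgroup, m.group(0))
            for m in TOKEN.finditer(s) if m.lastgroup != "space"]

print(tokenize("Dim NextToken As Integer"))
print(tokenize("End Sub"))
print(tokenize("Subtotal"))
```

Adding a keyword then only means appending it to the list; the ordering and the word boundaries are taken care of when the pattern is built.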
I’ve implemented a different mechanism and now this is no longer an issue