more regex q's

trying to avoid having to write a full parser for some vb code but I may end up having to regardless

it seems that regex, even when set to be greedy, gives me different results depending on the order of some clauses

my source text is like

    While NextToken = True

with a regex like

((\\s|\\n|\\r)+)|(//(.*)(\\r|\\n)*)|(dim|as|integer|for|to|print|next|sub|function|end\\s*sub|end\\s*function)|([a-zA-Z]+)|(([+-]?[0-9]+(.[0-9]+)?)|[+-]?.[0-9]+)|([-+/*=]|\\<\\=|\\>\\=|\\<\\>)|("([^"]*)")

I get matches for
while
next
to
ken

true

which is NOT what I expect at all

if I switch the position of one portion - the |([a-zA-Z]+) - around so my regex is like

((\\s|\\n|\\r)+)|([a-zA-Z]+)|(//(.*)(\\r|\\n)*)|(dim|as|integer|for|to|print|next|sub|function|end\\s*sub|end\\s*function)|(([+-]?[0-9]+(.[0-9]+)?)|[+-]?.[0-9]+)|([-+/*=]|\\<\\=|\\>\\=|\\<\\>)|("([^"]*)")

I get matches for
while
nexttoken

true

the options for the regex ARE set to greedy

I would not have suspected that order of the optional pieces would matter but it apparently does

And I don't have a decent way to know WHEN it will matter …
suggestions ?

maybe I have to attack this whole mess differently

Let’s simplify the regex to illustrate the point.
Source text:

   While NextToken = True

Consider this regex:

It will match:

  1. The spaces at the start of the text.
  2. While
  3. The space between While and NextToken
  4. NextToken
  5. The space between NextToken and =
  6. The space between = and True
  7. True

Whenever it matches, it moves down the text to the end of the last match.

Consider when the starting point has moved to just before NextToken.

The order in an alternative list DOES matter.
It first tries the spaces alternative. That does not match.
Then it tries the letters alternative, [a-zA-Z]+. It matches the entire string NextToken. IT IS DONE. It does NOT go on to evaluate every alternative in the list to find out whether there is a longer match. It does not matter whether it is greedy or not.
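
A minimal sketch of that behaviour in Python, whose re module also tries alternatives from left to right. The pattern here, whitespace followed by a run of letters, is only an assumption that is consistent with the seven matches listed above; the variable names are made up for illustration.

```python
import re

# Assumed simplified pattern: the whitespace alternative first,
# then the general "run of letters" alternative.
pattern = re.compile(r"(\s+)|([a-zA-Z]+)")

source = "   While NextToken = True"

# finditer resumes scanning at the end of the previous match.
# The "=" matches neither alternative, so it is skipped.
matches = [m.group(0) for m in pattern.finditer(source)]
print(matches)
```

With only these two alternatives there is nothing that can claim the first few letters of NextToken, so it comes out as one token.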

Now this regex:

Consider when the starting point has moved to just before NextToken.

It first tries the spaces alternative. That does not match.
It then tries the to. That does not match.
It then tries the next. That does match. It matches the first four letters of NextToken, i.e. Next. IT IS DONE. It does NOT go on to evaluate every alternative in the list to find out whether there is a longer match somewhere. If it went on to try the next alternative, [a-zA-Z]+, it would find a longer match, NextToken. But it does not do so. When it has found a match it is done.
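
The same walk-through as a small Python sketch. The pattern is an assumed stand-in with just the two keywords to and next placed ahead of the letters alternative, and case-insensitive matching is assumed as well since the source text is mixed case.

```python
import re

# Assumed pattern: keywords come BEFORE the general letters alternative.
pattern = re.compile(r"(\s+)|(to|next)|([a-zA-Z]+)", re.IGNORECASE)

# At each position the first alternative that matches wins,
# even when a later alternative would match more text.
pieces = [m.group(0) for m in pattern.finditer("NextToken")]
print(pieces)
```

The identifier is cut into Next, To and ken, which is the same split reported in the original question.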

There may be regex engine variants that actually DO go out and evaluate every alternative to find the longest match, but I have not worked with them.

I do not know precisely what you are trying to accomplish, but you may have use for word boundaries (\b)

If you applied this, the matches would be

  1. The spaces at the start of the text.
  2. While
  3. The space between While and NextToken
  4. NextToken
  5. The space between NextToken and =
  6. The space between = and True
  7. True
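
A hedged Python sketch of the word-boundary idea. The pattern is an assumption (the same reduced keyword list as before, now wrapped in \b), not the exact regex from this thread.

```python
import re

# Assumed pattern: keywords only count when they are whole words.
pattern = re.compile(r"(\s+)|\b(to|next)\b|([a-zA-Z]+)", re.IGNORECASE)

source = "   While NextToken = True"
tokens = [m.group(0) for m in pattern.finditer(source)]
print(tokens)

# A standalone keyword is still caught by the keyword group (group 2).
standalone = pattern.fullmatch("next")
print(standalone.group(2))
```

Inside NextToken the keyword alternative finds Next, but the trailing \b fails because the following T is also a word character, so the engine falls through to the letters alternative and takes the whole identifier.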

The order in the alternation absolutely matters, and Greedy is not involved. Place the longer strings first, and if they represent entire words, surround the entire clause with word boundary anchors, like this:

\\b(word1|word|wor|wo|w|1)\\b
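
One way to avoid ordering the list by hand is to build it, sketched here in Python. The helper name keyword_pattern is hypothetical, and it assumes the keywords are plain literal strings; it would not work unchanged for entries that are themselves regex fragments such as end\s*sub.

```python
import re

def keyword_pattern(keywords):
    """Build \\b(...)\\b with the longest literal keywords listed first."""
    # Sort by length, longest first, following the ordering advice above.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape keeps any special characters in a keyword literal.
    return r"\b(" + "|".join(re.escape(k) for k in ordered) + r")\b"

kw = keyword_pattern(["w", "wo", "wor", "word", "word1"])
print(kw)

hit = re.match(kw, "word1 = 5")
print(hit.group(0))
```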

yeah I have no idea which patterns will match longer strings so ordering mattering sucks

back to the drawing board

Why doesn’t Kem’s suggestion work? Put all the specific words first in the alternative list, in reverse-alphabetical order and surrounded by \b.

((\s|\n|\r)+)|(//(.*)(\r|\n)*)|(dim|as|integer|for|to|print|next|sub|function|end\s*sub|end\s*function)|([a-zA-Z]+)|(([+-]?[0-9]+(.[0-9]+)?)|[+-]?.[0-9]+)|([-+/*=]|\<\=|\>\=|\<\>)|("([^"]*)")

So if you combine Kem’s word example with the words that you had in your regex search the order would be:

word1
word
wor
wo
w
to
sub
print
next
integer
function
for
end\s*sub
end\s*function
dim
as

All of these are surrounded by the word delimiter (\b). They are placed in the beginning of the alternative list.

I’ve implemented a different mechanism and now this is no longer an issue