more regex q's

Norman_Palardy · December 6, 2019, 8:19pm

trying to avoid having to write a full parser for some vb code but I may end up having to regardless

seems that regex, even when set to be greedy, gives me different results depending on the order of some clauses

my source text is like

    While NextToken = True

with a regex like

((\\s|\\n|\\r)+)|(//(.*)(\\r|\\n)*)|(dim|as|integer|for|to|print|next|sub|function|end\\s*sub|end\\s*function)|([a-zA-Z]+)|(([+-]?[0-9]+(.[0-9]+)?)|[+-]?.[0-9]+)|([-+/*=]|\\<\\=|\\>\\=|\\<\\>)|("([^"]*)")

I get matches for
while
next
to
ken

true

which is NOT what I expect at all

if I switch the position of one portion - the |([a-zA-Z]+) - around so my regex is like

((\\s|\\n|\\r)+)|([a-zA-Z]+)|(//(.*)(\\r|\\n)*)|(dim|as|integer|for|to|print|next|sub|function|end\\s*sub|end\\s*function)|(([+-]?[0-9]+(.[0-9]+)?)|[+-]?.[0-9]+)|([-+/*=]|\\<\\=|\\>\\=|\\<\\>)|("([^"]*)")

I get matches for
while
nexttoken

true

the options for the regex ARE set to greedy

I would not have suspected that order of the optional pieces would matter but it apparently does

And I don't have a decent way to necessarily know WHEN it will matter …
suggestions ?

maybe I have to attack this whole mess differently

Robert_Livingston · December 6, 2019, 9:30pm

Let’s simplify the regex to illustrate the point.
Source text:

   While NextToken = True

Consider this regex:

It will match:

The spaces at the start of the text.
While
The space between While and NextToken
NextToken
The space between NextToken and =
The space between = and True
True

Whenever it matches, it moves down the text to the end of the last match.

Consider when the starting point has moved to just before NextToken.

The order in an alternative list DOES matter.
It first tries the alternative for spaces. That does not match.
Then it tries the alternative for letters. It matches the entire string NextToken. IT IS DONE. It does NOT go on to evaluate every alternative in the list to find out whether there is a longer match. It does not matter whether it is greedy or not.
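A minimal Python sketch of the walk-through above. The pattern is a hypothetical simplification (whitespace, then any run of letters), not the exact regex from the post, and the helper name `tokens` is made up for illustration. Python's `re` module tries alternatives left to right in the same way.

```python
import re

# Hypothetical simplified pattern: whitespace first, then any run of letters.
LETTERS_ONLY = re.compile(r"(\s+)|([a-zA-Z]+)")

def tokens(pattern, text):
    """Return the non-whitespace matches in the order the engine finds them."""
    return [m.group(0) for m in pattern.finditer(text) if m.group(0).strip()]

# The "=" is skipped because no alternative in this pattern matches it.
print(tokens(LETTERS_ONLY, "   While NextToken = True"))
# → ['While', 'NextToken', 'True']
```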

Now this regex:

Consider when the starting point has moved to just before NextToken.

It first tries the alternative for spaces. That does not match.
It then tries the to. That does not match.
It then tries the next. That does match. It matches the first four letters of NextToken, i.e. Next. IT IS DONE. It does NOT go on to evaluate every alternative in the list to find out whether there is a longer match somewhere. If it went on to try the following alternative, the one for letters, it would find a longer match, NextToken. But it does not do so. When it has found a match it is done.
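The same effect can be reproduced in Python. Both patterns below are hypothetical reductions of the regex in the first post (only `to` and `next` are kept as keywords), and `words` is an illustrative helper. The only difference between the two patterns is where the letters alternative sits.

```python
import re

# Keywords are tried before the general letters alternative.
KEYWORDS_FIRST = re.compile(r"(\s+)|(to|next)|([a-zA-Z]+)", re.IGNORECASE)
# The letters alternative is tried before the keywords.
LETTERS_FIRST = re.compile(r"(\s+)|([a-zA-Z]+)|(to|next)", re.IGNORECASE)

def words(pattern, text):
    """Collect every non-whitespace match, in order."""
    return [m.group(0) for m in pattern.finditer(text) if m.group(0).strip()]

source = "While NextToken = True"
print(words(KEYWORDS_FIRST, source))  # → ['While', 'Next', 'To', 'ken', 'True']
print(words(LETTERS_FIRST, source))   # → ['While', 'NextToken', 'True']
```

The first result has the same shape as the while / next / to / ken / true output reported in the question.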

There may be regex engine variants that actually DO go out and evaluate every alternative to find the longest match, but I have not worked with them.
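Longest-match behaviour can also be emulated by hand when the engine does not offer it. The sketch below is an assumption about how that might look, not something proposed in the thread: each token class is a separate pattern, every one is tried at the current position, and the longest result wins. The list `ALTERNATIVES` and the function `longest_match_tokens` are hypothetical names.

```python
import re

# Hypothetical token classes; their order only breaks ties between equal lengths.
ALTERNATIVES = [
    re.compile(r"\s+"),
    re.compile(r"to|next", re.IGNORECASE),
    re.compile(r"[a-zA-Z]+"),
    re.compile(r"[-+/*=]"),
]

def longest_match_tokens(text):
    """Scan text, keeping the longest match among all alternatives at each position."""
    pos, out = 0, []
    while pos < len(text):
        candidates = [m for m in (p.match(text, pos) for p in ALTERNATIVES) if m]
        if not candidates:
            raise ValueError(f"no token matches at position {pos}")
        best = max(candidates, key=lambda m: m.end())  # first wins on ties
        if best.group(0).strip():                      # drop whitespace tokens
            out.append(best.group(0))
        pos = best.end()
    return out

print(longest_match_tokens("While NextToken = True"))
# → ['While', 'NextToken', '=', 'True']
```

This costs one match attempt per alternative per token, and the choice inside a single pattern such as `to|next` is still made left to right.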

I do not know precisely what you are trying to accomplish, but you may have use for word boundaries (\b)

If you applied this, the matches would be

The spaces at the start of the text.
While
The space between While and NextToken
NextToken
The space between NextToken and =
The space between = and True
True
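A rough Python illustration of the word-boundary idea, again using a hypothetical reduced pattern rather than the one from the post. With `\b` on both sides of the keyword group, `next` is rejected inside `NextToken` because no boundary follows the `t`, so the engine falls through to the letters alternative.

```python
import re

# Hypothetical reduced pattern: keywords only count when they stand alone.
BOUNDED = re.compile(r"(\s+)|\b(to|next)\b|([a-zA-Z]+)", re.IGNORECASE)

def bounded_words(text):
    """Collect every non-whitespace match, in order."""
    return [m.group(0) for m in BOUNDED.finditer(text) if m.group(0).strip()]

print(bounded_words("   While NextToken = True"))  # → ['While', 'NextToken', 'True']
print(bounded_words("NextToken To Next"))          # → ['NextToken', 'To', 'Next']
```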

Kem_Tekinay · December 6, 2019, 9:34pm

The order in the alternation absolutely matters, and Greedy is not involved. Place the longer strings first, and if they represent entire words, surround the entire clause with word boundary anchors, like this:

\\b(word1|word|wor|wo|w|1)\\b
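A small Python check of this advice. `SHORT_FIRST` and `LONG_FIRST` are hypothetical unanchored variants added for contrast; `ANCHORED` is the pattern above written with single backslashes in a raw string.

```python
import re

SHORT_FIRST = re.compile(r"(w|wo|wor|word|word1)")        # shortest alternative first
LONG_FIRST = re.compile(r"(word1|word|wor|wo|w)")         # longest alternative first
ANCHORED = re.compile(r"\b(word1|word|wor|wo|w|1)\b")     # the anchored pattern above

print(SHORT_FIRST.findall("word1"))  # → ['w']
print(LONG_FIRST.findall("word1"))   # → ['word1']

# "words" is not in the list, so the anchors keep it from matching at all.
print(ANCHORED.findall("w wo wor word word1 words"))
# → ['w', 'wo', 'wor', 'word', 'word1']
```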

Norman_Palardy · December 6, 2019, 9:46pm

yeah I have no idea which patterns will match longer strings so ordering mattering sucks

back to the drawing board

Robert_Livingston · December 6, 2019, 10:01pm

Why doesn’t Kem’s suggestion work? Put all the specific words first in the alternative list in reverse-alphabetical order and surrounded by the \b

((\s|\n|\r)+)|(//(.*)(\r|\n)*)|(dim|as|integer|for|to|print|next|sub|function|end\s*sub|end\s*function)|([a-zA-Z]+)|(([+-]?[0-9]+(.[0-9]+)?)|[+-]?.[0-9]+)|([-+/*=]|\<\=|\>\=|\<\>)|("([^"]*)")

So if you combine Kem’s word example with the words that you had in your regex search the order would be:

word1
word
wor
wo
w
to
sub
print
next
integer
function
for
end\s*sub
end\s*function
dim
as

All of these are surrounded by the word delimiter (\b). They are placed at the beginning of the alternative list.
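One way to avoid ordering the list by hand is to build the alternation in code. The following Python sketch assumes the keyword list from the first post; `KEYWORDS`, `LEXER`, and `classify` are illustrative names. It sorts the keywords in reverse-alphabetical order, wraps them in `\b`, and places them ahead of the general letters alternative.

```python
import re

# Keyword sub-patterns taken from the regex in the first post.
KEYWORDS = ["dim", "as", "integer", "for", "to", "print", "next",
            "sub", "function", r"end\s*sub", r"end\s*function"]

# Reverse-alphabetical order puts "function" ahead of "for", and so on.
ordered = sorted(KEYWORDS, reverse=True)
keyword_part = r"\b(?:" + "|".join(ordered) + r")\b"

# Group 1: whitespace, group 2: keyword, group 3: any other run of letters.
LEXER = re.compile(r"(\s+)|(" + keyword_part + r")|([a-zA-Z]+)", re.IGNORECASE)

def classify(text):
    """Label each non-whitespace match as a keyword or an identifier."""
    out = []
    for m in LEXER.finditer(text):
        if m.group(2):
            out.append(("keyword", m.group(0)))
        elif m.group(3):
            out.append(("identifier", m.group(0)))
    return out

print(classify("Dim NextToken As Integer"))
# → [('keyword', 'Dim'), ('identifier', 'NextToken'), ('keyword', 'As'), ('keyword', 'Integer')]
```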

Norman_Palardy · December 6, 2019, 10:15pm

I’ve implemented a different mechanism and now this is no longer an issue
