I run RegEx searches on web pages to locate DOIs (digital object identifiers). This normally works well, but it fails on pages from the journal The Lancet. I noticed that the HTML of one such page was 875777 bytes long! When I truncated it to the leftmost 700000 bytes, the search worked (and was very fast).
Is there a known limit on the length of the string that RegEx can search? If so, is it adjustable?
What’s your pattern? You might be hitting a recursion or backtracking limit.
I’m using this
"\b(10[.][0-9]{3,}(?:[.][0-9]+)*/(?:(?![""?&'])\S)+)\b"
I can’t reproduce your results. This code works in 2020r2.1. Note that the string length is 10 M characters.
var rx as new RegEx
rx.SearchPattern = "\b(10[.][0-9]{3,}(?:[.][0-9]+)*/(?:(?![""?&'])\S)+)\b"
var source as string = "10.123.456/x&"
source = source + M_String.Repeat( "a", 10000000 ) + &uA + source
var match as RegExMatch = rx.Search( source )
var matchIndex as integer
while match isa object
  matchIndex = matchIndex + 1
  AddToResult matchIndex.ToString + ": " + match.SubExpressionString( 0 )
  match = rx.Search
wend
I get:
1: 10.123.456/x
2: 10.123.456/x
BTW, this pattern does the same thing but without the unneeded subgroup or negative lookahead:
\b10\.\d{3,}(?:\.\d+)*/[^"?&'\s]+\b
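If you want to convince yourself that the two patterns agree before swapping one for the other, here is a rough sketch that runs both over the same inputs. The sample strings and the variable names are made up for illustration, and System.DebugLog is used only as a convenient place to send the output.

```xojo
// Sketch only: compare the first match of each pattern on a few made-up samples.
var original as new RegEx
original.SearchPattern = "\b(10[.][0-9]{3,}(?:[.][0-9]+)*/(?:(?![""?&'])\S)+)\b"

var simplified as new RegEx
simplified.SearchPattern = "\b10\.\d{3,}(?:\.\d+)*/[^""?&'\s]+\b"

// Hypothetical test inputs, not real DOIs.
var samples() as string = Array( _
  "see 10.1234/abc.def-5 for details", _
  "href=""https://doi.org/10.1234/xyz123?ref=1""", _
  "no identifier here" )

for each sample as string in samples
  var m1 as RegExMatch = original.Search( sample )
  var m2 as RegExMatch = simplified.Search( sample )

  // Search returns nil when nothing matches, so guard before reading the match.
  var s1 as string = "(none)"
  if m1 <> nil then
    s1 = m1.SubExpressionString( 0 )
  end if

  var s2 as string = "(none)"
  if m2 <> nil then
    s2 = m2.SubExpressionString( 0 )
  end if

  // If the patterns are equivalent, both columns should agree on every line.
  System.DebugLog s1 + " | " + s2
next
```

A handful of samples is not a proof, of course, but it is a quick way to catch an obvious difference on the kind of text you actually search.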
Thanks very much for checking. You’re right, it’s not the length. I did more experimenting, and the problem seems to be the encoding. The page contents are UTF-8. If I define the encoding as something else, like this:
textToSearch = DefineEncoding(textToSearch, encodings.ISOLatin1)
the Regex search now works. Could this be due to invalid encoding for these pages?
FWIW, the UTF-8 encoded text works when I test in RegExRX. It only fails in Xojo.
IIRC, RegExRX uses the MBS RegEx plugin instead of Xojo’s built-in RegEx.
@Tim_Parnell Thanks for the tip. I’ve converted the method to use RegExMBS and it indeed works with the same pages where Xojo RegEx fails.
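For anyone who would rather stay with the built-in RegEx and test the invalid-encoding theory directly, here is a minimal sketch. It assumes textToSearch holds the page contents as in the earlier post, and that falling back to ISOLatin1 is acceptable for your purposes. Non-ASCII characters may come out wrong after the fallback, but DOIs are plain ASCII and should survive.

```xojo
// Sketch only: check whether the page bytes really are well-formed UTF-8.
if not Encodings.UTF8.IsValidData( textToSearch ) then
  // The data is not valid UTF-8. Reinterpret the bytes as ISO Latin-1
  // (every byte value is a legal Latin-1 character) ...
  textToSearch = DefineEncoding( textToSearch, Encodings.ISOLatin1 )
  // ... then convert, so the string that reaches RegEx is genuine UTF-8.
  textToSearch = ConvertEncoding( textToSearch, Encodings.UTF8 )
end if

// textToSearch can now be passed to the RegEx search as before.
```

If IsValidData returns false for the Lancet pages, that would support the idea that the built-in RegEx is failing on malformed UTF-8 rather than on the length of the page.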