Regex search length limit?

I do Regex searches on web pages to locate DOIs (digital object identifiers). It normally works well, but fails on web pages of the journal The Lancet. I noticed that on one page the HTML was almost 875777 bytes! When I truncated it to the leftmost 700000 bytes the search worked (and was very fast).

Is there a known limit to the length of the string in Regex searches? If so, is it adjustable?

What’s your pattern? You might be hitting a recursion or backtracking limit.

I’m using this

"\b(10[.][0-9]{3,}(?:[.][0-9]+)*/(?:(?![""?&'])\S)+)\b"

I can’t reproduce your results. This code works in 2020r2.1. Note that the string length is 10 M characters.

var rx as new RegEx
rx.SearchPattern = "\b(10[.][0-9]{3,}(?:[.][0-9]+)*/(?:(?![""?&'])\S)+)\b"

var source as string = "10.123.456/x&"
source = source + M_String.Repeat( "a", 10000000 ) + &uA + source

var match as RegExMatch = rx.Search( source )
var matchIndex as integer
while match isa object 
  matchIndex = matchIndex + 1
  AddToResult matchIndex.ToString + ": " + match.SubExpressionString( 0 )
  match = rx.Search
wend

I get:

1: 10.123.456/x
2: 10.123.456/x

BTW, this pattern does the same thing but without the unneeded subgroup or negative lookahead:

\b10\.\d{3,}(?:\.\d+)*/[^"?&'\s]+\b
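
For anyone trying the simplified pattern, here is a minimal sketch of how it might be dropped into Xojo code. The sample text and its DOI are made up for illustration. The only change to the pattern itself is that the double quote inside the character class is doubled, which is how a quote is escaped in a Xojo string literal.

```xojo
Var rx As New RegEx
// Simplified pattern from above; "" is an escaped double quote in a Xojo string literal
rx.SearchPattern = "\b10\.\d{3,}(?:\.\d+)*/[^""?&'\s]+\b"

// Hypothetical sample text (not taken from a real page)
Var sample As String = "See https://doi.org/10.1234/example.5678?ref=abc for details."

Var match As RegExMatch = rx.Search(sample)
If match <> Nil Then
  // SubExpressionString(0) is the whole match; the pattern has no capture group
  System.DebugLog(match.SubExpressionString(0))
End If
```

Because the capture group is gone, the DOI is read from subexpression 0 rather than subexpression 1.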

Thanks very much for checking. You’re right, it’s not the length. I did more experimenting and it seems the problem is the encoding. The page contents are UTF-8. If I define the encoding as something else, like this:

 textToSearch = DefineEncoding(textToSearch, encodings.ISOLatin1)

the Regex search now works. Could this be due to invalid encoding for these pages?
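
One way to test the invalid-encoding theory, as a rough sketch: TextEncoding.IsValidData reports whether a string's bytes are well formed for a given encoding. The helper name PrepareForRegEx and the Latin-1 fallback are assumptions for illustration. Nothing in this thread confirms that the Lancet pages actually contain malformed UTF-8.

```xojo
// Hypothetical helper: tags the raw page bytes with an encoding before searching.
Function PrepareForRegEx(raw As String) As String
  // True only if every byte sequence in raw is well-formed UTF-8
  If Encodings.UTF8.IsValidData(raw) Then
    Return DefineEncoding(raw, Encodings.UTF8)
  End If

  // Assumption: fall back to ISO Latin-1, where any byte value is accepted.
  // DOIs are plain ASCII, so they read the same under either encoding.
  System.DebugLog("Page is not valid UTF-8; searching as ISO Latin-1")
  Return DefineEncoding(raw, Encodings.ISOLatin1)
End Function
```

If the debug message appears for the failing pages, that would point to malformed bytes in the page. If it does not, the pages are valid UTF-8 and the cause lies elsewhere.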

FWIW, the UTF-8 encoded text works when I test in RegExRX. It only fails in Xojo.

IIRC, RegExRX uses the MBS RegEx plugin instead of the Xojo one.

@Tim_Parnell Thanks for the tip. I’ve converted the method to use RegExMBS and it indeed works with the same pages where Xojo RegEx fails.