RegEx class fails to match strings in a 55 MB long text

I just ran into a problem with my program where it failed to match the simplest patterns (e.g. a plain search string without any special chars) if the searched pattern is at the end of a 55 MB long string. As soon as I shorten the string significantly, the pattern is found as expected. The same pattern is found in the 55 MB string with InStr, which proves that it’s really in that string.

Here’s some example code:

dim sql as String = // loads a 55 MB SQL dump file with "CREATE INDEX" lines at the very end.
dim re as new RegEx
re.SearchPattern = "CREATE INDEX"
dim match as RegexMatch = re.Search (sql)
dim pos as Integer = sql.InStrB("CREATE INDEX")

Result: pos is > 0, i.e. the “CREATE INDEX” is found, but match is nil, which is wrong.

Is that a known issue? It should be documented but isn’t. Also, it should not fail, or throw an exception if it can’t handle such large texts. Silently failing is terrible.

Does the encoding match?

I can’t reproduce this.

My test code:

const kTarget as integer = 1024 * 1024 * 55
const kTag as string = "CREATE INDEX"

static s as string = "123456789 "
while s.Bytes < kTarget
  s = s + s
wend
s = s.LeftBytes( kTarget ) + kTag

AddToResult "Pos: " + s.IndexOf( kTag ).ToString

var rx as new RegEx
rx.SearchPattern = kTag
var match as RegExMatch = rx.Search( s )
if match is nil then
  AddToResult "No match"
else
  AddToResult "RegEx: " + match.SubExpressionStartB( 0 ).ToString
end if

(AddToResult just builds a text field.)

Maybe an encoding issue?

I tried setting the encoding both to nil and to UTF8, and neither helped.
I’ll try some more.

Is my test valid?

Tested in Xojo 2020r2.1 in 64-bit, fyi.

I can’t reproduce the problem either. My app reads large mbox files in pieces. I increased the size of the pieces to 100 MB and the regex was okay for a large file.

Based on this, I increased my test size to 100 MB with success.

@Thomas_Tempelmann, could it be that the string is not valid UTF8?

Yes, that’s it: When I set the encoding to MacRoman, it works.

Damn. But why does it try to interpret it as UTF8 when I set its encoding to nil?

Either way, RegEx should raise an exception if it can’t handle the text.
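Until RegEx does that itself, a guard along these lines might serve as a stopgap. This is only a minimal sketch: SafeRegExSearch is a hypothetical helper name, not part of Xojo, and it assumes (based on what this thread found) that RegEx treats a string with a nil encoding as UTF8. It checks the bytes with TextEncoding.IsValidData and raises an exception instead of silently returning nil.

```xojo
// Hypothetical helper, not part of Xojo. Assumption from this thread:
// RegEx interprets a string without an encoding as UTF8.
Function SafeRegExSearch(pattern As String, target As String) As RegExMatch
  // Refuse text that has no encoding and is not valid UTF8,
  // rather than letting the search fail silently.
  If target.Encoding Is Nil And Not Encodings.UTF8.IsValidData(target) Then
    Var e As New UnsupportedFormatException
    e.Message = "Target has no encoding and is not valid UTF8"
    Raise e
  End If

  Var rx As New RegEx
  rx.SearchPattern = pattern
  Return rx.Search(target) // still nil if the pattern is genuinely absent
End Function
```

If the exception fires, the workaround that helped here is to declare a single-byte encoding first, e.g. sql = sql.DefineEncoding(Encodings.MacRoman), and then search again. Note that DefineEncoding only relabels the bytes; it does not convert them.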

Somewhere along the way I picked up the notion that Strings default to UTF8 if no encoding is ever set. The documentation doesn’t say anything to this effect, though.

No, definitely not. My app handles strings whose encodings are screwed up and have to be sorted out. I also do regexes on large strings without an encoding.


Well, Beatrix, then consider that you may run into issues if you pass invalid UTF8 text to the RegEx class (if you use that one, and not an API 2 or MBS variant). What we just found out is that if you pass a string with invalid UTF8 contents and an encoding of nil to RegEx, it won’t work.


Not Strings in general, but apparently some functions (though not all) do, such as those of the RegEx class.

Did you file a Feedback case with test data?

MBS requires UTF8 data. Regex isn’t quite there for API 2.