Regex fails if the source string contains the Unicode replacement character

TimStreater · November 2, 2023, 2:58pm

This regex:

Var instr, outstr As String

Var ro As New RegExOptions
ro.CaseSensitive     = False
ro.ReplaceAllMatches = True

Var re As New RegEx
re.Options = ro
re.SearchPattern      = "\s\s+"   // two or more whitespace characters
re.ReplacementPattern = " "

outstr = re.Replace (instr)

is intended to replace white-space runs with a single space. It works for me unless the input string (instr) contains the Unicode replacement character (U+FFFD �, UTF-8: ef bf bd).

When instr does contain this character, the regex does nothing: outstr ends up as an unmodified copy of instr.

Is this expected behaviour? If not then I’ll file an Issue.

Edit: macOS Catalina, Xojo 2023r3.1

Kem_Tekinay · November 2, 2023, 4:41pm

This works without issue for me. My test code:

Var  instr, outstr As String

instr = &uFFFD + "this  and     that"

var ro as new RegExOptions
ro.CaseSensitive     = False
ro.ReplaceAllMatches = True

var re as new RegEx
re.Options = ro
re.SearchPattern      = "\s\s+"
re.ReplacementPattern = " "

outstr = re.Replace (instr) // "�this and that"

It works even if I define the encoding of instr as nil. What are we missing?

BTW, note that \s matches any whitespace character, e.g., space, tab, or newline. Your pattern would replace a run of any combination of those with a single space. If you mean only the space character, use a literal space or, if you want visibility, \x20.
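
To illustrate the difference, here is a minimal sketch. The sample string and the variable names are made up for the example and are not from the original post:

```xojo
// Hypothetical input: two tabs between "one" and "two",
// three spaces between "two" and "three"
Var sample As String = "one" + Chr(9) + Chr(9) + "two   three"

Var opts As New RegExOptions
opts.ReplaceAllMatches = True

Var rx As New RegEx
rx.Options = opts
rx.ReplacementPattern = " "

// \s matches tabs as well as spaces, so both runs collapse
rx.SearchPattern = "\s\s+"
Var anyWhitespace As String = rx.Replace(sample)  // "one two three"

// \x20 matches only the space character, so the tabs are left alone
rx.SearchPattern = "\x20\x20+"
Var spacesOnly As String = rx.Replace(sample)     // tabs kept, then "two three"
```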

TimStreater · November 2, 2023, 5:19pm

Humph. I better do some more testing. My search pattern is deliberate, by the way.

TimStreater · November 2, 2023, 6:20pm

OK - this is more like it. The regex fails if the input contains invalid UTF-8, which I force into instr via a MemoryBlock. Setting the MemoryBlock's size to one puts in a single byte (a truncated sequence) and the regex fails, whereas setting it to three and putting in all three bytes is OK (the replacement character is, after all, valid UTF-8).

I was rather surprised that this leads to silently different behaviour rather than an Exception.

Var  instr, outstr As String, mb As MemoryBlock, re As RegEx, ro As RegExOptions

mb = new MemoryBlock (4)
mb.byte(0) = &hef
mb.byte(1) = &hbf
mb.byte(2) = &hbd
mb.Size = 1              // Change to 3 and the regex works
instr = mb
instr = instr.DefineEncoding (Encodings.UTF8)
instr = "    some  " + instr + "    stuff"

ro = new RegExOptions
ro.CaseSensitive     = False
ro.ReplaceAllMatches = True

re = new RegEx
re.Options = ro
re.SearchPattern      = "\s\s+"
re.ReplacementPattern = " "

outstr = re.Replace (instr)

break
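
One possible guard is to validate the bytes before handing the string to RegEx. The sketch below assumes that Encodings.UTF8.IsValidData reports the truncated sequence as invalid. The Latin-1 fallback and the variable names are illustrative choices only, and the fallback changes how the stray byte is displayed:

```xojo
// Build the same kind of invalid input: a lone &hEF byte tagged as UTF-8
Var mb As New MemoryBlock(1)
mb.Byte(0) = &hEF
Var raw As String = mb
raw = raw.DefineEncoding(Encodings.UTF8)
Var suspect As String = "    some  " + raw + "    stuff"

// Check the bytes before running the regex
If Not Encodings.UTF8.IsValidData(suspect) Then
  // Illustrative fallback: every byte is valid in ISO Latin-1, so
  // reinterpret the bytes that way and convert the result to UTF-8
  suspect = suspect.DefineEncoding(Encodings.ISOLatin1).ConvertEncoding(Encodings.UTF8)
End If

// suspect should now be well-formed UTF-8 and can be passed to re.Replace
```

Whether to repair, reject, or log such input depends on where the bytes come from. The check itself is the part that matters here.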