Var instr, outstr As String, re As RegEx, ro As RegExOptions
ro = new RegExOptions
ro.CaseSensitive = False
ro.ReplaceAllMatches = True
re = new RegEx
re.Options = ro
re.SearchPattern = "\s\s+"
re.ReplacementPattern = " "
outstr = re.Replace (instr)
This code is intended to replace runs of whitespace with a single space. It works for me unless the input string (instr) contains the Unicode replacement character (U+FFFD �, UTF-8: ef bf bd).
When instr does contain this character, the regex does nothing except copy instr to outstr unchanged.
Is this expected behaviour? If not then I’ll file an Issue.
Var instr, outstr As String
instr = &uFFFD + "this and that"
var ro as new RegExOptions
ro.CaseSensitive = False
ro.ReplaceAllMatches = True
var re as new RegEx
re.Options = ro
re.SearchPattern = "\s\s+"
re.ReplacementPattern = " "
outstr = re.Replace (instr) // "�this and that"
It works even if I define the encoding of instr as nil. What are we missing?
BTW, note that \s matches any whitespace, e.g., space, tab, or newline. Your pattern would replace a run of any combination of those with a single space. If you mean a space, use a space or, if you want visibility, \x20.
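As a rough sketch of the space-only variant, assuming you want to collapse only U+0020 and leave tabs and newlines alone (the sample input here is made up for illustration):

```xojo
Var ro As New RegExOptions
ro.ReplaceAllMatches = True

Var re As New RegEx
re.Options = ro
re.SearchPattern = "\x20{2,}" // two or more literal spaces; \s would also match tabs and newlines
re.ReplacementPattern = " "

// Hypothetical input: a run of spaces followed by a run of tabs
Var sample As String = "this   and" + &u09 + &u09 + "that"
Var outstr As String = re.Replace(sample) // only the run of spaces is collapsed
```

With "\s\s+" instead, the two tabs would also be replaced by a single space.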
OK - this is more like it. The regex fails if the input contains invalid UTF-8, which I force into instr via a MemoryBlock. Setting the MemoryBlock's size to one puts in a single byte and the regex fails, whereas setting it to three and putting all three bytes in is OK (the replacement character is, after all, valid UTF-8).
I was perhaps surprised that this didn’t lead to an exception rather than silently different behaviour.
Var instr, outstr As String, mb As MemoryBlock, re As RegEx, ro As RegExOptions
mb = new MemoryBlock (4)
mb.byte(0) = &hef
mb.byte(1) = &hbf
mb.byte(2) = &hbd
mb.Size = 1 // Change to 3 and the regex works
instr = mb
instr = instr.DefineEncoding (Encodings.UTF8)
instr = " some " + instr + " stuff"
ro = new RegExOptions
ro.CaseSensitive = False
ro.ReplaceAllMatches = True
re = new RegEx
re.Options = ro
re.SearchPattern = "\s\s+"
re.ReplacementPattern = " "
outstr = re.Replace (instr)
break
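Since the regex hands back the input unchanged rather than raising an exception, one way to make the failure visible is to validate the string before running the replacement. This is a minimal sketch, not a definitive fix: CollapseWhitespace is a hypothetical helper name, it assumes the data is meant to be UTF-8, and it assumes Encodings.UTF8.IsValidData behaves as the Xojo documentation describes.

```xojo
// Hypothetical helper: collapse whitespace runs, but reject invalid UTF-8 up front
Function CollapseWhitespace(source As String) As String
  // Assumption: the bytes are supposed to be UTF-8
  Var s As String = source.DefineEncoding(Encodings.UTF8)
  
  // IsValidData reports whether the bytes are well-formed for the encoding
  If Not Encodings.UTF8.IsValidData(s) Then
    Var e As New InvalidArgumentException
    e.Message = "Input is not valid UTF-8"
    Raise e
  End If
  
  Var ro As New RegExOptions
  ro.ReplaceAllMatches = True
  
  Var re As New RegEx
  re.Options = ro
  re.SearchPattern = "\s\s+"
  re.ReplacementPattern = " "
  
  Return re.Replace(s)
End Function
```

With the one-byte MemoryBlock case above, a check like this should raise instead of quietly returning the input, while the three-byte case passes through to the regex as before.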