Zapping gremlins, and how slow Mid(j,1) really is

I recently came across a situation where a call to XMLDocument.LoadXML(xmlStr) failed because xmlStr contained a gremlin character (in my case ASCII 22). I don’t know how it got into the stream, but users will paste strange things from strange apps, which I then save as an XML file.
No problem, I thought: I’ll just scan through xmlStr and strip any gremlins:

dim n as integer = s.Len
dim char as string
dim a as integer
for j as integer = 1 to n
  char = s.Mid( j, 1 )
  a = Asc( char )
  if a >= 32 then
    continue
  elseif a = 13 then
    continue
  // etc., until I find an illegal char
  end if
next

Problem is, it turns out that (more than likely) the Mid(j,1) call is extremely slow for big strings (the above took 460 sec for a 500 KB string! I suspect the total time rises quadratically because the scan must begin at the start of the string for each increment of j). I can’t just do a byte-wise check because of multi-byte UTF8 chars that are perfectly legal.

So how to do? How do I quickly check each char in a string to ensure it’s not a gremlin?

Peter

Can you do a ReplaceAll( YOURSTRING, Chr(22), "" )?

Multibyte UTF-8 characters don’t contain bytes of value < 128, so you can use byte-wise functions if you want to look for invalid ASCII gremlins.
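For example, a quick sanity check (just a sketch; the variable names are only for illustration) shows that every byte of a multi-byte UTF-8 sequence is 0x80 or higher:

dim s as string = "é"                                      // U+00E9
dim utf8 as string = s.ConvertEncoding( Encodings.UTF8 )
dim mb as MemoryBlock = utf8
for i as integer = 0 to LenB( utf8 ) - 1
  // é encodes as the two bytes 0xC3 0xA9; both are >= 128
  System.DebugLog( Str( mb.Byte( i ) ) )                   // logs 195, then 169
next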

Use a regular expression. I’ll paste the code soon.

This will replace all characters less than a space (\x20 or ASC 32) with nothing, EXCEPT for a tab, linefeed, or return.

dim rx as new RegEx
rx.SearchPattern = "(?mi-Us)[\x00-\x08\x0B-\x0C\x0E-\x1F]"   // control chars except tab (09), LF (0A), CR (0D)
rx.ReplacementPattern = ""

dim rxOptions as RegExOptions = rx.Options
rxOptions.ReplaceAllMatches = true

dim replacedText as string = rx.Replace( sourceText )
(?mi-Us)

What is that part? The rest I get…

Mode switches: m = multiline, i = case-insensitive, and -Us turns off the ungreedy and dot-matches-newline options. They are my preferred way of setting options, but it doesn’t really matter in this case.

I timed this for a 5 MB string with a single Chr( 1 ) in the middle of it. It took 16 ms in the IDE.

Note that using Kem’s code, the string must be in UTF-8, not UTF-16, encoding.
E.g., execute first:

sourceText = sourceText.ConvertEncoding( Encodings.UTF8 )

Or does the Regex code now take care of that itself?

I’m afraid that’s not true:

dim s as string = "A"

and you’ll see that s is UTF8 with a single 0x41 byte.

AFAIK, all std ASCII chars are encoded as a single byte in UTF8 (for back compatibility); the fancier chars that came later are multi-byte.

p.

[quote=54955:@Kem Tekinay]This will replace all characters less than a space (\x20 or ASC 32) with nothing, EXCEPT for a tab, linefeed, or return.

[code]
dim rx as new RegEx
rx.SearchPattern = "(?mi-Us)[\x00-\x08\x0B-\x0C\x0E-\x1F]"
rx.ReplacementPattern = ""

dim rxOptions as RegExOptions = rx.Options
rxOptions.ReplaceAllMatches = true

dim replacedText as string = rx.Replace( sourceText )
[/code][/quote]

Brilliant Kem, thanks very much. Haven’t tried it but I’ll take your word for it. Should’ve thought of RegEx.

Merry XMAS to all,
P.

I need to revise my reply to you as I didn’t pay attention to the details. You specified multi-byte, and if that’s true, then you’re absolutely right, I can do byte-wise, knowing that any char <0x20 is illegal (except for CR etc) and will not be part of a multibyte char.

Thx Jonathan,
p.

It looks like that is no longer a requirement. I took a UTF8 string "aa" + Chr( 1 ) + "aa" and converted it to UTF16, then ran it through that RegEx. Looking at the before and after in binary, I got exactly the string I expected, and it’s even in UTF16.

That doesn’t prove anything, because in UTF-16 every 2nd byte would be zero, and replacing 01-bytes with another single-byte char is not expected to fail even in UTF-16. Try replacing the chr(1) with an empty string - if that still works, then you’re good.

[quote=55024:@Peter Stys]I’m afraid that’s not true:

dim s as string = "A"

and you’ll see that s is UTF8 with a single 0x41 byte.

AFAIK, all std ASCII chars are encoded as a single byte in UTF8 (for back compatibility); the fancier chars that came later are multi-byte.

p.[/quote]
I think Jonathan meant something else:

In UTF-8, if you only look for ASCII chars (any char code < 128), then you can look for them with a byte-wise access, e.g. by putting the string into a MemoryBlock and looping over its bytes, because any non-ASCII char in UTF-8 is represented only by bytes that are >= 128.
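A minimal sketch of that byte-wise approach, assuming sourceText is already UTF-8 (the variable names are just for illustration):

dim mb as MemoryBlock = sourceText
dim byteCount as integer = LenB( sourceText )
dim cleaned as string
dim runStart as integer = 0

for i as integer = 0 to byteCount - 1
  dim b as integer = mb.Byte( i )
  // bytes >= 128 belong to multi-byte UTF-8 sequences and are always legal here
  if b < 32 and b <> 9 and b <> 10 and b <> 13 then
    // gremlin: copy the clean run before it and skip this byte
    cleaned = cleaned + mb.StringValue( runStart, i - runStart )
    runStart = i + 1
  end if
next

cleaned = cleaned + mb.StringValue( runStart, byteCount - runStart )
cleaned = DefineEncoding( cleaned, Encodings.UTF8 )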

I think you misread the pattern because that’s exactly what it does: replace all the characters below ASC 32 (with three exceptions) with nothing. The UTF16 source was 10 bytes. If the RegEx munged it, it would have come out as 4 bytes, but it came out, properly, as 8 bytes with the correct encoding.

[quote=55071:@Thomas Tempelmann]I think Jonathan meant something else:

In UTF-8, if you only look for ASCII chars (any char code < 128), then you can look for them with a byte-wise access, e.g. by putting the string into a MemoryBlock and looping over its bytes, because any non-ASCII char in UTF-8 is represented only by bytes that are >= 128.[/quote]

Quite right Thomas, I replied too hastily. I suspect UTF8 was designed this way for precisely this type of reason. Very clever!

p.

Oops, sorry. Somehow this line

rx.ReplacementPattern = ""

looked to me as if the rx.ReplacementPattern was a space.

I just ran a test myself with Kem’s code. I found something very odd:

This works:

dim sourceText as string = "a" + chr(2) + "c"
sourceText = sourceText.ConvertEncoding( Encodings.UTF16 )

But this does not:

sourceText = sourceText.ConvertEncoding(Encodings.UTF16BE)

Nor this:

sourceText = sourceText.ConvertEncoding(Encodings.UTF16LE)

That’s screwed up.

Bugged here: <https://xojo.com/issue/31422>

Good catch. It’s treating it like a UTF8 string and removing the nulls, munging the string.

Bottom line: As Thomas suggested, be on the safe side and convert your string to UTF8 before feeding it to a RegEx.
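Put together, something along these lines should do it (ZapGremlins is just a hypothetical name; this is only a sketch of Kem’s pattern plus the encoding conversion):

Function ZapGremlins( sourceText as string ) as string
  // work in UTF-8 so the RegEx doesn't mangle UTF-16 strings (see the bug above)
  dim s as string = sourceText.ConvertEncoding( Encodings.UTF8 )

  dim rx as new RegEx
  rx.SearchPattern = "(?mi-Us)[\x00-\x08\x0B-\x0C\x0E-\x1F]"   // control chars except tab, LF, CR
  rx.ReplacementPattern = ""
  rx.Options.ReplaceAllMatches = true

  return rx.Replace( s )
End Function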

I added a test project to the case. I just used the string “1234” with the SearchPattern “\x00” since the problem is that it’s removing the nulls as if it were a UTF8 string.