XML Parsing.

Given this:

Dim MyRawXML As String
MyRawXML = "<?xml version=""1.0"" encoding=""utf-8""?>??<response>???<result>success</result>???<get_users_response>???<users>????<user>bill</user>????<user>sam</user>????<user>fred</user>???</users>???</get_users_response>??</response>"

I have literally cut and pasted this code from the documentation:

[code] Dim xdoc As XmlDocument
Dim node As XMLNode
Dim count As Integer
Dim out As String

// create a new document
xdoc = New XmlDocument
xdoc.PreserveWhitespace = False

// load and parse the xml in the TextArea, parse_InputText
xdoc.LoadXml(MyRawXML) // <-- Error here![/code]

I get an XMLException at runtime.


I expect that's because ??? is not valid in an XML document.
If you look at that in the debugger, what byte(s) is it?

It's coming back from a RESTful API. Is there an easy way to cleanse this?

Exception Message: msg:XML parser error 4: not well-formed (invalid token)
Exception Error Number: 2

Normal Thanks. I have pulled out the offending ? and it works fine.

Can anyone think of a way to clean the bad stuff out easily? Regex perhaps?

I’m not sure about “easily” – years ago I had a routine that just went character by character and removed any unusual control-type characters that RS’ XML engine didn’t like. It was the only way I could get it to work because the XML was coming in badly formed and my app would crash before I could even load the XML (I opened it as a string and processed it first, then sent the clean one to loadXML and it worked).

Today I’d probably do that with a regex looking for control characters.
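For what it's worth, a minimal sketch of that regex idea might look like the following. StripControlChars is a made-up helper name, not something from the documentation, and the pattern only covers the control characters that XML 1.0 does not allow (everything below 32 except tab, line feed and carriage return). Anything else that upsets the parser would need its own pattern.

[code]Function StripControlChars(source As String) As String
  // Hypothetical helper: removes control characters that XML 1.0 forbids,
  // keeping tab (&h09), line feed (&h0A) and carriage return (&h0D).
  Dim rx As New RegEx
  rx.SearchPattern = "[\x00-\x08\x0B\x0C\x0E-\x1F]"
  rx.ReplacementPattern = ""
  rx.Options.ReplaceAllMatches = True // replace every match, not just the first
  Return rx.Replace(source)
End Function[/code]

You would call it on the raw string before handing it to the parser, e.g. xdoc.LoadXml(StripControlChars(MyRawXML)).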

Oddly: All the characters in question are ascii: 65533

I had no idea you could go that high in ASCII : )

It appears to be some sort of Unicode issue. The specifics are way above me. Anyone have any ideas?

I’ve been called a lot of things over the years and surprisingly this is among the most common

[quote=31052:@Jay Menna]I have pulled out the offending ? and it works fine.
Can anyone think of a way to clean the bad stuff out easily? Regex perhaps?[/quote]

65533 ?
I’d look at the bytes in the debugger - I have some suspicions but better to be sure.

Use AscB() instead of Asc() to get the byte values. Or, better, just look at the string in the debugger.
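If the debugger view is awkward to read, a rough sketch like this would list every byte value in the string. ByteDump is just an illustrative name; it relies only on the byte-oriented functions LenB, MidB and AscB.

[code]Function ByteDump(s As String) As String
  // Hypothetical helper: returns the decimal value of every byte in s,
  // separated by spaces, ignoring whatever encoding the string claims to have.
  Dim parts() As String
  For i As Integer = 1 To LenB(s)
    parts.Append(Str(AscB(MidB(s, i, 1))))
  Next
  Return Join(parts, " ")
End Function[/code]

Running it on just the few characters between the XML declaration and the opening response tag keeps the output short enough to read.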

mystring=ReplaceAll(mystring,chr(65533),"") solves it…


AscB = 239

65533 is the “replacement character”

Basically you have a string in some encoding (presuming you DID call DefineEncoding on it), and the character that is there is not one that exists in whatever encoding it's defined to be.

It really seems the right thing to do is to call DefineEncoding and see what they are.
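A sketch of how that could fit together, assuming the API really does send UTF-8 as its XML declaration claims (that is an assumption, so check the response headers). It tags the string with DefineEncoding, checks that the bytes are valid for that encoding, and only then strips the replacement character before parsing. The variable name cleaned is made up for this example.

[code]// Assumption: the response is UTF-8, as its XML declaration says.
Dim cleaned As String = DefineEncoding(MyRawXML, Encodings.UTF8)

If Not Encodings.UTF8.IsValidData(cleaned) Then
  // The bytes are not valid UTF-8, so the real encoding is probably something else.
  System.DebugLog("Response is not valid UTF-8; check which encoding the API actually uses.")
End If

// Remove U+FFFD (decimal 65533), the Unicode replacement character.
cleaned = ReplaceAll(cleaned, Encodings.UTF8.Chr(&hFFFD), "")

Dim xdoc As New XmlDocument
xdoc.PreserveWhitespace = False
xdoc.LoadXml(cleaned)[/code]

Stripping the character gets the document to parse, but it does not recover whatever the server originally meant to send in those positions.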