XML valid but not for XOJO

Hi everybody,
I’m trying to load an XML file (an Italian e-invoice), but XOJO’s internal XML parser raises exception number 4 for a “not well-formed token”. To double-check, I copied and pasted the XML into several online validators (including the W3Schools one), and they all report that the XML is correct and valid.
Moreover, in an effort to strip any unwanted characters from the XML in XOJO, I’ve tried this RegEx routine…

Dim reg as new RegEx
reg.searchPattern = "[^\\x09\\x0A\\x0D\\x20-\\xD7FF\\xE000-\\xFFFD\\x10000-x10FFFF]"
reg.replacementPattern = ""
Dim correctXML as string= reg.replace(data)

… but if I then use the XMLDocument.Load() method, it still raises the very same error.
The document uses a UTF-8 charset.
What am I possibly doing wrong?

Thanks everyone for the kind replies.

Cut and paste the XML into a constant and pass it to the xmldocument. Do you get the same error?

If there is a <!DOCTYPE…> line (usually the second line), then remove it.

What does the XML look like?
e.g. does it start with some text encoding declaration?

No James, it doesn’t. NO ERRORS.
Why?!?

Maybe copy & paste changes the text encoding, or removes some null characters?

The null characters should already have been stripped with the RegEx…

[quote=445730:@Christian Schmitz]What does the XML look like?
e.g. does it start with some text encoding declaration?[/quote]

It does start with a windows-1252 encoding declaration, but I pass the XML through this…
Dim nuovoXML As String
nuovoXML = ReplaceAll(data, "windows-1252", "UTF-8")
… so there shouldn’t be any problems.

[quote=445732:@Riccardo Santato]No James, it doesn’t. NO ERRORS.
Why?!?[/quote]

That would suggest there’s some issue with the file itself rather than with the XML.

[quote=445735:@Riccardo Santato]It does start with a windows-1252 encoding declaration, but I pass the XML through this…
Dim nuovoXML As String
nuovoXML = ReplaceAll(data, "windows-1252", "UTF-8")
… so there shouldn’t be any problems.[/quote]

It would seem that the content you are passing is not UTF-8 encoded and changing the tag will not make it so. Try encoding the text as well as changing the tag. I expect that will fix it.

@Riccardo Santato — I agree with James Dooley: Windows-1252 is an extended ASCII encoding. It has nothing to do with UTF-8 so you need to convert the content of your XML file.
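For reference, a minimal sketch of what "convert the content" could look like. It assumes the raw file contents are already in a String called data (as in the snippets earlier in the thread) and that the bytes really are Windows-1252. Encodings.WindowsANSI is, as far as I know, Xojo's name for that code page, so treat it as an assumption. DefineEncoding only labels the existing bytes; ConvertEncoding is what actually rewrites them as UTF-8.

```xojo
// Assumption: data holds the raw file contents, encoded as Windows-1252.
// Step 1: tell Xojo what the bytes are (no bytes change here).
Dim labelled As String = DefineEncoding(data, Encodings.WindowsANSI)

// Step 2: actually re-encode the text as UTF-8.
Dim utf8XML As String = ConvertEncoding(labelled, Encodings.UTF8)

// Step 3: only now make the XML declaration match the real encoding.
utf8XML = ReplaceAll(utf8XML, "windows-1252", "UTF-8")

Dim xml As New XmlDocument
xml.LoadXml(utf8XML)
```

Replacing the declaration alone leaves accented characters as single Windows-1252 bytes, which are not valid UTF-8 sequences, and that would explain the "not well-formed" error.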

Thanks James, your solution helped me out a lot!
BTW, it very often happens that, when converting files from P7M to XML, the XML file comes with non-ASCII garbage that I’d like to remove.
I’ve discovered that the RegEx routine I posted above does not work properly.
Any suggestion?

First off, make sure that when you read the file in, you define the encoding of the text you just read.
If you don’t, you can end up with all kinds of follow-on issues, because the encoding is either undefined or defined incorrectly.

https://blog.xojo.com/2013/08/20/why-are-there-diamonds-in-my-user-interface/
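A rough sketch of defining the encoding at read time, assuming the file is Windows-1252 (Encodings.WindowsANSI in Xojo, as far as I know) and that it is picked with a file dialog. The variable names are made up for illustration.

```xojo
// Hypothetical: let the user pick the extracted XML file.
Dim f As FolderItem = GetOpenFolderItem("")
If f <> Nil Then
  Dim stream As TextInputStream = TextInputStream.Open(f)
  // Passing the encoding to ReadAll means the string is never left undefined.
  // Assumption: the file on disk is Windows-1252.
  Dim data As String = stream.ReadAll(Encodings.WindowsANSI)
  stream.Close

  // Convert to UTF-8 before handing the text to anything else.
  data = ConvertEncoding(data, Encodings.UTF8)
End If
```

If the encoding passed to ReadAll is wrong, the conversion will happily produce wrong characters, so it is worth checking the declaration in the file before choosing it.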

[quote=445769:@Riccardo Santato]Thanks James, your solution helped me out a lot!
BTW, it very often happens that, when converting files from P7M to XML, the XML file comes with non-ASCII garbage that I’d like to remove.
I’ve discovered that the RegEx routine I posted above does not work properly.
Any suggestion?[/quote]

Don’t do it, would be my suggestion! The whole point of having a standard format is to ensure that you don’t receive corrupt data. If you start cleaning up files, you could end up processing any old garbage without knowing it, and any consequences of processing an invalid file would rightly come back on you. Remember, you are trying to fix a problem you did not create, using guesswork. That is never a good idea when it comes to accounting.

Changing the encoding is part of the standard process, but beyond that, the expected response to a corrupt file is to reject it.