I was working with some XML and was getting a response back from the remote server.
Strangely, the data was filled with garbage characters (character code 65533). So I stripped out anything that was 65533 before I parsed the XML, and the problems all went away.
Now the problem is back, and Xojo 4.1 is giving me major problems.
Is there an easy way to force the XML to be UTF-8 only?
That’s an odd character as it is not valid UTF-8 at all. At first I thought it might be a BOM, but this works out to &hFFFD and a BOM is &hFFFE or &hFEFF, and that’s for UTF-16.
The characters &hFE and &hFF are never valid in UTF-8, so you can safely filter those out, but &hFD can be valid as a leading byte, so that would be harder. You’d have to write a method to filter out invalid UTF-8 bytes. This can be done through a MemoryBlock and should be pretty quick and not too hard.
Are these characters just appearing randomly throughout the text?
OK, I’ve added this method to my M_String module, but have not posted the update to my web site yet. It ensures a string is made into valid UTF-8, discarding whatever doesn’t make sense. Comments welcome:
Protected Function MakeValidUTF8(src As String) As String
  // Turn the given string into valid UTF-8.
  // Filters out invalid characters, so it might return an empty string.
  if src.Encoding = nil then
    src = src.DefineEncoding( Encodings.UTF8 )
  elseif src.Encoding <> Encodings.UTF8 then
    src = src.ConvertEncoding( Encodings.UTF8 )
  end if
  if src = "" or Encodings.UTF8.IsValidData( src ) then
    return src
  end if
  // If we get here, we have a non-empty string defined as UTF8, but it's not valid.
  // We have to remove the invalid bytes.
  dim mb as MemoryBlock = src
  dim p as Ptr = mb
  dim lastIndex as integer = mb.Size - 1
  dim writeIndex as integer
  dim readIndex as integer
  while readIndex <= lastIndex
    dim thisByte as integer = p.Byte( readIndex )
    if thisByte <= &b01111111 then
      p.Byte( writeIndex ) = thisByte
      readIndex = readIndex + 1
      writeIndex = writeIndex + 1
    elseif thisByte >= &b11111110 then // Invalid byte
      readIndex = readIndex + 1
    else // It's a leading byte so figure out how many valid bytes should be in the group and check them
      dim byteCount as integer
      if thisByte >= &b11111100 then
        byteCount = 6
      elseif thisByte >= &b11111000 then
        byteCount = 5
      elseif thisByte >= &b11110000 then
        byteCount = 4
      elseif thisByte >= &b11100000 then
        byteCount = 3
      elseif thisByte >= &b11000000 then
        byteCount = 2
      else // This is an invalid byte so filter it out
        readIndex = readIndex + 1
        continue while // Skip to the next byte immediately
      end if
      // Make sure we have enough bytes to make a complete character. If not, filter this out.
      if ( readIndex + byteCount - 1 ) > lastIndex then
        readIndex = readIndex + 1
        continue while // Skip to the next byte immediately
      end if
      dim chunk as string = mb.StringValue( readIndex, byteCount )
      if Encodings.UTF8.IsValidData( chunk ) then
        mb.StringValue( writeIndex, byteCount ) = chunk
        readIndex = readIndex + byteCount
        writeIndex = writeIndex + byteCount
      else // This can't be a leading byte so let's discard it
        readIndex = readIndex + 1
      end if
    end if
  wend
  dim r as string
  if writeIndex <> 0 then
    r = mb.StringValue( 0, writeIndex )
    r = r.DefineEncoding( Encodings.UTF8 )
  end if
  return r
End Function
That depends. Will Encodings.UTF8.IsValidData properly validate such sequences? If so, then my code will do it correctly too since it will just take multi-byte characters and ask Xojo if they are valid.
Just to review the logic of the code, it first determines that a string is not empty and is invalid UTF-8, then examines it byte-by-byte.
Bytes that are less than 128 are valid. Bytes that are greater than 253 are invalid. Bytes that start with &b10xxxxxx are invalid (these are continuation bytes that cannot exist without leading bytes before them; see below).
Everything else (any byte that starts with &b11xxxxxx) is a leading byte, so it determines how many bytes are supposed to be in the sequence, peels those off and runs them through IsValidData. If the sequence is not valid, it skips that invalid leading byte and moves on to the next byte. Otherwise, it accepts the entire sequence and moves past it.
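The byte classification described above can be sketched as a small standalone helper. This is only an illustration and not part of the posted method; the name UTF8SequenceLength is made up here. It uses the same bit-pattern thresholds as the method and returns 0 for any byte that cannot start a sequence.

```xojo
Function UTF8SequenceLength(firstByte As Integer) As Integer
  // Hypothetical helper: returns how many bytes a sequence starting with
  // firstByte claims to span, or 0 if firstByte cannot start a sequence.
  if firstByte <= &b01111111 then
    return 1 // Plain ASCII, always valid on its own
  elseif firstByte >= &b11111110 then
    return 0 // &hFE and &hFF never appear in UTF-8
  elseif firstByte >= &b11111100 then
    return 6
  elseif firstByte >= &b11111000 then
    return 5
  elseif firstByte >= &b11110000 then
    return 4
  elseif firstByte >= &b11100000 then
    return 3
  elseif firstByte >= &b11000000 then
    return 2
  else
    return 0 // &b10xxxxxx is a continuation byte with no leading byte before it
  end if
End Function
```

A non-zero result only says how long the sequence claims to be; the bytes still have to pass IsValidData before they are accepted.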
Here is what will probably be the final version of the code. I added an option to convert modified UTF-8 to UTF-8, i.e., a long NULL is converted to a regular null.
Protected Function MakeValidUTF8(src As String, convertLongNULL As Boolean = False) As String
  // Turn the given string into valid UTF-8.
  // Filters out invalid characters, so it might return an empty string.
  // If convertLongNULL is true, then it will look for the "modified UTF-8" convention
  // of using &b11000000 10000000 to store a null and convert that to an ordinary
  // null (&h00).
  if src = "" then return src
  // If the string has an encoding, and is valid in its own encoding, then convert it to UTF-8.
  // My thinking here is that, if it's not valid in its defined encoding, that encoding is wrong and
  // it should be treated the same as if the encoding were nil.
  if src.Encoding <> nil and src.Encoding <> Encodings.UTF8 and src.Encoding.IsValidData( src ) then
    src = src.ConvertEncoding( Encodings.UTF8 )
    return src // We assume Xojo did the conversion correctly
  end if
  // If we get here, we have to start checking the bytes of the string
  if Encodings.UTF8.IsValidData( src ) then
    return src.DefineEncoding( Encodings.UTF8 )
  end if
  // If we get here, we have a non-empty string that is not valid UTF-8.
  // We have to remove the invalid bytes.
  dim mb as MemoryBlock = src
  dim p as Ptr = mb
  dim lastIndex as integer = mb.Size - 1
  dim writeIndex as integer
  dim readIndex as integer
  while readIndex <= lastIndex
    dim thisByte as integer = p.Byte( readIndex )
    if thisByte <= &b01111111 then
      p.Byte( writeIndex ) = thisByte
      readIndex = readIndex + 1
      writeIndex = writeIndex + 1
    else // It's a leading byte so figure out how many valid bytes should be in the group and check them
      dim byteCount as integer
      if thisByte >= &b11111110 then // Invalid byte
        // Do nothing
      elseif thisByte >= &b11111100 then
        byteCount = 6
      elseif thisByte >= &b11111000 then
        byteCount = 5
      elseif thisByte >= &b11110000 then
        byteCount = 4
      elseif thisByte >= &b11100000 then
        byteCount = 3
      elseif thisByte >= &b11000000 then
        byteCount = 2
      end if
      if byteCount = 0 then // Invalid byte so skip it
        readIndex = readIndex + 1
        // Make sure we have enough bytes to make a complete character. If not, filter this out.
      elseif ( readIndex + byteCount - 1 ) > lastIndex then
        readIndex = readIndex + 1
      elseif convertLongNULL and byteCount = 2 and thisByte = &b11000000 and p.Byte( readIndex + 1 ) = &b10000000 then // It's a long null
        p.Byte( writeIndex ) = 0
        readIndex = readIndex + byteCount
        writeIndex = writeIndex + 1
      else
        // See if the sequence headed by this leading byte is valid.
        // If so, we will accept the entire sequence.
        dim chunk as string = mb.StringValue( readIndex, byteCount )
        if Encodings.UTF8.IsValidData( chunk ) then
          mb.StringValue( writeIndex, byteCount ) = chunk
          readIndex = readIndex + byteCount
          writeIndex = writeIndex + byteCount
        else // This can't be a leading byte so let's discard it
          readIndex = readIndex + 1
        end if
      end if // byteCount = 0
    end if // thisByte <= &b01111111
  wend // readIndex <= lastIndex
  dim r as string
  if writeIndex <> 0 then
    r = mb.StringValue( 0, writeIndex )
    r = r.DefineEncoding( Encodings.UTF8 )
  end if
  return r
End Function
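For the original question, a rough sketch of how the method might be called before parsing. The variable name response is a placeholder for whatever string came back from the server, and the call assumes the method lives in the M_String module mentioned above (a Protected module method has to be called with the module name in front).

```xojo
dim response as string // Placeholder: assumed to hold the raw server response
dim cleaned as string = M_String.MakeValidUTF8( response )

dim xml as new XmlDocument
try
  xml.LoadXml( cleaned )
catch err as XmlException
  // The cleaned string can still be malformed XML (or empty), so keep the handler
  System.DebugLog( "XML error: " + err.Message )
end try
```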
If there is an invalid UTF-8 sequence in the original src, it gets replaced by the “replacement character” U+FFFD (decimal 65533). You can then either remove those characters or, in some cases, leave them in to indicate a problem.
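A minimal sketch of the "remove them" option, assuming the string has already been decoded as UTF-8. The variable s is a placeholder for the decoded text.

```xojo
dim s as string // Placeholder: the decoded text, assumed to be UTF-8
dim replacementChar as string = Encodings.UTF8.Chr( &hFFFD )

if s.InStr( replacementChar ) > 0 then
  // At least one byte sequence could not be decoded upstream
  System.DebugLog( "Input contained replacement characters" )
  s = s.ReplaceAll( replacementChar, "" )
end if
```

Leaving the InStr check in place makes it possible to log the problem instead of silently dropping data.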
I can’t reproduce that. For example, when I run this code:
Sub Action()
  dim s as string
  s = "123" + ChrB( &b10110110 )
  s = s.DefineEncoding( Encodings.UTF8 )
  if Encodings.UTF8.IsValidData( s ) then
    AddToResult "Valid"
  end if
  s = s.ConvertEncoding( Encodings.UTF16 )
  s = s.ConvertEncoding( Encodings.UTF8 )
  if Encodings.UTF8.IsValidData( s ) then
    AddToResult "Valid"
  end if
End Sub
The bogus character is replaced by &hEFBFBD, which is the UTF-8 encoding of U+FFFD. It does become valid UTF-8, though, so that’s certainly a workaround. However, in another test, I added a bogus leading byte and this ultimately returned an empty string. In short, I wouldn’t do it that way.
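To see the connection between the three bytes and character 65533, something like the following could be run in a test project. It is only a sketch; the hex string should come out as the bytes EF BF BD.

```xojo
dim replacementChar as string = Encodings.UTF8.Chr( &hFFFD )
// Log the raw bytes of the character as a hex string
System.DebugLog( EncodeHex( replacementChar ) )
```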
I’m afraid I have the same problem, and every time it crashes my program. Load the file “example.docx” by clicking on “Dynamic”. The file styles.xml triggers the XML exception. If I open the file in a hex viewer (copying and pasting the string from the debugger), it contains the string EFBFBD twice at the end of the first line. This surely leads to the exception, because document.xml is parsed without errors. Unfortunately, this cannot be fixed with your code from above. Would you please take a look at the demo project? The error occurs with all DocX files.
I don’t think you’re using XMLReader properly. You are calling Parse on two separate documents with the same reader, and it doesn’t know what to make of it, so it is declaring the second document “junk”. You have to parse each document separately.
Please correct me if I’m wrong, but since the content of the files is read within a loop, each file is already parsed separately.
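One way to read "parse each document separately" is to create a fresh reader for every file, roughly as sketched below. The method name ParseEach and the xmlTexts array are made up for this example, and in a real project the reader would be your own XmlReader subclass with its event handlers.

```xojo
Sub ParseEach(xmlTexts() As String)
  // Hypothetical helper: xmlTexts holds the text of each XML file from the archive
  for each xmlText as string in xmlTexts
    dim reader as new XmlReader // A new parser per document, so no state carries over
    try
      reader.Parse( xmlText )
    catch err as XmlReaderException
      System.DebugLog( "XML error: " + err.Message )
    end try
  next
End Sub
```

Only one reader is alive at a time here, since each one goes out of scope at the end of its loop iteration, so the number of files in the archive should not matter much for memory use.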
That means what? That I need a separate XmlReader class for each file?
That’s exactly what I wanted to avoid, because I never know how many XML files are in each archive and I want to keep memory consumption as low as possible. Well, I’ll see if I can get it to work with your suggestion and whether the XML error still occurs. I am also open to other suggestions! Thank you.