UTF-8 XML and weird characters

I was working with some XML that came back in a response from a remote server.

Strangely, the data was filled with garbage characters, all with code point 65533. So I stripped out every character that was 65533 before I parsed the XML, and the problems all went away.
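For reference, 65533 is &hFFFD, the Unicode replacement character, which decoders typically insert where they could not decode the original bytes. In UTF-8 it is stored as the three bytes EF BF BD. A minimal sketch of the stripping step described above, assuming the text is already UTF-8 (StripReplacementChars and xmlText are hypothetical names, not from the original post):

```xojo
Function StripReplacementChars(xmlText As String) As String
  // Hypothetical helper: removes every U+FFFD (decimal 65533) from UTF-8 text.
  // Assumes xmlText holds UTF-8 bytes, whatever encoding it is currently tagged with.
  Dim replacementChar As String = Encodings.UTF8.Chr( &hFFFD ) // bytes EF BF BD
  
  // Byte-wise replace, so no case folding or normalization is involved.
  Dim cleaned As String = ReplaceAllB( xmlText, replacementChar, "" )
  Return cleaned.DefineEncoding( Encodings.UTF8 )
End Function
```

Removing the character hides the symptom only; whatever bytes it replaced are already lost by the time it appears.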

Now the problem has returned, and XOJO 4.1 is giving me major problems.

Is there an easy way to force the XML to be UTF-8 only?

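That’s an odd character. Taken as raw bytes, &hFF &hFD is not valid UTF-8 at all (as a code point, U+FFFD is the Unicode replacement character, which UTF-8 stores as &hEF &hBF &hBD). At first I thought it might be a BOM, but this works out to &hFFFD and a BOM is &hFFFE or &hFEFF, and that’s for UTF-16.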

The bytes &hFE and &hFF are never valid in UTF-8, so you can safely filter those out. &hFD was a leading byte in the original six-byte form of UTF-8 (the current standard, RFC 3629, no longer allows it), so that would be harder to filter blindly. You’d have to write a method to filter out invalid UTF-8 bytes. This can be done through a MemoryBlock and should be pretty quick and not too hard.
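As a point of comparison, here is a minimal sketch of how a lead byte maps to a sequence length under the current UTF-8 definition (RFC 3629), which stops at four bytes and never uses &hC0, &hC1, or &hF5 through &hFF. The function name UTF8SequenceLength is hypothetical, and it assumes the argument is a byte value from 0 to 255:

```xojo
Function UTF8SequenceLength(leadByte As Integer) As Integer
  // Returns how many bytes the sequence starting with leadByte should have
  // under RFC 3629, or 0 if leadByte cannot start a sequence.
  If leadByte <= &h7F Then Return 1                         // ASCII
  If leadByte >= &hC2 And leadByte <= &hDF Then Return 2
  If leadByte >= &hE0 And leadByte <= &hEF Then Return 3
  If leadByte >= &hF0 And leadByte <= &hF4 Then Return 4
  Return 0 // continuation bytes (&h80-&hBF), &hC0, &hC1, and &hF5-&hFF
End Function
```

A lead byte in range is necessary but not sufficient: the continuation bytes that follow still have to be checked.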

Are these characters just appearing randomly throughout the text?

OK, I’ve added this method to my M_String module, but have not posted the update to my web site yet. It ensures a string is made into valid UTF-8, discarding whatever doesn’t make sense. Comments welcome:

Protected Function MakeValidUTF8(src As String) As String
  // Turn the given string into valid UTF-8.
  // Filters out invalid characters, so it might return an empty string.
  
  if src.Encoding = nil then
    src = src.DefineEncoding( Encodings.UTF8 )
  elseif src.Encoding <> Encodings.UTF8 then
    src = src.ConvertEncoding( Encodings.UTF8 )
  end if
  
  if src = "" or Encodings.UTF8.IsValidData( src ) then
    return src
  end if
  
  // If we get here, we have a non-empty string defined as UTF8, but it's not valid.
  // We have to remove the invalid bytes.
  
  dim mb as MemoryBlock = src
  dim p as Ptr = mb
  dim lastIndex as integer = mb.Size - 1
  dim writeIndex as integer
  dim readIndex as integer
  while readIndex <= lastIndex
    dim thisByte as integer = p.Byte( readIndex )
    if thisByte <= &b01111111 then
      p.Byte( writeIndex ) = thisByte
      readIndex = readIndex + 1
      writeIndex = writeIndex + 1
      
    elseif thisByte >= &b11111110 then // Invalid byte
      readIndex = readIndex + 1
      
    else // It's a leading byte so figure out how many valid bytes should be in the group and check them
      dim byteCount as integer
      if thisByte >= &b11111100 then
        byteCount = 6
      elseif thisByte >= &b11111000 then
        byteCount = 5
      elseif thisByte >= &b11110000 then
        byteCount = 4
      elseif thisByte >= &b11100000 then
        byteCount = 3
      elseif thisByte >= &b11000000 then
        byteCount = 2
      else // This is an invalid byte so filter it out
        readIndex = readIndex + 1
        continue while // Skip to the next byte immediately
      end if
      
      // Make sure we have enough bytes to make a complete character. If not, filter this out.
      if ( readIndex + byteCount - 1 ) > lastIndex then
        readIndex = readIndex + 1
        continue while // Skip to the next byte immediately
      end if
      
      dim chunk as string = mb.StringValue( readIndex, byteCount )
      if Encodings.UTF8.IsValidData( chunk ) then
        mb.StringValue( writeIndex, byteCount ) = chunk
        readIndex = readIndex + byteCount
        writeIndex = writeIndex + byteCount
      else // This can't be a leading byte so let's discard it
        readIndex = readIndex + 1
      end if
      
    end if
  wend
  
  dim r as string
  if writeIndex <> 0 then
    r = mb.StringValue( 0, writeIndex )
    r = r.DefineEncoding( Encodings.UTF8 )
  end if
  
  return r
  
End Function

I haven’t read it too closely, but does it check for overlong encodings and invalid codepoints?

That depends. Will Encodings.UTF8.IsValidData properly validate such sequences? If so, then my code will do it correctly too since it will just take multi-byte characters and ask Xojo if they are valid.
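One way to find out is to feed IsValidData a couple of known-bad sequences and log what it says. This is only a sketch of such a check, with a hypothetical method name; the results are not stated here because they depend on the Xojo version in use:

```xojo
Sub CheckOverlongHandling()
  // &hC0 &hAF is an overlong two-byte encoding of "/" (U+002F).
  Dim overlongSlash As String = ChrB( &hC0 ) + ChrB( &hAF )
  // &hED &hA0 &h80 encodes the surrogate U+D800, which is not a valid code point in UTF-8.
  Dim surrogate As String = ChrB( &hED ) + ChrB( &hA0 ) + ChrB( &h80 )
  
  If Encodings.UTF8.IsValidData( overlongSlash ) Then
    System.DebugLog( "Overlong sequence was accepted" )
  Else
    System.DebugLog( "Overlong sequence was rejected" )
  End If
  
  If Encodings.UTF8.IsValidData( surrogate ) Then
    System.DebugLog( "Surrogate was accepted" )
  Else
    System.DebugLog( "Surrogate was rejected" )
  End If
End Sub
```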

Just to review the logic of the code, it first determines that a string is not empty and is invalid UTF-8, then examines it byte-by-byte.

Bytes that are less than 128 are valid. Bytes that are greater than 253 are invalid. Bytes that start with &b10xxxxxx are invalid (these are continuation bytes that cannot exist without leading bytes before them; see below).

Everything else (any byte that starts with &b11xxxxxx) is a leading byte, so the code determines how many bytes are supposed to be in the sequence, peels those off, and runs them through IsValidData. If the sequence is not valid, it skips that invalid leading byte and moves on to the next byte. Otherwise, it accepts the entire sequence and moves past it.
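A quick way to exercise that logic is a string with one stray continuation byte between two ASCII runs. This sketch assumes MakeValidUTF8 lives in the M_String module as described above; the method name TryMakeValidUTF8 is hypothetical:

```xojo
Sub TryMakeValidUTF8()
  // One stray continuation byte between two ASCII runs.
  Dim dirty As String = "123" + ChrB( &b10110110 ) + "456"
  dirty = dirty.DefineEncoding( Encodings.UTF8 )
  
  Dim clean As String = M_String.MakeValidUTF8( dirty )
  
  // Following the logic described above, the stray byte should be dropped.
  // Log the bytes as hex so the result can be inspected directly.
  System.DebugLog( EncodeHex( clean, True ) )
End Sub
```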

Here is what will probably be the final version of the code. I added an option to convert modified UTF-8 to UTF-8, i.e., a long NULL is converted to a regular null.

Protected Function MakeValidUTF8(src As String, convertLongNULL As Boolean = False) As String
  // Turn the given string into valid UTF-8.
  // Filters out invalid characters, so it might return an empty string.
  // If convertLongNULL is true, then it will look for the "modified UTF-8" convention
  // of using &b11000000 10000000 to store a null and convert that to an ordinary
  // null (&h00).
  
  if src = "" then return src
  
  // If the string has an encoding, and is valid in its own encoding, then convert it to UTF-8.
  // My thinking here is that, if it's not valid in its defined encoding, that encoding is wrong and 
  // it should be treated the same as if the encoding were nil.
  if src.Encoding <> nil and src.Encoding <> Encodings.UTF8 and src.Encoding.IsValidData( src ) then
    src = src.ConvertEncoding( Encodings.UTF8 )
    return src // We assume Xojo did the conversion correctly 
  end if
  
  // If we get here, we have to start checking the bytes of the string
  if Encodings.UTF8.IsValidData( src ) then
    return src.DefineEncoding( Encodings.UTF8 )
  end if
  
  // If we get here, we have a non-empty string that is not valid UTF-8.
  // We have to remove the invalid bytes.
  
  dim mb as MemoryBlock = src
  dim p as Ptr = mb
  dim lastIndex as integer = mb.Size - 1
  dim writeIndex as integer
  dim readIndex as integer
  while readIndex <= lastIndex
    
    dim thisByte as integer = p.Byte( readIndex )
    if thisByte <= &b01111111 then
      p.Byte( writeIndex ) = thisByte
      readIndex = readIndex + 1
      writeIndex = writeIndex + 1
      
    else // It's a leading byte so figure out how many valid bytes should be in the group and check them
      dim byteCount as integer
      if thisByte >= &b11111110 then // Invalid byte
        // Do nothing
      elseif thisByte >= &b11111100 then
        byteCount = 6
      elseif thisByte >= &b11111000 then
        byteCount = 5
      elseif thisByte >= &b11110000 then
        byteCount = 4
      elseif thisByte >= &b11100000 then
        byteCount = 3
      elseif thisByte >= &b11000000 then
        byteCount = 2
      end if
      
      if byteCount = 0 then // Invalid byte so skip it 
        readIndex = readIndex + 1
        
        // Make sure we have enough bytes to make a complete character. If not, filter this out.
      elseif ( readIndex + byteCount - 1 ) > lastIndex then
        readIndex = readIndex + 1
        
      elseif convertLongNULL and byteCount = 2 and thisByte = &b11000000 and p.Byte( readIndex + 1 ) = &b10000000 then // It's a long null
        p.Byte( writeIndex ) = 0
        readIndex = readIndex + byteCount
        writeIndex = writeIndex + 1
        
      else
        
        // See if the sequence headed by this leading byte is valid.
        // If so, we will accept the entire sequence.
        dim chunk as string = mb.StringValue( readIndex, byteCount )
        if Encodings.UTF8.IsValidData( chunk ) then
          mb.StringValue( writeIndex, byteCount ) = chunk
          readIndex = readIndex + byteCount
          writeIndex = writeIndex + byteCount
        else // This can't be a leading byte so let's discard it
          readIndex = readIndex + 1
        end if
        
      end if // byteCount = 0
      
    end if // thisByte <= &b01111111
    
  wend // readIndex <= lastIndex
  
  dim r as string
  if writeIndex <> 0 then
    r = mb.StringValue( 0, writeIndex )
    r = r.DefineEncoding( Encodings.UTF8 )
  end if
  
  return r
End Function

You might try something like this:

src = DefineEncoding(src, Encodings.utf8)
src = ConvertEncoding(src, Encodings.UTF16)
src = ConvertEncoding(src, Encodings.UTF8)

If there is an invalid UTF-8 sequence in the original src, it gets replaced by the “replacement character” (U+FFFD). You can then either remove those characters or, in some cases, leave them in to indicate a problem.

I can’t reproduce that. For example, when I run this code:

Sub Action()
  dim s as string
  
  s = "123" + ChrB( &b10110110 )
  s = s.DefineEncoding( Encodings.UTF8 )
  
  if Encodings.UTF8.IsValidData( s ) then
    AddToResult "Valid"
  end if
  
  s = s.ConvertEncoding( Encodings.UTF16 )
  s = s.ConvertEncoding( Encodings.UTF8 )
  
  if Encodings.UTF8.IsValidData( s ) then
    AddToResult "Valid"
  end if
End Sub

The bogus character is replaced by &hEFBFBD, which is the UTF-8 encoding of U+FFFD. It does become valid UTF-8, though, so that’s certainly a workaround. However, in another test, I added a bogus leading byte and this ultimately returned an empty string. In short, I wouldn’t do it that way.

Hello @Kem,

I’m afraid I have the same problem, and every time it crashes my program. Load the file “example.docx” by clicking on “Dynamic”. The file styles.xml triggers the XML exception. If I open the file in a hex viewer (copying and pasting the string from the debugger), it contains the bytes EFBFBD twice at the end of the first line. This surely leads to the exception, because document.xml is parsed without errors… Unfortunately, this cannot be fixed with your code from above. Would you please take a look at the demo project? The error occurs with all DocX files.

Thanks

I don’t think you’re using XMLReader properly. You are calling Parse on two separate documents, and it doesn’t know what to make of that, so it is declaring the second document “junk”. You have to parse each document separately.

Please correct me if I’m wrong, but since the content of the files is read within a loop, each file will be parsed separately.
So what does that mean? That I need a separate XmlReader class for each file?

No, you need to use a separate XMLReader instance for each, and that’s not how your code is written.

That’s exactly what I wanted to avoid, because I never know how many XML files are in each archive and I want to keep memory consumption as low as possible. Well, I’ll see if I can get it to work with your suggestion and whether the XML error still occurs. I am also open to other suggestions! Thank you.

I tested and the styles file parses properly on its own.

Ok, then I don’t understand why it needs XmlReader.Reset.

Ah, I haven’t used XMLReader in the past, but a Reset between files should do it. I haven’t tested that though.
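The "one reader per document" advice might look roughly like the sketch below. It is untested against the demo project; ParseEachDocument and xmlParts are hypothetical names, and a real project would create its own XmlReader subclass (with its event handlers) instead of the bare class. Reset is left out because its behavior is not confirmed anywhere in this thread.

```xojo
Sub ParseEachDocument(xmlParts() As String)
  // xmlParts is a hypothetical array holding the text of each XML file from the archive.
  For Each xmlText As String In xmlParts
    // A fresh reader per document, so no parser state carries over.
    Dim reader As New XmlReader
    Try
      reader.Parse( xmlText, True ) // True marks this text as the complete document
    Catch err As RuntimeException
      System.DebugLog( "Parse failed: " + err.Message )
    End Try
  Next
End Sub
```

Because reader is declared inside the loop, each instance goes out of scope at the end of its iteration, so only one reader is alive at a time regardless of how many files the archive holds.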

When I tried to always call XmlReader.Reset after parsing, nothing in the loop worked anymore.