Force encoding?

I have a text file that appears to be in some extended UTF format.
It begins with FF FE, and then ‘ordinary text’ seems to be padded with zero bytes

[screenshot: hex dump of the file, FF FE followed by null-padded text]

If I try to convert encodings to coerce it down to ASCII or UTF-8, it doesn’t change.
Is there a way to make that become

[screenshot: the same text as plain, single-byte characters]

That’s UTF-16, and the FF FE at the beginning is the byte order mark (FF FE indicates little-endian). You should be able to do this to get it into a usable string:

s=s.DefineEncoding(Encodings.UTF16)

ConvertEncoding is no use as long as you don’t know which encoding the string is in to begin with. First use DefineEncoding (Eric appears to be right – it’s UTF-16), and only then can you convert it to whichever encoding you prefer.

Hmmm.
I think I see… however I won’t know in advance what I will be getting.

Why not? What’s the source of the data?

It could come from anywhere.

Basically ‘gimme a text file’

I’ve had files with a variety of encodings, depending upon who sent it and the original source of the data.

Well - the best way, of course, is to have the submitter tell you what the encoding is.

Beyond that, I think you’re looking at a process of elimination:

  1. Is there a UTF BOM? If so, follow it - with the understanding that even this isn’t 100% reliable, as you could be receiving a file in an old codepage encoding whose first bytes happen to match; unlikely, since a UTF BOM will look like gobbledygook in any other encoding.
  2. Does the entire stream of data consist of solely bytes < 128? Assume it is ASCII.
  3. If not 1 or 2, then you’re getting into the weeds. There isn’t a 100% reliable way to detect encodings. You could work up an algorithm to look at the bytes and make educated guesses, to be approved by the user.

In my recent experience, the vast majority of files in Western languages are in ASCII, UTF-8, UTF-16, Windows Latin-1, and Macintosh encodings.


The Text encoding class has a method called IsValidData which should help with this.

But this just validates whether the data is valid - it doesn’t indicate whether it is correct. I’m pretty sure data in any of the old 8-bit codepages would validate just fine, while giving no insight as to which encoding it actually is.

Encodings.ASCII.IsValidData is probably the only encoding where valid → accurate.

I’m not so sure about that. If you were to give UTF-16 data to UTF-8 and ask it if it’s valid, I would expect it to return false. Honestly I’ve never had to deal with this though.

edit:

I just did a quick test using some text that’s UTF-16 and used IsValidData on all of the encodings that Xojo has access to on my Mac. Of the 93 encodings that are available, 44 of them returned false and 49 returned true. ASCII was among the false ones, but UTF-8, 16 and 32 all returned true.

Right. So it can say if an encoding is NOT accurate for the text, but not if it is the correct encoding. As you saw, UTF-16 shows a false positive on a ton of encodings, probably all the old 8-bit codepages.
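The false positives are easy to demonstrate outside Xojo. Latin-1, like most old 8-bit codepages, assigns a character to every possible byte value, so UTF-16 data always passes a Latin-1 validity check even though the result is nonsense. In Python, for example:

```python
utf16_bytes = "Hello".encode("utf-16-le")   # b'H\x00e\x00l\x00l\x00o\x00'

# Latin-1 maps all 256 byte values to characters, so decoding
# (i.e. "validation") can never fail...
decoded = utf16_bytes.decode("latin-1")

# ...but the result is garbage: the interleaved NULs come through
# as real characters.
print(repr(decoded))  # 'H\x00e\x00l\x00l\x00o\x00'
```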

Where is it coming from?

I have no control over this.
There are potentially hundreds of clients, providing text files that could have been made by copying from PDFs, using internet based conversion apps, manually typing, exporting from Excel… anything.

It’s easy to manually open the files in Notepad and save as ANSI.
What I need to work out is how to do that in Xojo code, or try to explain to non-tech-savvy people about encodings, and why their seemingly fine text files cannot be processed.

(I’m not alone - I’ve spent 20 years supporting one major software product that also cannot process a text file that has a BOM, and I know of several others.)

There’s a Python library called chardet that does a decent job of determining the encoding of a text file, providing it as a string along with a confidence score. It also ships with a command-line tool that your app can shell out to. Or, you could use the Python classes in the MBS or Einhugur plugins to run the Python inline in Xojo.

Alternatively, you can try to port it to Xojo but that will be a project in and of itself.

What’s in these files?

If there was a common word, phrase, number, etc. that you could rely on to be in every file, that could act as a clue as to the encoding.

You could even require the submitters to type a magic word at the beginning of the file, like “comó” (note the accent), that your code could use to check the encoding.

There’s always @Kem_Tekinay’s GuessEncoding function.

This is from the M_String project and includes the bug fix noted here.

It doesn’t cover all encodings, of course, but it does a good job with BOM used by unicode. I use it in a couple of my projects.

Protected Function GuessEncoding(s As String, ByRef outWrongOrder As Boolean) As TextEncoding
  // Guess what text encoding the text in the given string is in.
  // This ignores the encoding set on the string, and guesses
  // one of the following:
  //
  //   * UTF-32
  //   * UTF-16
  //   * UTF-8
  //   * Encodings.SystemDefault
  //
  // If the UTF-32 or UTF-16 is in the wrong byte order for this platform,
  // then outWrongOrder will be set to true.
  
  static isBigEndian, endianChecked As Boolean
  if not endianChecked then
    Dim temp As String = Encodings.UTF16.Chr( &hFEFF )
    isBigEndian = (AscB( MidB( temp, 1, 1 ) ) = &hFE)
    endianChecked = true
  end if
  
  // check for a BOM
  Dim b0 As Integer = AscB( s.MidB( 1, 1 ) )
  Dim b1 As Integer = AscB( s.MidB( 2, 1 ) )
  Dim b2 As Integer = AscB( s.MidB( 3, 1 ) )
  Dim b3 As Integer = AscB( s.MidB( 4, 1 ) )
  if b0=0 and b1=0 and b2=&hFE and b3=&hFF then
    // UTF-32, big-endian
    outWrongOrder = not isBigEndian
    #if RBVersion < 2012.02
      return Encodings.UCS4
    #else
      return Encodings.UTF32
    #endif
  elseif b0=&hFF and b1=&hFE and b2=0 and b3=0 and s.LenB >= 4 then
    // UTF-32, little-endian
    outWrongOrder = isBigEndian
    #if RBVersion < 2012.02
      return Encodings.UCS4
    #else
      return Encodings.UTF32
    #endif
  elseif b0=&hFE and b1=&hFF then
    // UTF-16, big-endian
    outWrongOrder = not isBigEndian
    return Encodings.UTF16
  elseif b0=&hFF and b1=&hFE then
    // UTF-16, little-endian
    outWrongOrder = isBigEndian
    return Encodings.UTF16
  elseif b0=&hEF and b1=&hBB and b2=&hBF then
    // UTF-8 (ah, a sensible encoding where endianness doesn't matter!)
    return Encodings.UTF8
  end if
  
  // no BOM; see if it's entirely ASCII.
  Dim m As MemoryBlock = s
  Dim i, maxi As Integer = s.LenB - 1
  for i = 0 to maxi
    if m.Byte(i) > 127 then exit
  next
  if i > maxi then return Encodings.ASCII
  
  // Not ASCII; check for a high incidence of nulls every other byte,
  // which suggests UTF-16 (at least in Roman text).
  Dim nulls(1) As Integer  // null count in even (0) and odd (1) bytes
  for i = 0 to maxi
    if m.Byte(i) = 0 then
      nulls(i mod 2) = nulls(i mod 2) + 1
    end if
  next
  if nulls(0) > nulls(1)*2 and nulls(0) > maxi\2 then
    // UTF-16, big-endian
    outWrongOrder = not isBigEndian
    return Encodings.UTF16
  elseif nulls(1) > nulls(0)*2 and nulls(1) > maxi\2 then
    // UTF-16, little-endian
    outWrongOrder = isBigEndian
    return Encodings.UTF16
  end if
  
  // it's not ASCII; check for illegal UTF-8 characters.
  // See Table 3.1B, "Legal UTF-8 Byte Sequences",
  // at <http://unicode.org/versions/corrigendum1.html>
  Dim b As Byte
  for i = 0 to maxi
    select case m.Byte(i)
    case &h00 to &h7F
      // single-byte character; just continue
    case &hC2 to &hDF
      // one additional byte
      if i+1 > maxi then exit for
      b = m.Byte(i+1)
      if b < &h80 or b > &hBF then exit for
      i = i+1
    case &hE0
      // two additional bytes
      if i+2 > maxi then exit for
      b = m.Byte(i+1)
      if b < &hA0 or b > &hBF then exit for
      b = m.Byte(i+2)
      if b < &h80 or b > &hBF then exit for
      i = i+2
    case &hE1 to &hEF
      // two additional bytes
      if i+2 > maxi then exit for
      b = m.Byte(i+1)
      if b < &h80 or b > &hBF then exit for
      b = m.Byte(i+2)
      if b < &h80 or b > &hBF then exit for
      i = i+2
    case &hF0
      // three additional bytes
      if i+3 > maxi then exit for
      b = m.Byte(i+1)
      if b < &h90 or b > &hBF then exit for
      b = m.Byte(i+2)
      if b < &h80 or b > &hBF then exit for
      b = m.Byte(i+3)
      if b < &h80 or b > &hBF then exit for
      i = i+3
    case &hF1 to &hF3
      // three additional bytes
      if i+3 > maxi then exit for
      b = m.Byte(i+1)
      if b < &h80 or b > &hBF then exit for
      b = m.Byte(i+2)
      if b < &h80 or b > &hBF then exit for
      b = m.Byte(i+3)
      if b < &h80 or b > &hBF then exit for
      i = i+3
    case &hF4
      // three additional bytes
      if i+3 > maxi then exit for
      b = m.Byte(i+1)
      if b < &h80 or b > &h8F then exit for
      b = m.Byte(i+2)
      if b < &h80 or b > &hBF then exit for
      b = m.Byte(i+3)
      if b < &h80 or b > &hBF then exit for
      i = i+3
    else
      exit for
    end select
  next i
  if i > maxi then return Encodings.UTF8  // no illegal UTF-8 sequences, so that's probably what it is
  
  // If not valid UTF-8, then let's just guess the system default.
  return Encodings.SystemDefault
  
End Function

I would do similar

  • UTF BOM
  • UTF8 IsValidData
  • Use UniversalCharacterDetectionMBS class to check for 8bit encoding.
  • Default to the platform encoding, e.g. Windows ANSI on Windows.

The UniversalCharacterDetectionMBS class does some probability checks, such as looking at the distribution of byte values.


Check the first few bytes for a byte order mark:

If one is not present, you could use the character set detection features in MBS, which I think uses this project:

As a last resort you will probably need to have a list of character sets to choose from and possibly some kind of preview window that shows the effect of changing the encoding on the text.

That’s what I did. I put together a selector for the encoding and then show the data from the first 10–20 rows of the file. Mine are delimited columns of data, so I show it in a grid format. Seems to work well.

Not elegant, but ReplaceAll(sLine, Chr(0), "") worked for me.
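For comparison, the same brute-force fix in Python. It only works because Roman text in UTF-16 is ASCII interleaved with NULs; it discards the encoding information rather than detecting it:

```python
line = "H\x00e\x00l\x00l\x00o\x00"  # UTF-16-LE bytes misread as 8-bit text

# Equivalent of ReplaceAll(sLine, Chr(0), ""): drop the interleaved NULs.
cleaned = line.replace("\x00", "")
print(cleaned)  # Hello
```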