How to get unknown Encoding of TextInputStream?

Martin_T · January 7, 2015, 8:04am

Hi,

this is driving me crazy. How can I get the encoding of a TextInputStream? That is, after loading from a FolderItem, the text files should automatically be converted to UTF-8. I will load text files that are encoded in Unicode, ASCII, ANSI, UTF-8 (do nothing) and ANSEL (not supported by Xojo).

Beatrix_Willius · January 7, 2015, 8:29am

That’s a nice complicated topic. There are the byte order marks (BOM), which may tell you the encoding. See http://en.wikipedia.org/wiki/Byte_order_mark . I’ve got some code for this lying around.

After that comes the guessing. You can exclude invalid encodings and then hope for the best. Check out StringUtils from Joe Strout (http://www.strout.net/info/coding/rb/intro.html).

Martin_T · January 7, 2015, 8:49am

Thank you for the Tip.

What I don’t understand: if I set TextInputStream.Encoding to the right format before reading the lines, it works correctly.
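
For reference, the pattern described here looks roughly like the following. This is a minimal sketch, assuming `f` is a valid FolderItem and that the file is already known to be Windows ANSI; both are assumptions made for illustration only.

```xojo
// Sketch: read a file whose encoding is known in advance.
// Assumes f is a valid FolderItem (hypothetical variable).
Dim t As TextInputStream = TextInputStream.Open(f)
t.Encoding = Encodings.WindowsANSI // tell Xojo how to interpret the bytes
Dim s As String = t.ReadAll
t.Close

// s is now tagged as Windows ANSI; convert it if UTF-8 is wanted
s = ConvertEncoding(s, Encodings.UTF8)
```

This only works because the encoding is supplied by hand. The stream does not detect it.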

Michel_Bujardet · January 7, 2015, 1:08pm

http://documentation.xojo.com/index.php/Encoding ?

Michael_Hußmann · January 7, 2015, 9:29pm

Of course it does. Alternatively, you could read the file and then assign the encoding of the resulting string using DefineEncoding, which would work just as well. Using TextInputStream.Encoding is just a simpler and more concise way to achieve the same thing, provided the encoding is known beforehand. But whatever you do, you must find out what the encoding is so you can tell Xojo which encoding to assign. In difficult cases you will have to read the string (with an unknown encoding), then check for a BOM, try likely encodings and check whether they are valid, etc., as Beatrix suggested. Once you have an idea of what the encoding is, you assign it with DefineEncoding. If you want, you could then use ConvertEncoding to convert the string to UTF-8 or whatever.
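
The difference between the two calls can be sketched as follows. This is only an illustration: the variable names are made up, and the sample bytes are assumed to be the word "Hi" stored as UTF-16 little-endian.

```xojo
// Sketch: raw bytes with no encoding attached (built with ChrB for illustration).
// Assumption: the bytes are "Hi" in UTF-16 little-endian.
Dim raw As String = ChrB(&h48) + ChrB(0) + ChrB(&h69) + ChrB(0)

// DefineEncoding only relabels the string; the bytes stay the same
Dim tagged As String = DefineEncoding(raw, Encodings.UTF16LE)

// ConvertEncoding rewrites the bytes into the target encoding
Dim utf8 As String = ConvertEncoding(tagged, Encodings.UTF8)
```

Calling ConvertEncoding on a string whose encoding has not been defined correctly first will not give a useful result, which is why the order matters.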

Greg_O_Lone · January 8, 2015, 12:21pm

Don’t forget that you can check if a string is valid for a specific encoding with http://documentation.xojo.com/index.php/TextEncoding.IsValidData
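
A minimal sketch of how IsValidData could be used to try candidates in order. `FirstValidEncoding` is a hypothetical helper name, not part of Xojo, and the candidate list is just an example.

```xojo
// Sketch: return the first candidate encoding for which the raw bytes are valid.
// FirstValidEncoding is a hypothetical helper, not a Xojo built-in.
Function FirstValidEncoding(raw As String) As TextEncoding
  Dim candidates() As TextEncoding
  candidates.Append Encodings.UTF8        // strict, so test it first
  candidates.Append Encodings.WindowsANSI // lenient fallback
  
  For Each enc As TextEncoding In candidates
    If enc.IsValidData(raw) Then Return enc
  Next
  
  Return Nil // nothing matched; the caller has to pick a fallback
End Function
```

Order matters here: single-byte encodings tend to accept almost any byte sequence, so the stricter multi-byte encodings should be tried first.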

Martin_T · September 3, 2015, 6:16pm

Hi,

I found this cool snippet on the Xojo forum:

[code]
Dim fileIdent As UInt32
Dim firstFour, s As String
Dim bs As BinaryStream

// Determine the kind of image file that we have and call the
// appropriate analyseXXX method to get width and height.
// Read the first four bytes of the file passed in and check to see if it
// is one of the image types that this program will process.
bs = BinaryStream.Open(f, False)

if bs = nil then
  s = "Unable to open the selected file """ + f.Name + """ in the analyseGraphic method"
  showAlertDialog kPgmErr, s, "Okay", "", "", "a", "", 0
  Return
end if

fileIdent = bs.ReadUInt32 // read first four bytes from the file
firstFour = Right("0000000" + Hex(fileIdent), 8) // zero-padded 8-digit hex string

if Left(firstFour, 4) = "424D" then // first two bytes 424D (BM): a BMP file?
  analyseBMP
else
  select case firstFour
  case "47494638" // first four bytes are &h47494638 (GIF8): a GIF file
    analyseGIF
  case "FFD8FFE0" // first four bytes are &hFFD8FFE0: a JPG file
    analyseJPG
  case "FFD8FFE1"
    analyseJPG
  case "89504E47" // first four bytes are &h89504E47: a PNG file
    analysePNG
  case "4D4D002A" // is this one of the TIF identifiers?
    analyseTIF
  case "49492A00" // is this the other TIF identifier?
    bs.LittleEndian = true // this TIF format has low byte first
    analyseTIF
  case "00000000" // does this look like a PICT file? Can't find any written spec for PICT
    analysePCT
  else
    imageKind = "?" // not a known image file
  end select
end if
bs.Close
[/code]
It checks the file header via a BinaryStream to determine the file format.

UTF-8 files are marked by EF BB BF, UTF-16 (BE) by FE FF and UTF-16 (LE) by FF FE at the beginning of the file. How do I edit the source to get the right encoding? I also want to import ASCII-encoded text files, but I don’t know which encoding the users’ files will have…

After checking, the text files can be loaded via TextInputStream with the right encoding!
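
One way the header check could be adapted to the three signatures listed above is sketched below. `BOMEncoding` is a hypothetical helper name, `f` is assumed to be a valid, readable FolderItem, and the sketch covers only those three BOMs (it would, for example, mistake a UTF-32 LE file, which starts with FF FE 00 00, for UTF-16 LE).

```xojo
// Sketch: peek at the first bytes of a file and map a BOM to an encoding.
// BOMEncoding is a hypothetical helper; f is assumed to be a valid FolderItem.
Function BOMEncoding(f As FolderItem) As TextEncoding
  Dim bs As BinaryStream = BinaryStream.Open(f, False)
  Dim head As String = bs.Read(3) // may return fewer bytes for very short files
  bs.Close
  
  Dim n As Integer = LenB(head)
  Dim b0 As Integer = AscB(MidB(head, 1, 1))
  Dim b1 As Integer = AscB(MidB(head, 2, 1))
  Dim b2 As Integer = AscB(MidB(head, 3, 1))
  
  If n >= 3 And b0 = &hEF And b1 = &hBB And b2 = &hBF Then Return Encodings.UTF8
  If n >= 2 And b0 = &hFE And b1 = &hFF Then Return Encodings.UTF16BE
  If n >= 2 And b0 = &hFF And b1 = &hFE Then Return Encodings.UTF16LE
  
  Return Nil // no BOM found; the encoding has to be guessed another way
End Function
```

A Nil result is the common case for plain ASCII or ANSI files, since those carry no BOM at all, so a fallback guess is still needed.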

Matthew_Combatti · September 3, 2015, 7:15pm

Try this…and add any other encodings as needed…

 Shared Function GuessEncoding(s As String) As TextEncoding
  // Guess what text encoding the text in the given string is in.
  // This ignores the encoding set on the string, and guesses
  // one of the following:
  //
  //   * UTF-32
  //   * UTF-16
  //   * UTF-8
  //   * Encodings.SystemDefault
  //
  // Written by Joe Strout
  
  #pragma DisableBackgroundTasks
  #pragma DisableBoundsChecking
  
  static isBigEndian, endianChecked As Boolean
  if not endianChecked then
    Dim temp As String = Encodings.UTF16.Chr( &hFEFF )
    isBigEndian = (AscB( MidB( temp, 1, 1 ) ) = &hFE)
    endianChecked = true
  end if
  
  // check for a BOM
  Dim b0 As Integer = AscB( s.MidB( 1, 1 ) )
  Dim b1 As Integer = AscB( s.MidB( 2, 1 ) )
  Dim b2 As Integer = AscB( s.MidB( 3, 1 ) )
  Dim b3 As Integer = AscB( s.MidB( 4, 1 ) )
  if b0=0 and b1=0 and b2=&hFE and b3=&hFF then
    // UTF-32, big-endian
    if isBigEndian then
      #if RBVersion >= 2012.02
        return Encodings.UTF32
      #else
        return Encodings.UCS4
      #endif
    else
      return Encodings.UTF32BE
    end if
  elseif b0=&hFF and b1=&hFE and b2=0 and b3=0 and s.LenB >= 4 then
    // UTF-32, little-endian
    if isBigEndian then
      return Encodings.UTF32LE
    else
      #if RBVersion >= 2012.02
        return Encodings.UTF32
      #else
        return Encodings.UCS4
      #endif
    end if
  elseif b0=&hFE and b1=&hFF then
    // UTF-16, big-endian
    if isBigEndian then
      return Encodings.UTF16
    else
      return Encodings.UTF16BE
    end if
  elseif b0=&hFF and b1=&hFE then
    // UTF-16, little-endian
    if isBigEndian then
      return Encodings.UTF16LE
    else
      return Encodings.UTF16
    end if
  elseif b0=&hEF and b1=&hBB and b2=&hBF then
    // UTF-8 (ah, a sensible encoding where endianness doesn't matter!)
    return Encodings.UTF8
  end if
  
  // no BOM; see if it's entirely ASCII.
  Dim m As MemoryBlock = s
  Dim i, maxi As Integer = s.LenB - 1
  for i = 0 to maxi
    if m.Byte(i) > 127 then exit
  next
  if i > maxi then return Encodings.ASCII
  
  // Not ASCII; check for a high incidence of nulls every other byte,
  // which suggests UTF-16 (at least in Roman text).
  Dim nulls(1) As Integer  // null count in even (0) and odd (1) bytes
  for i = 0 to maxi
    if m.Byte(i) = 0 then
      nulls(i mod 2) = nulls(i mod 2) + 1
    end if
  next
  if nulls(0) > nulls(1)*2 and nulls(0) > maxi\2 then
    // UTF-16, big-endian
    if isBigEndian then
      return Encodings.UTF16
    else
      return Encodings.UTF16BE
    end if
  elseif nulls(1) > nulls(0)*2 and nulls(1) > maxi\2 then
    // UTF-16, little-endian
    if isBigEndian then
      return Encodings.UTF16LE
    else
      return Encodings.UTF16
    end if
  end if
  
  // it's not ASCII; check for illegal UTF-8 characters.
  // See Table 3.1B, "Legal UTF-8 Byte Sequences",
  // at <http://unicode.org/versions/corrigendum1.html>
  Dim b As Byte
  for i = 0 to maxi
    select case m.Byte(i)
    case &h00 to &h7F
      // single-byte character; just continue
    case &hC2 to &hDF
      // one additional byte
      if i+1 > maxi then exit for
      b = m.Byte(i+1)
      if b < &h80 or b > &hBF then exit for
      i = i+1
    case &hE0
      // two additional bytes
      if i+2 > maxi then exit for
      b = m.Byte(i+1)
      if b < &hA0 or b > &hBF then exit for
      b = m.Byte(i+2)
      if b < &h80 or b > &hBF then exit for
      i = i+2
    case &hE1 to &hEF
      // two additional bytes
      if i+2 > maxi then exit for
      b = m.Byte(i+1)
      if b < &h80 or b > &hBF then exit for
      b = m.Byte(i+2)
      if b < &h80 or b > &hBF then exit for
      i = i+2
    case &hF0
      // three additional bytes
      if i+3 > maxi then exit for
      b = m.Byte(i+1)
      if b < &h90 or b > &hBF then exit for
      b = m.Byte(i+2)
      if b < &h80 or b > &hBF then exit for
      b = m.Byte(i+3)
      if b < &h80 or b > &hBF then exit for
      i = i+3
    case &hF1 to &hF3
      // three additional bytes
      if i+3 > maxi then exit for
      b = m.Byte(i+1)
      if b < &h80 or b > &hBF then exit for
      b = m.Byte(i+2)
      if b < &h80 or b > &hBF then exit for
      b = m.Byte(i+3)
      if b < &h80 or b > &hBF then exit for
      i = i+3
    case &hF4
      // three additional bytes
      if i+3 > maxi then exit for
      b = m.Byte(i+1)
      if b < &h80 or b > &h8F then exit for
      b = m.Byte(i+2)
      if b < &h80 or b > &hBF then exit for
      b = m.Byte(i+3)
      if b < &h80 or b > &hBF then exit for
      i = i+3
    else
      exit for
    end select
  next i
  if i > maxi then return Encodings.UTF8  // no illegal UTF-8 sequences, so that's probably what it is
  
  // If not valid UTF-8, then let's just guess the system default.
  return Encodings.SystemDefault
End Function

Code from Xojodevspot.com

Martin_T · September 3, 2015, 10:10pm

Thanks Mat, but how do I set the String s? I need to load it via TextInputStream, and for that I need to know the encoding.

Norman_P · September 3, 2015, 10:15pm

Use a binary stream - read it as raw data
Examine the bytes & see what you can guess
Then assign the encoding

Matthew_Combatti · September 4, 2015, 3:57am

Dim bs As BinaryStream
Dim s As String

bs = BinaryStream.Open(f) // f is the FolderItem to read
s = bs.Read(bs.Length)    // raw bytes of the whole file
bs.Close

Dim theEncoding As TextEncoding = GuessEncoding(s)

theEncoding now holds the guessed encoding.
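
Putting the pieces of this thread together, loading a file and getting UTF-8 back might look like the sketch below. `LoadAsUTF8` is a hypothetical helper name, and it assumes the GuessEncoding function posted above is reachable from wherever this is placed.

```xojo
// Sketch: load any text file and return its contents as UTF-8.
// LoadAsUTF8 is a hypothetical helper; it relies on the GuessEncoding
// function posted earlier in this thread being available in scope.
Function LoadAsUTF8(f As FolderItem) As String
  If f Is Nil Or Not f.Exists Then Return ""
  
  Dim bs As BinaryStream = BinaryStream.Open(f, False)
  Dim raw As String = bs.Read(bs.Length) // raw bytes, no encoding assumed
  bs.Close
  
  // Label the bytes with the guessed encoding, then convert to UTF-8
  Dim enc As TextEncoding = GuessEncoding(raw)
  Return ConvertEncoding(DefineEncoding(raw, enc), Encodings.UTF8)
End Function
```

Because GuessEncoding falls back to Encodings.SystemDefault, files from another platform or code page can still come out wrong, so the result is best treated as a guess.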

Martin_T · September 4, 2015, 12:33pm

@Matthew Combatti : YOU ARE THE MAN! Thank you

Beatrix_Willius · September 4, 2015, 3:23pm

Please remember that the above functions are only very rough estimates. Don’t expect wonders.

I’m parsing mail data. Mails may have an HTML encoding and a content type. Additionally, I’m using UniversalCharacterDetectionMBS. For my data all three can be different.