Find the encoding

I am writing software that reads files in different encodings. Is there a way to find out which encoding a file uses (UTF-8, UTF-16, ISO, etc.)?

The easy way is to check for a BOM at the start of the file. Beyond that, it really depends on your data. You can check whether the text contains so-called high ASCII (bytes above &h7F), look for valid multibyte sequences, and do some other checks. Check out the M_String module from Kem Tekinay.
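As a rough sketch of the BOM check (not taken from M_String; `EncodingFromBOM` is a made-up name, and the file's bytes are assumed to be in a MemoryBlock already), something like this should work:

```xojo
' Hypothetical helper: returns the encoding named by a BOM, or Nil if there is none.
Function EncodingFromBOM(m As MemoryBlock) As TextEncoding
  Dim n As Integer = m.Size
  
  ' Test the 4-byte UTF-32 marks first: the UTF-32 LE mark begins with
  ' the same two bytes (FF FE) as the UTF-16 LE mark.
  If n >= 4 Then
    If m.Byte(0) = &hFF And m.Byte(1) = &hFE And m.Byte(2) = 0 And m.Byte(3) = 0 Then Return Encodings.UTF32LE
    If m.Byte(0) = 0 And m.Byte(1) = 0 And m.Byte(2) = &hFE And m.Byte(3) = &hFF Then Return Encodings.UTF32BE
  End If
  
  If n >= 3 Then
    If m.Byte(0) = &hEF And m.Byte(1) = &hBB And m.Byte(2) = &hBF Then Return Encodings.UTF8
  End If
  
  If n >= 2 Then
    If m.Byte(0) = &hFF And m.Byte(1) = &hFE Then Return Encodings.UTF16LE
    If m.Byte(0) = &hFE And m.Byte(1) = &hFF Then Return Encodings.UTF16BE
  End If
  
  Return Nil ' no BOM: fall back to guessing
End Function
```

When there is no BOM, a strict UTF-8 validity test is a reasonable next step, because text in a legacy 8-bit encoding rarely happens to form valid UTF-8 multibyte sequences. Pure ASCII passes the test too, which is harmless. As far as I know, `TextEncoding.IsValidData` can be used for that:

```xojo
Dim raw As String = m.StringValue(0, m.Size)
If Encodings.UTF8.IsValidData(raw) Then
  ' Probably UTF-8 (or plain ASCII)
End If
```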

But encodings can be screwed up in so many ways. And it’s hard to say when a text has the correct encoding.

And I haven’t even mentioned mojibake when encoding has been applied incorrectly.

A given sequence of bytes may be a “valid” string in multiple encodings. The best you can do is make an educated guess.
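A small illustration of that point, using the classic `DefineEncoding` function (in API 2.0 the equivalent should be `raw.DefineEncoding(...)`). The single byte &hE9 is a perfectly legal character in both encodings, so nothing in the byte itself tells you which one was intended:

```xojo
' One byte, two equally "valid" readings.
Dim m As New MemoryBlock(1)
m.Byte(0) = &hE9

Dim raw As String = m.StringValue(0, 1)
Dim asAnsi As String = DefineEncoding(raw, Encodings.WindowsANSI) ' é in Windows-1252
Dim asMac As String = DefineEncoding(raw, Encodings.MacRoman)     ' È in MacRoman
```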

Here is the link to M_String, which has a GuessEncoding function:

http://www.mactechnologies.com/index.php?page=downloads#m_string

You can use this: https://documentation.xojo.com/api/text/encoding_text/encoding.html

Or this “old” method (Kem’s method with some more encodings recognized):

And I have a version written by Joe Strout, which looks a lot like Kem’s, but it has Joe’s name in it. All of the variations look like they started out from the same codebase and were embroidered upon. Any thoughts on which one of these might be considered “most accurate”?

Obviously mine. :slight_smile:

I might well have started with Joe’s version. A few of my functions were lifted from StringUtils, just for the convenience of not having to install two modules.

And I knew that was going to be the answer… :crazy_face:

Seriously though, I should revisit that and similar functions. I’ve learned so much about text encodings since I wrote those.

And I didn’t know that Xojo has a GuessJapaneseEncoding method. The things we learn!

And I started from Kem’s method and added the 3 encodings at the bottom of the method, which were not properly recognized…

This will only return the encoding that was previously defined for the string in Xojo, NOT the actual encoding of the bytes read from a file.
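To make that concrete, here is a minimal sketch using the classic global `Encoding` and `DefineEncoding` functions (in API 2.0 these should be `s.Encoding` and `s.DefineEncoding`). The two bytes are "é" encoded as UTF-8, but the string is deliberately labeled as MacRoman:

```xojo
Dim m As New MemoryBlock(2)
m.Byte(0) = &hC3
m.Byte(1) = &hA9 ' C3 A9 is "é" in UTF-8

' Attach the wrong label on purpose.
Dim s As String = DefineEncoding(m.StringValue(0, 2), Encodings.MacRoman)

' Encoding(s) reports MacRoman, because that is the label just attached.
' The bytes themselves were never inspected.
Dim tag As TextEncoding = Encoding(s)
```

Displayed as MacRoman, those two bytes show up as "√©", which is the kind of mojibake mentioned earlier.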

Your code does not work. You misinterpreted the answer on Stack Overflow. The characters in both tables appear in BOTH encodings; some are just more frequent in one encoding than in the other. So you are supposed to COUNT how many characters from each list are in the text and use that to guess the encoding statistically, NOT decide at the first appearance of one of the characters. Also, you are ignoring the characters that are actually exclusive to MacRoman. Another thing: when the code only needs ONE match, it is better to EXIT the loop as soon as it is found.

I think it would be something like this:

// http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and
'MacRoman VS windows 1252

'Bytes that are undefined in Windows 1252: one occurrence is enough
Dim found As Boolean = False
For i = 0 To maxi
  Select Case m.Byte(i)
  Case &h81, &h8D, &h8F, &h90, &h9D
    found = True
    Exit For
  End Select
Next
If found Then Return Encodings.MacRoman


'Statistical approach: count the bytes that are typical of each encoding
Dim MacRomanCars As Integer
Dim AnsiCars As Integer

For i = 0 To maxi
  Select Case m.Byte(i)
  Case &h8E, &h8F, &h9A, &hA1, &hA5, &hA8, &hD0, &hD1, &hD5, &hE1
    MacRomanCars = MacRomanCars + 1
  Case &h92, &h95, &h96, &h97, &hAE, &hB0, &hB7, &hE8, &hE9, &hF6
    AnsiCars = AnsiCars + 1
  End Select
Next

'On a tie, default to Windows ANSI
If AnsiCars >= MacRomanCars Then
  Return Encodings.WindowsANSI
Else
  Return Encodings.MacRoman
End If

As for the last part, those 5 characters are not exclusive to DOSLatinUS.