Defining encoding of an undefined file?

Chris_Verberne · September 13, 2014, 11:22am

Hi everyone,

My application needs to know the encoding of files I receive, but neither I nor my application knows that encoding in advance.

I cannot use “DefineEncoding” because with that function, the encoding has to be known in advance.

I searched this forum and the internet but I cannot find what I need. Any links or references to threads on this forum will be fine.

Any ideas how I can get the encoding in Xojo?

Greetings,

Chris

Michel_Bujardet · September 13, 2014, 11:33am

Try http://documentation.xojo.com/index.php/GetTextEncoding

Michael_Hußmann · September 13, 2014, 11:42am

There is only so much Xojo can do for you here. Basically it is your task to find out what the encoding might be. You should look whether a BOM is present; that would provide a clue. And you could use TextEncoding.IsValidData to check whether a string makes sense in several likely encodings.
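A minimal sketch of the BOM check, assuming the first bytes of the file are read raw with a BinaryStream. EncodingFromBOM and the FolderItem f are hypothetical names for illustration, not part of Xojo:

[code]// Hypothetical helper, not a built-in Xojo function.
// Returns the encoding implied by a leading byte order mark, or Nil if there is none.
Function EncodingFromBOM(data As String) As TextEncoding
  Dim b(3) As Integer
  For i As Integer = 0 To 3
    b(i) = -1 // -1 marks "no byte at this position"
    If LenB(data) > i Then b(i) = AscB(MidB(data, i + 1, 1))
  Next

  // Test the 4-byte UTF-32 marks first: FF FE 00 00 begins with the UTF-16 LE mark FF FE.
  If b(0) = &h00 And b(1) = &h00 And b(2) = &hFE And b(3) = &hFF Then Return Encodings.UTF32BE
  If b(0) = &hFF And b(1) = &hFE And b(2) = &h00 And b(3) = &h00 Then Return Encodings.UTF32LE
  If b(0) = &hEF And b(1) = &hBB And b(2) = &hBF Then Return Encodings.UTF8
  If b(0) = &hFE And b(1) = &hFF Then Return Encodings.UTF16BE
  If b(0) = &hFF And b(1) = &hFE Then Return Encodings.UTF16LE

  Return Nil // no BOM: the encoding still has to be guessed some other way
End Function[/code]

It could be called along these lines:

[code]Dim bs As BinaryStream = BinaryStream.Open(f, False) // f is a FolderItem chosen elsewhere
Dim head As String = bs.Read(4) // raw bytes, no encoding assumed
bs.Close
Dim enc As TextEncoding = EncodingFromBOM(head)
If enc Is Nil Then MsgBox("No BOM found")[/code]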

Chris_Verberne · September 13, 2014, 12:15pm

Hello Michel and Michael,

Thank you very much for your replies, they are very much appreciated.

I already looked in the Xojo documentation and did not find anything I can work with (I saw that GetTextEncoding page too).

There is also no Byte Order Mark present.

However, I experimented a little and wrote a short piece of code as an example. There is no error checking. The user gets a dialog to select the first file, after which the whole content is read into the string “strValue”. The file is closed and its encoding is stored in “encIObit_1”. Then another dialog is presented to choose the second file. Again the content is read, the file is closed, and the encoding for the second file is stored in “encIObit_2”.

Then I compare the encodings, which must be the same. In my case I tested with the original IObit language file and the file I translated. Both are the same, so everything is fine.

Here is the example code :

[code]Dim f_Bestand As FolderItem
Dim tisBestand As TextInputStream
Dim strValue As String // Just the string used for testing
Dim encIObit_1 As TextEncoding // Encoding string/file 1
Dim encIObit_2 As TextEncoding // Encoding string/file 2

// Choose the first file
f_Bestand = GetOpenFolderItem(fltIOBit.IObitLanguage) // ask for the first file
tisBestand = TextInputStream.Open(f_Bestand)
strValue = tisBestand.ReadAll // Read the whole content of the file in this string
tisBestand.Close
encIObit_1 = strValue.Encoding // Get the encoding of the first file

// Choose second file
f_Bestand = GetOpenFolderItem(fltIOBit.IObitLanguage) // ask for the second file
tisBestand = TextInputStream.Open(f_Bestand)
strValue = tisBestand.ReadAll // Read the whole content of the file in this string
tisBestand.Close
encIObit_2 = strValue.Encoding // Get the encoding of the second file

If encIObit_1 <> encIObit_2 Then
  MsgBox("TextEncoding is different")
Else
  MsgBox("TextEncoding is the same as it should be")
End If[/code]

It is obvious that Xojo can compare the IObit original file and my own file. So I think that I can set the correct encoding for the file I created by :

strTranslatedText = DefineEncoding(strValue, encIObit_2)

where :
strTranslatedText = the text I just translated and am going to save
strValue = the original text from the unknown file

May I ask whether the checking method I just created is safe?

Thank you very much again for the time spent on my problem. I really do appreciate it because when I get this to work, I will save a lot of time on those language translations. Xojo does perfectly what I want during the translation process; this is only the last step which was holding me back. I will mention you too in the “About…” window.

Wish you both a very nice day and all the best.

Greetings,

Chris

Michael_Hußmann · September 13, 2014, 12:49pm

Did you check which encoding strValue.Encoding returns (just call “MsgBox encIObit_1.InternetName”)? In this case I would assume the encodings of both files to be unknown, in which case they would trivially be the same: both unknown.

However, if you know for certain that some piece of text must be present in the file, you could tentatively define some encoding for the string, then check whether that text is present (obviously you would first check TextEncoding.IsValidData to make sure the presumed encoding is valid at all). If the text isn’t found, you try a different encoding, and so on.
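That trial-and-error loop might look roughly like this. It is only a sketch: EncodingByKnownText is a hypothetical helper name, raw is assumed to hold the file's bytes as read without any encoding, and expected is assumed to be a UTF-8 string (as Xojo string literals are). Note that a purely ASCII search text cannot tell ASCII-compatible encodings apart, so the expected text should contain at least one non-ASCII character:

[code]// Hypothetical helper: tries each candidate and returns the first encoding in which
// the bytes are valid and the expected text appears. Returns Nil if none fits.
Function EncodingByKnownText(raw As String, expected As String, candidates() As TextEncoding) As TextEncoding
  For Each enc As TextEncoding In candidates
    If enc.IsValidData(raw) Then
      // DefineEncoding only labels the bytes; ConvertEncoding then brings the
      // trial string to UTF-8 so it can be compared with the expected text.
      Dim trial As String = ConvertEncoding(DefineEncoding(raw, enc), Encodings.UTF8)
      If InStr(trial, expected) > 0 Then Return enc
    End If
  Next
  Return Nil // none of the candidates produced the expected text
End Function[/code]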

Beatrix_Willius · September 13, 2014, 12:53pm

The bane of my existence: guessing encodings.

There is an old method in the StringUtils module from Joe Strout, which checks BOM etc. The MBS plugin has UniversalCharacterDetectionMBS, which is from Mozilla as far as I remember. Both methods sometimes are wrong. But it’s better than nothing.

Norman_P · September 13, 2014, 3:02pm

You can also use the TextEncoding.IsValidData method to filter out possible encodings.
It won’t tell you which it is but it will eliminate ones where the data is not valid for the encoding.
For instance EVERY single byte encoding can hold ANY data since, by definition, the encoding is all single byte values.
But UTF8, UTF16 and UTF32 may not, as there are runs of values that are illegal.
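As a rough sketch of that filtering (PlausibleEncodings is a hypothetical helper name, and the candidate list is just an example), something like this collects the encodings that survive:

[code]// Hypothetical helper: returns the candidates for which the raw bytes are valid.
Function PlausibleEncodings(raw As String, candidates() As TextEncoding) As TextEncoding()
  Dim result() As TextEncoding
  For Each enc As TextEncoding In candidates
    If enc.IsValidData(raw) Then result.Append(enc)
  Next
  Return result
End Function[/code]

[code]// Example use; strValue holds the file content as read.
Dim candidates() As TextEncoding
candidates.Append(Encodings.UTF8)
candidates.Append(Encodings.UTF16LE)
candidates.Append(Encodings.UTF16BE)
candidates.Append(Encodings.ISOLatin1)

Dim names() As String
For Each enc As TextEncoding In PlausibleEncodings(strValue, candidates)
  names.Append(enc.InternetName)
Next
MsgBox("Still possible: " + Join(names, ", "))[/code]

More than one candidate will usually remain, so this narrows the field rather than giving an answer.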

Chris_Verberne · September 13, 2014, 4:45pm

Michael :
I tried your suggestion and used two MsgBox dialogs to show the .InternetName. It is blank for both files, so you are correct: the original one is exactly the same as the one I created. So at least the encoding has not changed, and theoretically both files are fine.

I tried the following code (I started with UTF8 and moved on) :

If Encodings.ISOLatin1.IsValidData(strValue) Then
  MsgBox("Encoding is valid")
Else
  MsgBox("Encoding is NOT valid")
End If

With ISOLatin1 I received the message that the encoding is valid. But I moved on and also found that ISOLatin2, 3, 4 and 5 were valid. From ISOLatin6 onwards, the encoding was reported as invalid again.

Also UTF16BE is valid where UTF16LE shows invalid.

I think it is safe to follow the next steps :

1 Open the original English language file
2 Read the first line
3 Get the encoding of the string containing the first line
4 Show the value in a textfield
5 Change the value in that textfield to the Flemish language
6 Assign the encoding of the original line to the string containing the Flemish translation
7 Save the translated line to a new file
8 Move on to the next line and start at point 2 again (no longer using the first but the 2, 3… line)

I think the encoding of the original English file will be respected when writing to the translated Flemish file. If this is the case, it is OK and exactly what I intended. It would be nice to show the encoding while I am translating, but it is not necessary. It would only be a nice addition, and I think it would not be very accurate (based on my findings above).
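One caveat on steps 3 and 6: a string read from a file only carries the encoding the stream assigned to it, and DefineEncoding relabels bytes without changing them, whereas text typed into a TextField needs ConvertEncoding to be re-encoded. A hedged sketch of the read/translate/write loop, assuming the encoding enc has already been determined some other way (TranslateFile and TranslateLine are hypothetical names; TranslateLine is a stub standing in for the manual translation step):

[code]// Stub for illustration only: the real application would show the line
// in a TextField and return what the user typed.
Function TranslateLine(s As String) As String
  Return s
End Function

// Sketch only. Assumes enc was determined beforehand and that
// srcFile and dstFile are FolderItems chosen elsewhere.
Sub TranslateFile(srcFile As FolderItem, dstFile As FolderItem, enc As TextEncoding)
  Dim tis As TextInputStream = TextInputStream.Open(srcFile)
  tis.Encoding = enc // tell the stream how to interpret the bytes it reads

  Dim tos As TextOutputStream = TextOutputStream.Create(dstFile)

  While Not tis.EOF
    Dim original As String = tis.ReadLine
    Dim translated As String = TranslateLine(original)
    // ConvertEncoding re-encodes the bytes; DefineEncoding would only relabel them.
    tos.WriteLine(ConvertEncoding(translated, enc))
  Wend

  tis.Close
  tos.Close
End Sub[/code]

This sketch does not copy a BOM or preserve the original line endings; those would need separate handling if the target program depends on them.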

Beatrix :
Thank you for your information. I looked up the StringUtils module and found GuessEncoding. However if the above procedure works, I am going to use that.

Norman :
As you suggested, I tried out IsValidData with the results shown above. Some, like UTF16BE, ISOLatin1 and others, were valid while others were invalid. As long as the encoding does not change between reading the original file and saving it back to another text file generated by Xojo, it is fine. Your blog post about encodings from January 2013 is also very interesting and helped me understand encodings better. Thank you for your nice work!

Wish you all a very nice day.

Greetings,

Chris

Tim_Hare · September 13, 2014, 5:33pm

Your procedure is not valid. Go with StringUtils.

Peter_Stys · September 13, 2014, 6:26pm

Begs the question: why doesn’t everyone just adopt UTF8 and be done with it? I still find gremlin characters in various places in my app no matter how hard I try. This encoding is a major pain.

Kem_Tekinay · September 13, 2014, 8:30pm

I have a comprehensive method for this in my M_String package too. Unfortunately, like any such method, it can only distinguish the multiple byte encodings.

http://www.mactechnologies.com/downloads

Norman_P · September 14, 2014, 5:25pm

working with encodings in some languages is far more painful than it is in Xojo (try things like Cobol, C, and a few others)
trying to get “everyone” to adopt any kind of standard is an endeavor that’s nearly futile (heck we can’t even agree on which side the steering wheel should be on world wide or which side of the road to drive on)
Or Metric
OS level differences in their support for strings in various forms - some prefer UTF-16 some prefer UTF-8
herding cats

Michel_Bujardet · September 14, 2014, 5:47pm

Thank you Norman.

Norman_P · September 14, 2014, 5:48pm

I always loved that commercial

Chris_Verberne · September 15, 2014, 10:56am

Hello Kem,

Thank you very much for sharing your code with us. You are one of those people who make a difference to uncountable others. I do appreciate you both professionally and as a person.

Regretfully, tomorrow my doctor will carry out an infiltration on my right hand, after which I cannot use that hand for a few days. I will implement your function when my hand is better.

Thank you again very much.

Wish you a very nice day and all the best.

Friendly greetings,

Chris