Encoding, Charsets.. I don't understand


First of all, I don’t have a clue what the difference is between charsets and encoding types. The more I read, the more confused I get. I understand that ‘encoding’ is more about how things are stored and ‘charsets’ are assignments of the different characters to their code points. But I can’t seem to find a clear answer on this.
Is UTF-8 an encoding or a charset?
If it’s an encoding type, can the same charset be represented by different encoding types?

But to be honest… if I find out, I wonder if that will solve my problem.

Anyway, here’s my struggle.
I’m getting international text files and need to display them properly. I don’t know in advance which encoding they’re in or what language they’re written in.

After trying to do things in Xojo (ruling out encoding types, reading the BOM, etc.) I got pretty much stuck.
I found a Go(lang) library based on CharDet (The Universal Character Encoding Detector) so I made a quick helper App. This will ‘guess’ the encoding and also returns a confidence score.
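As an aside, a BOM check only gets you so far, because most single-byte files carry no BOM at all. Here is a minimal sketch of BOM sniffing in Python (purely illustrative, not the Xojo or Go code from this thread):

```python
import codecs

# Map known byte-order marks to encoding names.
# Order matters: the UTF-32 LE BOM starts with the UTF-16 LE BOM,
# so the longer marks must be checked first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None.

    Most windows-125x / iso-8859-* files have no BOM, which is
    exactly why BOM sniffing alone gets stuck."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```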

I downloaded some subtitles (so I know which country the files originated from) and fed them to it.
I didn’t know the encoding upfront of any of those files.

To give an idea, here is the output:

#language, detected language, detected charset/encoding

Not great but at least it’s something.
But now I want to read them into Xojo.

When reading a text file in Xojo, I can define the encoding with something like this:

TextArea1.AppendText(stream.Read(255, Encodings.XYZ))

…where XYZ is the encoding type.

But the encoding names in Xojo don’t match. Xojo doesn’t have a ‘windows-1252’, ‘iso-8859-1’, ‘iso-8859-8-i’ etc.
Is this because these are something completely different? Or do I need to map them somehow?

Or are these what Xojo calls the ‘internetName’? If so, how can I define the internetName as the encoding/charset?

I hope someone can point me in the right direction.

Generally, a “charset” is an entire set of characters (A-Z, a-z, 0-9, punctuation, etc.) and an “encoding” is the scheme used to map those characters to bytes. In practice the terms are used more-or-less interchangeably in most contexts.
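To illustrate the distinction (a quick Python aside, not Xojo code): one sequence of characters, several byte-level encodings of it:

```python
text = "café"  # one sequence of Unicode characters (the "charset" side)

# ...and several different byte representations (the "encoding" side):
print(text.encode("utf-8"))         # b'caf\xc3\xa9'
print(text.encode("utf-16-le"))     # b'c\x00a\x00f\x00\xe9\x00'
print(text.encode("windows-1252"))  # b'caf\xe9'
```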

Different encodings might have several competing names. Names like “windows-1252” are the IANA-recommended format and correspond to the InternetName property of the TextEncoding class. You can loop through all supported TextEncodings to find the one you want:

Function LocateEncoding(InternetName As String) As TextEncoding
  // Walk every encoding Xojo knows about and match on its IANA name.
  For i As Integer = 0 To Encodings.Count - 1
    If Encodings.Item(i).internetName = InternetName Then
      Return Encodings.Item(i)
    End If
  Next
  // Returns Nil if no encoding matches the given name.
End Function

Encoding and “char set” are both referred to as “encoding” in Xojo. A single-byte encoding is what you’re describing as a char set. The UTF encodings all reference the Unicode character set. But while you can make a reasonable guess at the encoding of UTF-encoded files, determining the encoding of a single-byte character set is… well, I’d think it was impossible, but then, I haven’t examined every one either.
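A short Python illustration of why single-byte detection is inherently guesswork: the very same bytes decode successfully under several single-byte encodings, just to different text:

```python
data = b"caf\xe8"  # some bytes from an unknown single-byte file

# Both decodes succeed without error; only context (language knowledge,
# wordlists, letter frequencies) can tell you which reading was intended.
print(data.decode("windows-1252"))  # 'cafè' (Western European: è)
print(data.decode("windows-1250"))  # 'cafč' (Central European: č)
```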

Are you sure they aren’t encoded using one of the UTF encodings? If you post one or two of the files somewhere (if you can), perhaps some of us can help.

[quote=193665:@Marco Hof]#language, detected language, detected charset/encoding

For Czech at least, it cannot be CP-1252. The Windows ANSI character set for it is CP-1250 https://en.wikipedia.org/wiki/Windows-1250

What is the format of the files you are getting? Modern text formats such as Word or RTF preserve multilingual text properly. The best way to go is to load them into Word or OpenOffice and save them as RTF, which Xojo can read into a TextArea with StyledText. If what you are getting is .txt files, then you must know which platform generated them, and since there is little standardization, it will be difficult to guess.

Having produced foreign fonts back in the eighties, when Unicode was not here yet, I remember the several competing “standards”, in particular for Cyrillic and Eastern European languages. Documents entered in such character sets would require heavy transliteration today.


I think @Paul Lefebvre has scheduled an interesting subject for a webinar:
July 7 (1-2 PM ET): Xojo Framework: Text and Encodings

Thanks so much for the replies. I’m still trying to read and understand everything and I’m really looking forward to Paul’s webinar.

I gave up on the helper App. I can’t get it to work correctly.

This whole thing frustrates me because I know it can be done somehow. I just don’t have a clue how.
I know this because there are Apps that handle it well.

For example, on my Mac I use iFlicks for my videos. Besides converting my videos, it can also add subtitles.

For some reason, iFlicks handles the subtitle files perfectly ok. (Well, not 100% but 9 out of 10 times).

Here are some example subtitles that I’m trying to load into Xojo correctly.

example subs

Here is a screenshot of iFlicks. It loads in these subtitles fine. It recognizes the encoding and even the language correctly.

iFlicks doesn’t know the encoding or the language upfront.

I also tried to change the filenames to 1.srt, 2.srt etc. just to make sure iFlicks is not guessing anything from the filename. Everything is the same except the Bosnian subtitle. If the file is called 1.srt, it thinks the language is Croatian. So something must be going on there as well.

No luck so far but I’m now trying to find other programs that handle these files correctly as well. I hope to find something open source so I can see how it’s done.

I don’t have Windows but I heard that Notepad++ also does a great job recognizing the encoding.

The only thing I can think of right now is that iFlicks does some very smart guessing using some kind of library or word list that can guess the language. But even with that, in order to check these files against such a list, I still need to figure out the encoding first.
Or maybe not and I’m overlooking something really simple.
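That “smart guessing” idea can at least be sketched. Here is a deliberately crude version in Python (an assumed approach for illustration, not how iFlicks actually works): decode strictly under each candidate encoding and penalize decodings that produce improbable characters:

```python
# Candidate encodings to try; extend with whatever your files may use.
CANDIDATES = ["utf-8", "windows-1252", "windows-1250", "iso-8859-7",
              "windows-1255", "windows-1251"]

def guess_encoding(data: bytes):
    """Return (encoding, penalty) pairs for every candidate that decodes
    without error, best (lowest penalty) first.

    The scoring here is intentionally simple: penalize C1 control
    characters (U+0080..U+009F), which rarely occur in real text but
    appear when bytes 0x80-0x9F are decoded as iso-8859-*."""
    results = []
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)  # strict mode: raises on invalid bytes
        except UnicodeDecodeError:
            continue
        penalty = sum(1 for ch in text if 0x80 <= ord(ch) <= 0x9F)
        results.append((enc, penalty))
    return sorted(results, key=lambda r: r[1])
```

A real detector (like CharDet/ICU) adds byte-sequence statistics and per-language letter frequencies on top of this, which is where the confidence scores come from.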

ok those files HAVE a country code in the name
(see https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes for ISO 639-1 two letter codes)
Game.of.Thrones.S05E09.bs.srt << bs = bosnian
Game.of.Thrones.S05E09.el.srt << el = greek
Game.of.Thrones.S05E09.he.srt << he = hebrew
Game.of.Thrones.S05E09.ro.srt << ro = romanian
Game.of.Thrones.S05E09.ru.srt << ru = russian

That’s one hint.
The text in the files doesn’t identify the language.
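Exploiting that filename hint is straightforward; a small Python sketch (the code-to-name table below is just a sample, not the full ISO 639-1 list):

```python
import re

# Sample of ISO 639-1 two-letter codes; extend as needed.
ISO_639_1 = {"bs": "Bosnian", "el": "Greek", "he": "Hebrew",
             "ro": "Romanian", "ru": "Russian"}

def language_from_filename(name: str):
    """Look for a two-letter language code just before the extension,
    e.g. 'Show.S05E09.bs.srt' -> 'Bosnian'. Returns None if absent
    or unrecognized."""
    m = re.search(r"\.([a-z]{2})\.[^.]+$", name)
    if m:
        return ISO_639_1.get(m.group(1))
    return None
```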

I’d actually rename one and open it in iFlicks & see how well it does :stuck_out_tongue:

I did. As I wrote above, I renamed them to 1.srt, 2.srt etc. to take the ISO 639-1 language codes out.
(Not country codes. As in ‘ar’ is Arabic. Not Argentina.)

So when loading in the renamed ones, all was the same except the Bosnian language was now recognized as Croatian language. However, still the correct encoding.
All the other files’ languages and encodings were the same. With or without the full filenames.

Another interesting thing about iFlicks is that when I open the dropdown of the encoding or language field, it shows a list with a ton of other options, I guess in case iFlicks guessed wrong. Those are pretty impressive lists.

Here some screencaps: screencaps

I don’t know if this is an amazing job by the guy who made it, something simple I’m overlooking, or a nifty library that I cannot find. Either way, it shows that it can be done.

library - ICU :stuck_out_tongue:

On OS X it’s basically built in.

And there are some possible APIs I could see helping out, like https://developer.apple.com/library/prerelease/ios/documentation/Cocoa/Reference/NSOrthography_Class/

You could always ask the guy who wrote iFlicks how he did this.

Guessing encodings won’t ever be 100%. But this doesn’t help you if you don’t understand the basics.