First of all, I don’t have a clue what the difference is between charsets and encodings. The more I read, the more confused I get. As I understand it, an ‘encoding’ is about how characters are stored as bytes, while a ‘charset’ assigns each character a number (its ‘address’). But I can’t seem to find a clear answer on this.
Is UTF-8 an encoding or a charset?
If it’s an encoding type, can the same charsets be included in different encoding types?
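To make the question concrete, here is a minimal Go sketch (standard library only) that stores the same Unicode character as bytes in two different encodings. The helper names `utf8Bytes` and `utf16BEBytes` are made up for this illustration. It only shows that one character number can end up as different byte sequences; it is not meant as a full answer.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf8Bytes returns the UTF-8 bytes of s. Go strings are already UTF-8,
// so this is just a conversion to a byte slice.
func utf8Bytes(s string) []byte {
	return []byte(s)
}

// utf16BEBytes encodes s as UTF-16, big-endian, without a BOM.
func utf16BEBytes(s string) []byte {
	units := utf16.Encode([]rune(s))
	out := make([]byte, 0, len(units)*2)
	for _, u := range units {
		out = append(out, byte(u>>8), byte(u))
	}
	return out
}

func main() {
	s := "\u00e9" // é, code point U+00E9

	fmt.Printf("code point: U+%04X\n", []rune(s)[0])
	fmt.Printf("UTF-8:      % X\n", utf8Bytes(s))    // C3 A9
	fmt.Printf("UTF-16BE:   % X\n", utf16BEBytes(s)) // 00 E9
}
```

The character number (U+00E9) stays the same in both cases; only the stored bytes differ.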
But to be honest… if I find out, I wonder if that will solve my problem.
Anyways, here’s my struggle.
I’m getting international text files and need to display them properly. I don’t know upfront what encoding they’re in or what language they’re in.
After trying a few things in Xojo (ruling out encodings, reading the BOM, etc.), I got pretty much stuck.
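For reference, the BOM check mentioned above can be sketched like this in Go (the language of the helper app described below). The function name `detectBOM` is hypothetical. The byte sequences are the standard Unicode byte order marks. A file without a BOM, such as any of the ISO-8859 or Windows code page files, simply yields no result, which is why this check alone does not get very far.

```go
package main

import (
	"bytes"
	"fmt"
)

// bom pairs an encoding label with its byte order mark.
type bom struct {
	name string
	mark []byte
}

// Longer marks come first: the UTF-32LE BOM starts with the UTF-16LE BOM.
var boms = []bom{
	{"UTF-32LE", []byte{0xFF, 0xFE, 0x00, 0x00}},
	{"UTF-32BE", []byte{0x00, 0x00, 0xFE, 0xFF}},
	{"UTF-8", []byte{0xEF, 0xBB, 0xBF}},
	{"UTF-16LE", []byte{0xFF, 0xFE}},
	{"UTF-16BE", []byte{0xFE, 0xFF}},
}

// detectBOM returns the encoding label and the BOM length in bytes,
// or "" and 0 when the data does not start with a known BOM.
func detectBOM(data []byte) (string, int) {
	for _, b := range boms {
		if bytes.HasPrefix(data, b.mark) {
			return b.name, len(b.mark)
		}
	}
	return "", 0
}

func main() {
	sample := []byte{0xEF, 0xBB, 0xBF, 'h', 'i'}
	name, n := detectBOM(sample)
	fmt.Printf("encoding=%q bomLength=%d\n", name, n)
}
```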
I found a Go(lang) library based on CharDet (The Universal Character Encoding Detector), so I made a quick helper app. It ‘guesses’ the encoding and also returns a confidence score.
I downloaded some subtitles (so I know which country each file originated from) and fed them to it.
I didn’t know the encoding of any of those files upfront.
To give an idea, here is the output:
#language, detected language, detected charset/encoding
cz,cz,windows-1252
ru,none,UTF-8
nl,nl,ISO-8859-1
he,he,ISO-8859-8-I
fi,nl,ISO-8859-1
es,es,ISO-8859-1
ar,ar,windows-1256
ro,ro,ISO-8859-1
Not great but at least it’s something.
But now I want to read them into Xojo.
When reading a text file in Xojo, I can define the encoding with something like this:
…where XYZ is the encoding type.
But the encoding types in Xojo don’t match. Xojo doesn’t have a ‘windows-1252’, ‘iso-8859-1’, ‘iso-8859-8-I’ etc.
Is this because these are something completely different? Or do I need to map them somehow?
Or are these what Xojo calls the ‘internetName’? If so, how can I define the internetName as the encoding/charset?
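If mapping turns out to be the way to go, the helper app could do it with a small lookup table, sketched here in Go. The function `xojoEncodingFor` is hypothetical. The names on the right-hand side are my assumption of the matching members of Xojo’s Encodings module and have not been verified; they would need to be checked against the Xojo documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// xojoNames maps detector labels (lower-cased) to the ASSUMED names of the
// corresponding Xojo Encodings members. Verify each one before relying on it.
var xojoNames = map[string]string{
	"utf-8":        "UTF8",
	"windows-1252": "WindowsANSI",
	"iso-8859-1":   "ISOLatin1",
	"iso-8859-8-i": "ISOLatinHebrew",
	"windows-1256": "WindowsArabic",
}

// xojoEncodingFor looks up a detector label, ignoring case and surrounding
// whitespace. The second return value is false for unmapped labels.
func xojoEncodingFor(label string) (string, bool) {
	name, ok := xojoNames[strings.ToLower(strings.TrimSpace(label))]
	return name, ok
}

func main() {
	for _, label := range []string{"windows-1252", "ISO-8859-8-I", "KOI8-R"} {
		if name, ok := xojoEncodingFor(label); ok {
			fmt.Printf("%s -> Encodings.%s\n", label, name)
		} else {
			fmt.Printf("%s -> no mapping\n", label)
		}
	}
}
```

Only the labels that appear in the output above are covered; anything else is reported as unmapped rather than guessed.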
I hope someone can point me in the right direction.