Encoding questions

I’m getting a lot of text files from all over the world, in different languages and different encodings.
I detect the encoding from the BOM where I can. When no BOM is found, I run the data through ‘Encodings.xyz.IsValidData(s)’ for every possible encoding and present the user with a list of the valid ones so they can pick the right one.
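
To make the two steps above concrete, here is a rough sketch in Python rather than Xojo, so none of it is Xojo API. The names `detect_bom` and `valid_encodings` are made up for illustration, and the `CANDIDATES` list is only an assumed example, not a recommendation of which encodings to offer.

```python
import codecs

# BOMs are checked longest-first: the UTF-32 LE BOM starts with the same
# two bytes as the UTF-16 LE BOM, so the order matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Assumed example list; a real application would pick its own candidates.
CANDIDATES = ["utf-8", "cp1252", "iso-8859-1", "mac_roman", "cp1250", "shift_jis"]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def valid_encodings(data: bytes, candidates=CANDIDATES):
    """Return the candidates that decode the data without error, in list order."""
    result = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        result.append(name)
    return result
```

Note that single-byte encodings such as ISO Latin 1 accept any byte sequence, so passing the validity check only means the bytes can be decoded, not that the result is the intended text.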

I have a few questions:

  • If UTF-8 passes IsValidData (because the BOM wasn’t set), I present that as the default. If it isn’t UTF-8, what are the most common other encodings I can present as the default? (I don’t know the language of the text file up front.)

  • The Xojo list of possible encodings is pretty large. Are there encodings that will most likely never occur, so I can exclude them? Maybe because they’re very old or only used for very specific purposes?

  • Other tools have nicely formatted names. Is there an easy way to get names like ‘Western (ISO Latin 1)’, ‘Central European (Windows Latin 2)’ or ‘Western (Mac OS Roman)’? I can map them myself, but I’m not sure whether Latin 1 is always Western or Latin 5 is always Turkish. It would be nice if I could get that info from Xojo somehow.

  • Do I need to keep anything in mind for cross-platform use? Do these encoding types come from the system, or is everything handled by Xojo?

In one of my applications I had a similar situation. I solved it with an import dialog that shows the user the first 100 rows in a listbox. They can then choose the encoding (and other settings, such as the type of newline character(s)) from popup menus and see live whether they have chosen the correct encoding. It is similar to Excel’s CSV import dialog.
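
The logic behind such a preview might look roughly like the following. This is a Python sketch, not the actual Xojo code from that application; `preview_rows` and its parameters are hypothetical names. It assumes the dialog re-runs the function whenever a popup menu changes, and that undecodable bytes should show up as U+FFFD replacement characters rather than raise an error, so a wrong choice is visible immediately.

```python
def preview_rows(data: bytes, encoding: str, newline: str = "\n", limit: int = 100):
    """Decode the data with the chosen encoding and return the first rows.

    Invalid byte sequences become U+FFFD so the preview never fails;
    a wrong encoding simply looks wrong to the user.
    """
    text = data.decode(encoding, errors="replace")
    rows = text.split(newline)
    # A trailing newline produces one empty final row; drop it.
    if rows and rows[-1] == "":
        rows.pop()
    return rows[:limit]
```

For example, `preview_rows(raw, "iso-8859-1", "\r\n")` would give the rows to load into the listbox for that combination of settings, where `raw` stands for the file's bytes.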