This is the bane of encodings. It’s complicated enough as it is (mostly due to lack of discipline in most places, like getting UTF-8 text in pages that declare themselves as ISO-8859-1), but the worst part is that encodings come into play at levels you might not have thought about until it’s too late.
Typical example: You control the encoding of your text, you have properly defined the encoding of your table (and fields, which can have their own encoding), you properly remembered that defining the encoding doesn’t “convert” the encoding, so you make sure the encoding is right before you put it in.
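To make the “defining is not converting” point concrete, here is a minimal Python sketch (Python is only used for illustration, and the sample word is my own). Decoding the same bytes with a different label plays the role of “defining” an encoding, while decoding with the right label and re-encoding plays the role of “converting”.

```python
text = "café"

# The bytes as they would be stored when the text is encoded as UTF-8.
utf8_bytes = text.encode("utf-8")                # b'caf\xc3\xa9'

# "Defining" the encoding: same bytes, different label. Nothing is converted,
# so labelling UTF-8 bytes as ISO-8859-1 simply misreads them.
mislabelled = utf8_bytes.decode("iso-8859-1")    # 'cafÃ©'

# "Converting" the encoding: decode with the correct label, then re-encode.
# The bytes themselves change; the text they represent does not.
latin1_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-1")   # b'caf\xe9'

print(mislabelled)
print(latin1_bytes)
```

The label only tells the program how to read bytes that are already there, which is why the data has to be right before it goes in.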
Then you pull the data and it doesn’t make sense. And you start converting here, modifying there, and two days later you realize you forgot that the CLI client you used to load the data had its own encoding, the terminal you’re using has its own encoding, and the program you’re using to pull and show the data also has its own encoding. And ALL of them need to be aligned, since you can’t go back from an incorrect double-encoding (you can from a single one, in some cases).
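A rough Python sketch of why stacked layers hurt. The helpers `misread` and `repair` are hypothetical names of mine, and I’m assuming each misconfigured layer reads UTF-8 bytes as ISO-8859-1. Undoing one layer works when you know exactly what happened; with two layers the same fix only gets you halfway, and once any layer is lossy there is nothing left to recover.

```python
def misread(s: str) -> str:
    # Simulates one misconfigured layer: UTF-8 bytes read as ISO-8859-1.
    return s.encode("utf-8").decode("iso-8859-1")

def repair(s: str) -> str:
    # Undoes exactly one such layer (only valid if that is what happened).
    return s.encode("iso-8859-1").decode("utf-8")

original = "café"
once = misread(original)      # one wrong layer, e.g. the CLI client
twice = misread(once)         # a second wrong layer, e.g. the viewer

fixed_once = repair(once)     # the single misread is reversible here
partly_fixed = repair(twice)  # one layer of damage still remains

# A lossy layer (here: forcing ASCII with replacement) destroys information,
# so no later conversion can bring the accented character back.
lossy = original.encode("ascii", errors="replace").decode("ascii")

print(once, fixed_once, partly_fixed, lossy)
```

In real life you usually don’t know how many layers were involved or which encodings they assumed, which is what turns this into two days of guessing.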
I have this problem every single week, with teams that don’t understand why their manual tests of data loads work differently than automated tests.
Joel Spolsky’s text is always in my bookmarks, to send to people. It helps them understand that character encodings are not handled automatically almost anywhere, but are still fully the responsibility of the user/developer. Setting an encoding is, more often than not, a “reminder” for the user, developer and program of what the data is supposed to be.
http://www.joelonsoftware.com/articles/Unicode.html
Also: Currently Xojo is able to autodetect, with some level of accuracy, whether a text is in one encoding or another. I don’t know what engine is being used for this. I hoped it was the Mozilla Universal Charset Detector, but since Christian has a plugin for that, I’m guessing it isn’t:
http://www.monkeybreadsoftware.net/class-universalcharacterdetectionmbs.shtml
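To show why detection can only ever be “some level of accuracy”, here is a deliberately naive Python sketch. This is NOT what Xojo or the Mozilla detector does (I don’t know Xojo’s engine, and real detectors use statistical models); `guess_encoding` is a hypothetical helper that just checks for a UTF-8 BOM, then tries strict UTF-8, then falls back to ISO-8859-1.

```python
def guess_encoding(data: bytes) -> str:
    # Hypothetical heuristic, for illustration only.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"        # UTF-8 byte order mark present
    try:
        data.decode("utf-8")      # strict: raises on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"       # fallback: every byte is valid here

print(guess_encoding("café".encode("utf-8")))
print(guess_encoding("café".encode("iso-8859-1")))

# Ambiguous case: the ISO-8859-1 text "Ã©" has the same bytes as the
# UTF-8 text "é", so the guess says UTF-8 even if Latin-1 was intended.
print(guess_encoding("Ã©".encode("iso-8859-1")))
```

The last case is the whole problem in miniature: the bytes alone don’t say what they were meant to be, so any detector is guessing.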
The paper for the original charset detector is HILARIOUS in that its encoding is misunderstood by web browsers, so it shows the typical effects of incorrect charset detection. I don’t know if this is done on purpose, but it should be.
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
This is how it looks: