More fun with text encoding

I have written two programs:

The first accesses a web page and copies info from it into a text file. According to the page itself, the characters are in ISO-8859-1, which if I am correct is AKA Latin-1.

The second takes this text file as input and stores it in a RealDB database (at least that’s what they used to be called!). This is then used to create menus, and selections are made to display more information from the database.

After trying a few things, I managed to get the info displayed in the menus OK; I did this by storing the info as-is in the DB, then using the text converter example to convert from Latin-1 to default when inserting in the menus.

The problem I now have is that when a selection is made, I can’t find it in the database. I tried a reverse default-to-Latin-1 conversion, but it doesn’t help.

Can someone please point me in the right direction?

Conversion shouldn’t be necessary. But DefineEncoding is required when you read from a file or read from a database.
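To make the difference between defining and converting an encoding concrete, here is a minimal sketch in Python rather than Xojo. The analogy is an assumption: decoding bytes with a named encoding plays the role of DefineEncoding (it only says how to read the bytes), and re-encoding plays the role of a conversion (it changes the bytes). The sample bytes are made up.

```python
# Hypothetical raw bytes as they might arrive from a Latin-1 page: "café".
raw = b"caf\xe9"

# "Defining" the encoding: the bytes are untouched, we only state how to read them.
text = raw.decode("latin-1")

# "Converting" the encoding: the same text, but different bytes on disk.
utf8_bytes = text.encode("utf-8")

# Defining the wrong encoding does not fix anything; it just misreads the bytes.
misread = raw.decode("utf-8", errors="replace")

print(text, utf8_bytes, misread)
```

The point of the sketch is that a wrong definition never alters the data, it only makes the program misinterpret it.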

A little more info I forgot to add…

Entries are found in the database when they contain only plain English characters, just not when they contain non-English ones.

I’m a bit lost as to what I should be storing, and what I should be asking for when doing the DB read?

I did initially try a conversion to default before storing, but then couldn’t get them to display properly afterwards. I could of course have been doing it wrong!
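The pattern described here (plain entries found, accented ones not) is what you would expect if one side of the comparison holds Latin-1 bytes and the other holds UTF-8 bytes. A small Python sketch, with made-up labels, shows why only the non-English entries are affected:

```python
# Hypothetical menu labels; any accented text would behave the same way.
plain = "Coffee"
accented = "Café"

# ASCII-only text is byte-for-byte identical in both encodings ...
plain_same = plain.encode("latin-1") == plain.encode("utf-8")

# ... but accented text is not: é is one byte in Latin-1 and two in UTF-8.
latin1_bytes = accented.encode("latin-1")
utf8_bytes = accented.encode("utf-8")
accented_same = latin1_bytes == utf8_bytes

print(plain_same, accented_same)
```

Whether this is exactly what happened in the original project is an assumption, but it matches the symptom.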

Since you’re letting the database do comparisons, you do want to control the encoding. Start with the web page. When you pull the data from the page, and before you write it to the file, what is the encoding that is defined? Is it correct, or is it undefined? If it isn’t correct, then use DefineEncoding to set it. Once you get a correct encoding, convert it to UTF-8 to simplify all the rest of your actions.

When you read the data from the text file, open the TextInputStream and immediately set the input stream’s Encoding to UTF-8, before you read anything from it.

When you read the data from the database, use DefineEncoding on every string to tell Xojo the encoding is UTF-8.
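The three steps above can be sketched end to end. This is Python with an in-memory SQLite database, not Xojo, and the page content, file name and table layout are all hypothetical; it only illustrates the idea of fixing the encoding once at the source and using UTF-8 everywhere after that.

```python
import os
import sqlite3
import tempfile

# Hypothetical bytes as fetched from a page that declares ISO-8859-1.
page_bytes = "Crème brûlée\nCafé\n".encode("latin-1")

# Step 1: state the real encoding of the raw bytes, then write the file as UTF-8.
text = page_bytes.decode("latin-1")
path = os.path.join(tempfile.mkdtemp(), "items.txt")
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(text)

# Step 2: when reading the file back, state UTF-8 before reading anything.
with open(path, encoding="utf-8") as f:
    items = [line.rstrip("\n") for line in f if line.strip()]

# Step 3: store the strings, then look one up exactly as a menu selection would.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (name TEXT)")
db.executemany("INSERT INTO items (name) VALUES (?)", [(i,) for i in items])

def find(name):
    """Return the stored name if it exists, else None."""
    row = db.execute("SELECT name FROM items WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None

found = find("Café")
print(items, found)
```

Because every stage agrees on UTF-8, the lookup for an accented entry succeeds without any conversion at query time.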


Thanks, Tim - that makes sense. I’ll give it a go and report back.

It works now. But I’m still not 100% sure why…!

I converted the web page output to UTF-8.

Defined it (as input) as UTF-8 (just to make sure!) and stored it in the DB.

Converted it to Latin-1 before adding it to the menu.

Selecting it from the menu, it is now found in the DB regardless of whether I convert it back to UTF-8 or not. That’s the odd bit.

Still, I’m happy it works. Thanks, Tim.

Based on this:

I’d say the database encoding is probably set to UTF-8.

That’s because Xojo normalizes to UTF-8: you convert it to Latin-1, and Xojo converts it back to UTF-8 in the menu. You can skip that step.
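A rough Python analogy for why the extra conversion makes no difference (this is an illustration, not a description of Xojo's internals): a trip through Latin-1 and back leaves a Latin-1 string unchanged, so the lookup sees the same text either way. It also shows why skipping the step is safer, since characters outside Latin-1 cannot make the trip at all. The labels are made up.

```python
# Hypothetical menu label that fits in Latin-1.
label = "Café"

# Converting to Latin-1 and back again yields the identical string.
round_trip = label.encode("latin-1").decode("latin-1")
unchanged = round_trip == label

# A character outside Latin-1 (the euro sign) cannot be converted at all.
try:
    "€5 menu".encode("latin-1")
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False

print(unchanged, euro_ok)
```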

This is the bane of encodings. It’s complicated enough as it is (mostly due to lack of discipline in most places, like getting UTF-8 text in pages that define themselves as ISO-8859-1), but on top of that, encodings happen at levels you might not have thought about until it’s too late.

Typical example: You control the encoding of your text, you have properly defined the encoding of your table (and fields, which can have their own encoding), you properly remembered that defining the encoding doesn’t “convert” the encoding, so you make sure the encoding is right before you put it in.

Then you pull the data and it doesn’t make sense. And you start converting here, modifying there, and two days later you realize you forgot that the CLI client you used to load the data had its own encoding, the terminal you’re using has its own encoding, and the program you’re using to pull and show the data also has its own encoding. And ALL of them need to be aligned, since you can’t go back from an incorrect double-encoding (you can from a single one, in some cases).
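Here is a small Python sketch of what such misalignment looks like, using a made-up one-character sample. It shows a single wrong step that can be undone, the same wrong step applied twice, and a lossy step that cannot be undone. Undoing anything assumes you know exactly which wrong steps were taken and that none of them dropped bytes, which is rarely the case in practice.

```python
original = "é"

# One wrong step: UTF-8 bytes read as if they were Latin-1.
once = original.encode("utf-8").decode("latin-1")

# The same wrong step applied again makes the damage worse.
twice = once.encode("utf-8").decode("latin-1")

# A single wrong step can be reversed if you know exactly what it was.
repaired = once.encode("latin-1").decode("utf-8")

# A lossy step cannot: the invalid byte is replaced and the original is gone.
lossy = b"caf\xe9".decode("utf-8", errors="replace")

print(repr(once), repr(twice), repr(repaired), repr(lossy))
```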

I have this problem every single week, with teams that don’t understand why their manual tests of data loads work differently than automated tests.

Joel Spolsky’s text is always in my bookmarks, to send to people. It helps them understand that character encodings are not handled automatically almost anywhere, but are still fully the responsibility of the user/developer. More often than not, setting an encoding is a “reminder” for the user, developer and program of what the data is supposed to be.

Also: Currently Xojo is able to autodetect with some level of accuracy if a text is in one encoding or other. I don’t know what engine is being used for this. I hoped it was the Mozilla Universal Charset Detector but since Christian has a plugin for that, I’m guessing it isn’t:

The paper for the original charset detector is HILARIOUS in that its encoding is misunderstood by web browsers, so it shows the typical effects of incorrect charset detection. I don’t know if this is done on purpose but it should be :smiley:

This is how it looks: