Comments on "Do you still ASCII" blog post

Since the blog doesn’t seem to let me comment any more, I’ll add my comment here.

It’s about this post: http://blog.xojo.com/2017/02/17/do-you-still-ascii/

I think it’s confusing and misleading, in parts.

First off, a few clarifications are needed:

  1. Any literal string automatically gets the UTF-8 encoding. That part is important to keep in mind.

  2. If strings are concatenated, Xojo (or the underlying system functions) attempts to bring both into the same encoding before joining them. So, if someone concatenates a string with ASCII encoding and a string with UTF-8 encoding, both are converted to UTF-8 first, then joined. The only problem is strings without an encoding (i.e. their Encoding property is Nil), as that will lead to both parts, and thus the concatenated result, losing their encoding. One also needs to consider what happens when concatenating a UTF-8 string with a UTF-16 encoded string: the result may be UTF-8 or UTF-16, and which one is not defined by Xojo, I believe. So if you later want to pass the string on to something outside of Xojo’s code, e.g. write it to a file, you had better call “s.ConvertEncoding(Encodings.UTF8)” on it to make sure it’s in UTF-8 (see the sketch after this list).

  3. ASCII is special: ASCII only defines the first 128 character codes (0 through 127), and by design the first 128 code points of UTF-8 are identical to ASCII’s codes.

  4. The built-in Chr() function generates UTF-8 strings, based on Unicode code points.
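
To make these points concrete, here is a minimal Xojo sketch (variable names are mine, and the mixed-encoding behavior is as I understand it):

    // Point 1: a string literal is UTF-8 from the start.
    Dim a As String = "Grüße"                  // a.Encoding is Encodings.UTF8
    // Point 4: Chr() takes a Unicode code point and yields UTF-8.
    Dim b As String = Chr(252)                 // "ü"
    Dim ab As String = a + b                   // both UTF-8, so no conversion needed
    // Point 2: mixing UTF-8 and UTF-16 leaves the result's encoding undefined.
    Dim c As String = a.ConvertEncoding(Encodings.UTF16)
    Dim d As String = ab + c                   // may come out as UTF-8 or UTF-16
    d = d.ConvertEncoding(Encodings.UTF8)      // pin it down before passing it outside Xojo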

What we can take from this:

  1. Using Chr(34) + “sometext” + Chr(34) is no issue at all. There is no encoding ambiguity.

  2. Using any codes below 128 with the Chr() function is also safe. When using codes of 128 and above with Chr(), the advice in the blog post is sound.

  3. The real danger you have to be aware of is getting strings from any “external input”, such as the network or a file, where received strings often have NO encoding set; concatenating those with other strings will make them lose their encoding as well, leading to problems when trying to display the results. For instance, I had received Japanese text (which was UTF-8 encoded) from an HTTPSocket and tried to show it with a MessageDialog. That failed (I was seeing only “garbage”) until I explicitly set the encoding of the string to UTF-8 (see the sketch after this list).
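
Here is a minimal sketch of that fix, assuming the server really does send UTF-8 (the event signature is the classic-framework HTTPSocket one, as I recall it):

    // In an HTTPSocket subclass – the received bytes arrive with no encoding set.
    Sub PageReceived(url As String, httpStatus As Integer, headers As InternetHeaders, content As String)
      // The bytes were UTF-8 all along; DefineEncoding only labels them, it does not convert.
      Dim text As String = content.DefineEncoding(Encodings.UTF8)
      Dim d As New MessageDialog
      d.Message = text                 // now shows the Japanese text instead of garbage
      Call d.ShowModal
    End Sub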

Thank you for your comments, Thomas.

I think comments were not enabled, so I will ask my question here:

Quoting the post: “This one is subtly worse since Chr returns a valid ASCII character only for values <= 127 and returns the character that is that code point for values > 127. Since your string is actually UTF8, when you use chr(179) you get ³ (a superscript 3) and not the vertical bar you might have expected from the ASCII chart.”

Where, within ASCII’s 128 entries, can a value of 179 exist?

AFAIK, the ASCII table defines character values from 0 through 127.

The values from 128 to 255 are sometimes called “high ASCII”, and their representations are different for different operating systems and languages.

Oh, I do remember high-ASCII values from my former MS-DOS times in Pascal and Turbo-/PowerBASIC. They were needed for frames and window-like dialogs drawn with character codes. Values 178, 179 and 205 were often used for such borders and backgrounds.

As I wrote above, Chr(x) for x >= 128 uses Unicode code points. The Unicode character at code point 179 is ³. Just google “unicode 179”.
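
A quick sketch of the difference; if you actually want the old DOS box-drawing bar, ask for its Unicode code point instead (code page 437 put the bar at byte 179, which Unicode has at U+2502):

    Dim superThree As String = Chr(179)    // "³" – Unicode code point 179, stored as UTF-8
    Dim rawByte As String = ChrB(179)      // a single raw byte &hB3 with no (Nil) encoding
    Dim dosBar As String = Chr(&h2502)     // "│" – the bar that DOS code page 437 kept at 179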

I STILL run into text-encoding weirdness when loading things from files or getting them from network connections. It’s incredibly frustrating how the whole text-encoding scheme works. That’s not Xojo’s fault, just the overall planning and execution: UTF-8 was made compatible with ASCII for most things, so you never know there is a problem until it breaks completely.

And then there was that time when AppleScript on OS X switched to all-UTF-16 text but refused to compile if you had any Unicode whitespace or smart quotes in the code, which get added automatically if you’re using UTF-anything anywhere… And then there are those unfortunate users of diacriticals who want my dictionary hashes to actually match whether they use the accents or not :wink: And again, it works fine until suddenly it doesn’t, because the two systems are compatible up until the point they are not.

It’s absolutely vital to set your encoding on anything loaded from disk or coming down a socket or up a serial port before you start using it for anything, and doubly so if you’re getting values from system calls to parts of the system that decided, for unknown reasons, to be different from all the rest, like AppleScript. What I normally work with is a combination of binary and text data that comes over sockets or up serial ports, so I read it as ASCII and parse the packets as binary data, since they include binary length bytes, and then define the encoding on any strings after I pull them out, back to UTF-8 or whatever (a sketch follows below). I also do some double-checking of encoding before stuffing things into index dictionaries, because I still sometimes find that something from somewhere doesn’t hash the same as an identical-looking string did yesterday.
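
A minimal sketch of that pattern, assuming a hypothetical packet layout of a two-byte big-endian length followed by a UTF-8 payload (the layout and the function name are mine, not a real protocol):

    // Hypothetical layout: 2-byte big-endian length, then a UTF-8 payload.
    Function ExtractPayload(packet As String) As String
      // AscB/MidB work on raw bytes and ignore whatever encoding is set.
      Dim length As Integer = AscB(MidB(packet, 1, 1)) * 256 + AscB(MidB(packet, 2, 1))
      Dim payload As String = MidB(packet, 3, length)
      // The bytes were already UTF-8 on the wire; DefineEncoding just labels them.
      Return payload.DefineEncoding(Encodings.UTF8)
    End Function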

It’s been a long time since any of this has been a daily problem for me :wink: I solved most of the problems of conversion long ago in existing projects and now follow those best practices for new code, but I still harbor a hatred for text encoding that exceeds even my hatred for politicians, economics professors and doctors who want to be TV personalities.

Sounds like an xDev article in the making … :wink: