The other day I had an epiphany about Text and how it differs from the classic String, and thought I’d share. As this is based on my observations from the outside looking in, I’m sure the Xojo engineers will correct anything I get wrong.
History Of The World
Back in the day, using text was easy. Every character took exactly one byte, and every byte could only represent one character. You had 256 characters to work with, including control and other invisible characters, and that was that.
Mostly.
The first 128 characters were standardized as ASCII but the next 128 were up for grabs so they were implemented somewhat differently across platforms. Back then, if a Windows user sent an e-mail that included curly quotes to a Mac user, the latter would see funny characters instead because the Mac used different bytes to represent those quotes. That was just one problem with the system.
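The idea is language-independent, so here is a small Python sketch of that exact problem (Python's codec names `cp1252` and `mac_roman` stand in for the old Windows and Mac character sets):

```python
# The same bytes meant different characters on different platforms.
# 0x93/0x94 are curly quotes in Windows-1252, but accented letters in Mac Roman.
raw = b"\x93Hello\x94"                # curly-quoted text, as Windows bytes

on_windows = raw.decode("cp1252")     # “Hello” — what the sender wrote
on_mac = raw.decode("mac_roman")      # ìHelloî — what the Mac user saw

print(on_windows, "vs", on_mac)
```

Same eight bytes, two different readings: nothing in the data itself says which interpretation is right.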
Then one day someone thought, “Say, how will the world get along without emoticons? And, somewhat less importantly, the Chinese language?”, so they invented Unicode.
Think of Unicode as a giant chart of every character imaginable, from the simple English alphabet through characters you’ve never seen or contemplated, and each was assigned its own unique number, or “code point”, usually written in hex. This became a standard.
Now, if you wanted to represent the letter “L”, you could look up its Unicode code point as 004C, or the reversed paragraph mark “⁋” as 204B. This was true always, across platforms, through various operating systems, and everywhere. Life was better.
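You can check those two code points yourself; in Python, `ord` and `chr` translate between a character and its code point:

```python
# A character's code point is the same everywhere, on every platform.
assert ord("L") == 0x004C
assert ord("\u204b") == 0x204B   # "⁋", the reversed paragraph mark
assert chr(0x004C) == "L"        # and the mapping works both ways
```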
Encode That
But how to represent these code points in binary files, and in languages that were based on the concept of strings as a series of individual bytes? The answer was “encoding”. Several standards emerged such as UTF-8, UTF-16, and UTF-32, but they all had one goal: To take a series of Unicode code points that represent text and convert them to bytes that can be stored, then read later and converted back to text.
In short, what’s important is the code point of each character since that’s what determines what the text means. The encoding is just a way to get those code points into a file or memory so they can be converted back later, and do it without taking up too much space.
Think of encoding in the same way you would compression like Zip. The original data that matters (in this case, the series of code points) is transformed into something suitable for storage. Later, it can be deciphered back into the original data.
Knowing the encoding is just as important as knowing the compression algorithm used. If you zip a file, then try to decompress it as if it were a StuffIt file, you will get garbage back. Similarly, if you try to decipher bytes that were encoded with UTF-8 as if they were UTF-16, you will never get your original text back. You have to know the encoding first.
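A quick Python demonstration of that UTF-8-read-as-UTF-16 mistake (Python here is just a convenient stand-in; the same mismatch bites in any language):

```python
original = "Hi"
data = original.encode("utf-8")      # b"Hi" — two bytes

# Decoding those same bytes as UTF-16 "succeeds" but yields the wrong text:
wrong = data.decode("utf-16-le")     # one CJK character, not "Hi"
assert wrong != original

# Only decoding with the encoding that was actually used recovers the text:
assert data.decode("utf-8") == original
```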
In The Beginning
In the early days, a REALbasic String didn’t know anything about encodings. Every character was expected to take one byte, and each byte represented a character. In short, a String was like an immutable MemoryBlock with alternate methods.
As the world changed, so did the language. TextEncoding was introduced so that encoded Unicode could be properly read and represented within apps. They settled on UTF-8 encoding as the default, but any encoding could be accommodated. Internally, it was all converted to Unicode code points anyway.
This scheme fit into the existing language, but had one big drawback: It required that we, as developers, become somewhat familiar with encodings, what they meant, what the differences were, and how to deal with them. This was not ideal but acceptable.
On The Tenth Day
What’s important is that the encoding tells the system how to interpret the bytes of a string to display the text we are actually interested in. When dealing with characters that are meant for human consumption, we shouldn’t really have to care about how they are stored or manipulated internally, so with the new framework, Xojo gave us the Text type.
Whereas a String stores a series of bytes and knows (or is told) how to decipher them into characters, Text stores a series of code points. If you think of a String in terms of a fancy MemoryBlock, then think of Text as closer to an array of Integer. Each element of that “array” is a code point, and each code point represents a character.
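That bytes-versus-code-points distinction is easy to see in Python, where a `bytes` object plays the role of a String’s storage and a list of integers plays the role of Text (this is an analogy, not Xojo’s actual API):

```python
word = "café"

# A String is essentially bytes plus a known encoding:
as_bytes = word.encode("utf-8")         # 5 bytes — "é" takes two in UTF-8
# A Text is closer to an array of integers, one code point per character:
as_codepoints = [ord(c) for c in word]  # 4 code points, one per character

assert len(as_bytes) == 5
assert as_codepoints == [0x63, 0x61, 0x66, 0xE9]
```

Note how the byte count and the character count disagree; the code-point view always has exactly one element per character.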
This is why a Text does not have an encoding the way a String does. A Text does not hold bytes, it holds code points, so there is nothing to encode. A Text can be converted to a String easily because it just has to UTF-8 encode its code points to create that String. But going the other way, a String must have a proper encoding so the Text can extract its code points. If a String’s encoding is Nil, the Text would not know how to interpret its bytes.
This is also why a String can be assigned to a MemoryBlock easily (just copy the bytes of the String to the MemoryBlock), but a Text must go through TextEncoding. In order to convert to bytes, those code points must first be encoded, so you have to choose the encoding first. Code points encoded as UTF-8 will result in different bytes than those encoded as UTF-16.
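To see why the encoding has to be chosen first, here is the same single code point rendered two ways in Python (again an analogy for Xojo’s Text-through-TextEncoding conversion, not its API):

```python
codepoints = [0x20AC]                       # "€" as one code point
text = "".join(chr(cp) for cp in codepoints)

# The same code point produces different bytes under different encodings:
utf8_bytes = text.encode("utf-8")           # b'\xe2\x82\xac' — 3 bytes
utf16_bytes = text.encode("utf-16-be")      # b'\x20\xac'     — 2 bytes
assert utf8_bytes != utf16_bytes
```

Until you pick an encoding, there simply is no “the bytes” of a Text, which is why the conversion can’t be a plain copy the way String-to-MemoryBlock is.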
In short, the Text is merely a list of code points that represent our characters, and that’s what we are really after, right? Until we have to store or read back the text from some external source, we shouldn’t have to care about encoding, and that’s what Text does for us.
I hope this all makes sense and is helpful to somebody.