String vs. Text: An Explanation

The other day I had an epiphany about Text and how it differs from the classic String, and thought I’d share. As this is based on my observations from the outside looking in, I’m sure the Xojo engineers will correct anything I get wrong.

History Of The World

Back in the day, using text was easy. Every character took exactly one byte, and every byte could only represent one character. You had 256 characters to work with, including control and other invisible characters, and that was that.

Mostly.

The first 128 characters were standardized as ASCII but the next 128 were up for grabs so they were implemented somewhat differently across platforms. Back then, if a Windows user sent an e-mail that included curly quotes to a Mac user, the latter would see funny characters instead because the Mac used different bytes to represent those quotes. That was just one problem with the system.

Then one day someone thought, “Say, how will the world get along without emoticons? And, somewhat less importantly, the Chinese language?”, so they invented Unicode.

Think of Unicode as a giant chart of every character imaginable, from the simple English alphabet through characters you’ve never seen or contemplated, and each was assigned its own unique number, or “code point”, conventionally written in hex. This became a standard.

Now, if you wanted to represent the letter “L”, you could look up its Unicode code point as U+004C, or the paragraph mark “⁋” as U+204B. This was true always: across platforms, through various operating systems, everywhere. Life was better.
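
In Xojo terms you can play with these code points directly. A minimal sketch, assuming the classic framework’s TextEncoding.Chr and Asc work on Unicode code points for Unicode encodings (which is my understanding):

    ' Build characters from their code points, and read a code point back.
    Dim ell As String = Encodings.UTF8.Chr(&h004C)       ' "L"
    Dim pilcrow As String = Encodings.UTF8.Chr(&h204B)   ' the paragraph mark
    Dim codePoint As Integer = Asc(ell)                  ' 76, i.e. &h4C, on any platform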

Encode That

But how to represent these code points in binary files, and in languages that were based on the concept of strings as a series of individual bytes? The answer was “encoding”. Several standards emerged such as UTF-8, UTF-16, and UTF-32, but they all had one goal: To take a series of Unicode code points that represent text and convert them to bytes that can be stored, then read later and converted back to text.

In short, what’s important is the code point of each character since that’s what determines what the text means. The encoding is just a way to get those code points into a file or memory so they can be converted back later, and do it without taking up too much space.

Think of encoding in the same way you would compression like Zip. The original data that matters (in this case, the series of code points) is transformed into something suitable for storage. Later, it can be deciphered back into the original data.

Knowing the encoding is just as important as knowing the compression algorithm used. If you zip a file, then try to decompress it as if it were a StuffIt file, you will get garbage back. Similarly, if you try to decipher bytes that were encoded with UTF-8 as if they are UTF-16, you will never get your original text back. You have to know the encoding first.
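
To make the analogy concrete, here is a rough classic-framework sketch using ConvertEncoding, DefineEncoding, and EncodeHex (the hex shown is what I’d expect for a UTF-8 literal; treat it as illustrative):

    Dim s As String = "Déjà vu"                          ' Xojo string literals are UTF-8
    Dim utf16 As String = ConvertEncoding(s, Encodings.UTF16)

    ' Same code points, different bytes:
    '   EncodeHex(s)     -> "44C3A96AC3A0207675"  (9 bytes)
    '   EncodeHex(utf16) -> a different, longer sequence (UTF-16 uses 2+ bytes per code point)

    ' Lying about the encoding is like unzipping with the wrong tool:
    Dim garbage As String = DefineEncoding(s, Encodings.UTF16)
    ' garbage has the same bytes as s, but they are now interpreted as UTF-16,
    ' so it displays as nonsense.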

In The Beginning

In the early days, a REALbasic String didn’t know anything about encodings. All characters were expected to take one byte, and each byte represented a character. In short, they were like an immutable MemoryBlock with alternate methods.

As the world changed, so did the language. TextEncoding was introduced so that encoded Unicode could be properly read and represented within apps. They settled on UTF-8 encoding as the default, but any encoding could be accommodated. Internally, it was all converted to Unicode code points anyway.

This scheme fit into the existing language, but had one big drawback: It required that we, as developers, become somewhat familiar with encodings, what they meant, what the differences were, and how to deal with them. This was not ideal but acceptable.
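
For example, anything that arrived as raw bytes had to be tagged before use. A sketch (someSocket is hypothetical; the point is only the DefineEncoding call):

    ' Bytes from the outside world have no inherent meaning until we say
    ' what encoding they were written with:
    Dim raw As String = someSocket.ReadAll               ' hypothetical source of raw bytes
    raw = DefineEncoding(raw, Encodings.UTF8)            ' now the framework can interpret it

    ' If the sender actually used Windows Latin-1, we had to know that instead:
    ' raw = DefineEncoding(raw, Encodings.WindowsANSI)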

On The Tenth Day

What’s important is that the encoding tells the system how to interpret the bytes of a string to display the text we are actually interested in. When dealing with characters that are meant for human consumption, we shouldn’t really have to care about how they are stored or manipulated internally, so with the new framework, Xojo gave us the Text type.

Whereas a String stores a series of bytes and knows (or is told) how to decipher them into characters, Text stores a series of code points. If you think of a String in terms of a fancy MemoryBlock, then think of Text as closer to an array of Integer. Each element of that “array” is a code point, and each code point represents a character.
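
A minimal new-framework sketch of that mental model, assuming the Text.Codepoints and Text.Characters iterators behave the way I remember:

    Dim t As Text = "Héllo"
    Dim count As Integer

    ' Walk the "array": one code point per element.
    For Each cp As UInt32 In t.Codepoints
      count = count + 1            ' cp is 72, 233, 108, 108, 111 in turn
    Next

    ' Or walk the characters those code points represent:
    For Each ch As Text In t.Characters
      ' ch is "H", "é", "l", "l", "o" in turn
    Next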

This is why a Text does not have an encoding the way a String does. A Text does not hold bytes, it holds code points, so there is nothing to encode. A Text can be converted to a String easily because it just has to UTF-8 encode its code points to create that String. But going the other way, a String must have a proper encoding so the Text can extract its code points. If a String’s encoding is nil, there is no way to know how to interpret its bytes.
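
In code, both directions look something like this (a sketch; I’m assuming the implicit Text-to-String conversion and String.ToText work as I recall, and that ToText raises an exception when the encoding is unknown):

    Dim t As Text = "Smörgåsbord"          ' literals work: the compiler knows the code points

    ' Text -> String is always safe: the code points just get UTF-8 encoded.
    Dim s As String = t

    ' String -> Text needs a known encoding so the code points can be extracted.
    Dim t2 As Text = s.ToText              ' fine: s is tagged as UTF-8

    ' Strip the encoding to simulate mystery bytes:
    Dim mystery As String = DefineEncoding(s, Nil)
    ' mystery.ToText would raise an exception here, since there is no way
    ' to know how to interpret its bytes.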

This is also why a String can be assigned to a MemoryBlock easily (just copy the bytes of the String to the MemoryBlock), but a Text must go through TextEncoding. In order to convert to bytes, those code points must first be encoded, so you have to choose the encoding first. Code points encoded as UTF-8 will result in different bytes than those encoded as UTF-16.
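
In the new framework that looks roughly like this (assuming Xojo.Core.TextEncoding.ConvertTextToData, which I believe is the method that does this job):

    Using Xojo.Core

    Dim t As Text = "⁋ marks the spot"

    ' No bytes exist until we pick an encoding:
    Dim utf8Bytes As MemoryBlock = TextEncoding.UTF8.ConvertTextToData(t)
    Dim utf16Bytes As MemoryBlock = TextEncoding.UTF16.ConvertTextToData(t)

    ' Same code points in, different bytes (and byte counts) out:
    ' utf8Bytes.Size <> utf16Bytes.Size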

In short, the Text is merely a list of code points that represent our characters, and that’s what we are really after, right? Until we have to store or read back the text from some external source, we shouldn’t have to care about encoding, and that’s what Text does for us.

I hope this all makes sense and is helpful to somebody.

I don’t think so, and that’s the whole point of the Text type: to finally convert the string to pure code points so encoding doesn’t matter in quite the same way. String holds its data in encoded byte values, along with the name of the secret decoder ring necessary to interpret those values. One of the problems inherent in String is that the same “character” can be represented by more than one sequence of code points (and thus bytes), which means that two strings with the same visible characters could be unequal due to the actual byte values used to encode those characters. Text attempts to alleviate some of those problems.
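
Here’s a quick way to see that with classic Strings, using the combining grave accent as the example (a sketch; the hex is what I’d expect for UTF-8):

    Dim precomposed As String = Encodings.UTF8.Chr(&h00E0)        ' "à" as one code point
    Dim decomposed As String = "a" + Encodings.UTF8.Chr(&h0300)   ' "a" + combining grave accent

    ' Both display as "à", but the bytes differ:
    '   EncodeHex(precomposed) -> "C3A0"
    '   EncodeHex(decomposed)  -> "61CC80"
    ' so a byte-for-byte comparison can call them unequal even though a human
    ' would say they are the same character.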

Sorry, that doesn’t make sense to me. A byte is a byte is a byte. And text or string or whatever are just bytes. Encoding IS important, because it gives meaning to the bytes. But we still don’t know anything about the internal representation. “Converted to Unicode code points” means nothing.

The new framework won’t solve precomposed vs. decomposed because this is a feature of Unicode itself. I’ve had that fun myself. Thought I was going bonkers because the string comparison didn’t work anymore.

This part I think is wrong:

The docs say: “The Text type is used to store text with a defined encoding.” You cannot store Unicode code points in memory just like that. You must encode code points into a byte (or a sequence of bytes) and then store these bytes.

[quote=187819:@Beatrix Willius]Sorry, that doesn’t make sense to me. A byte is a byte is a byte. And text or string or whatever are just bytes. Encoding IS important, because it gives meaning to the bytes. But we still don’t know anything about the internal representation. “Converted to Unicode code points” means nothing.

The new framework won’t solve precomposed vs. decomposed because this is a feature of Unicode itself. I’ve had that fun myself. Thought I was going bonkers because the string comparison didn’t work anymore.[/quote]

As far as I can tell from early experiments, Text is a succession of code point values. Even if, ultimately, the underlying structure is made of bytes (like almost everything in a computer), the Text type itself knows no bytes.

You are perfectly correct, though: Text is no more capable of interpreting precomposed glyphs than UTF was. As it only records code points, it will just record a non-spacing “`” accent and then an “a” character, like a dumb machine, instead of seeing “à” like a decent human being. There was a discussion in this forum a while ago where some said that Text was capable of interpreting precomposed characters, which made me run experiments showing very clearly that it is not the case at all.

I think the main advantage of Text is that it makes it impossible to have blocks with no encoding, unlike String and its infamous lozenges.

Sure will.

[quote=187814:@Kem Tekinay]History Of The World

Back in the day, using text was easy. Every character took exactly one byte, and every byte could only represent one character. You had 256 characters to work with, including control and other invisible characters, and that was that.

Mostly.

The first 128 characters were standardized as ASCII but the next 128 were up for grabs so they were implemented somewhat differently across platforms. Back then, if a Windows user sent an e-mail that included curly quotes to a Mac user, the latter would see funny characters instead because the Mac used different bytes to represent those quotes. That was just one problem with the system.

Then one day someone thought, “Say, how will the world get along without emoticons? And, somewhat less importantly, the Chinese language?”, so they invented Unicode.

Think of Unicode as a giant chart of every character imaginable, from the simple English alphabet through characters you’ve never seen or contemplated, and each was assigned its own unique hex ID, or “code point”. This became a standard.

Now, if you wanted to represent the letter “L”, you could look up its Unicode code point as 004C, or the paragraph mark “⁋” as 204B. This was true always, across platforms, through various operating systems, and everywhere. Life was better.[/quote]

This is more or less all correct in concept, if not in fact.

[quote=187814:@Kem Tekinay]Encode That

But how to represent these code points in binary files, and in languages that were based on the concept of strings as a series of individual bytes? The answer was “encoding”. Several standards emerged such as UTF-8, UTF-16, and UTF-32, but they all had one goal: To take a series of Unicode code points that represent text and convert them to bytes that can be stored, then read later and converted back to text.

In short, what’s important is the code point of each character since that’s what determines what the text means. The encoding is just a way to get those code points into a file or memory so they can be converted back later, and do it without taking up too much space.

Think of encoding in the same way you would compression like Zip. The original data that matters (in this case, the series of code points) is transformed into something suitable for storage. Later, it can be deciphered back into the original data.

Knowing the encoding is just as important as knowing the compression algorithm used. If you zip a file, then try to decompress it as if it were a StuffIt file, you will get garbage back. Similarly, if you try to decipher bytes that were encoded with UTF-8 as if they are UTF-16, you will never get your original text back. You have to know the encoding first.[/quote]

This is also more or less correct.

[quote=187814:@Kem Tekinay]In The Beginning

In the early days, a REALbasic String didn’t know anything about encodings. All characters were expected to take one byte, and each byte represented a character. In short, they were like an immutable MemoryBlock with alternate methods.

As the world changed, so did the language. TextEncoding was introduced so that encoded Unicode could be properly read and represented within apps. They settled on UTF-8 encoding as the default, but any encoding could be accommodated. Internally, it was all converted to Unicode code points anyway.[/quote]

Internally it was not all converted to Unicode. The bytes of a string are stored internally exactly as they were given when the string was created, because the framework has to pass them back as-is. This means the framework has to know how to do operations in all of the supported encodings, or take a hit to convert to something else (while still holding on to a copy of the original bytes).

Definitely not ideal, and the source of many, many bugs. Also, the source of forum posts, emails, and blog posts on my part to explain this.

[quote=187814:@Kem Tekinay]On The Tenth Day

What’s important is that the encoding tells the system how to interpret the bytes of a string to display the text we are actually interested in. When dealing with characters that are meant for human consumption, we shouldn’t really have to care about how they are stored or manipulated internally, so with the new framework, Xojo gave us the Text type.

Whereas a String stores a series of bytes and knows (or is told) how to decipher them into characters, Text stores a series of code points. If you think of a String in terms of a fancy MemoryBlock, then think of Text as closer to an array of Integer. Each element of that “array” is a code point, and each code point represents a character.

This is why a Text does not have an encoding the way a String does. A Text does not hold bytes, it holds code points, so there is nothing to encode. A Text can be converted to a String easily because it just has to UTF-8 encode its code points to create that String. But going the other way, a String must have a proper encoding so the Text can extract its code points. If a String’s encoding is nil, there is no way to know how to interpret its bytes.

This is also why a String can be assigned to a MemoryBlock easily (just copy the bytes of the String to the MemoryBlock), but a Text must go through TextEncoding. In order to convert to bytes, those code points must first be encoded, so you have to choose the encoding first. Code points encoded as UTF-8 will result in different bytes than those encoded as UTF-16.

In short, the Text is merely a list of code points that represent our characters, and that’s what we are really after, right? Until we have to store or read back the text from some external source, we shouldn’t have to care about encoding, and that’s what Text does for us.[/quote]

Dead on.

Yes. It’s similar to how arrays ultimately have an underlying structure made of bytes, but people programming in Xojo don’t need to know how it’s laid out and the details can change from version to version.

Which platform did you test on? Text operates on grapheme clusters, so this should work (I think).

This definitely is one of its main advantages.

If I remember right it was in iOS. But that was quite a while ago. I just built a new test project in desktop, and indeed, “a” + “`”, which in a String is two separate code points, becomes the single character “à” when using String.ToText. I stand corrected.
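
For anyone who wants to repeat the test, this is roughly what I did (a sketch; I’m using the combining grave accent U+0300 for the “`” part):

    Dim composed As String = "a" + Encodings.UTF8.Chr(&h0300)   ' "a" + combining grave accent
    ' Len(composed) is 2: the String sees two code points.

    Dim t As Text = composed.ToText
    ' t.Length reports 1: Text counts characters (grapheme clusters),
    ' so the pair shows up as the single character "à".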

This is definitely a striking improvement over UTF-8 strings.

If the terms I used were less than ideal, please keep in mind that I wrote that at 1:30 AM. If it’s in the least coherent, I consider it a win. 🙂

Also, this post was meant to simplify a rather complex topic for those who are new to, or confused by, it. It’s not meant to address every nuance for those who are already well-versed, but rather provide analogies to other familiar structures as a mental reference point.

Finally, Michel and Joe covered some other points I was about to make, so I’ll leave that there.

There is a little-known backstory here. The first person to come up with the idea of a massive, indexed chart of characters was a pioneering woman of Hungarian-Chinese descent, Ugenia Ni. As her creation gained popularity, it was initially referenced as Ugenia Ni’s codes, and later abbreviated to U. Ni’s code. Through a series of happenstances, misprints, and lost translations, it finally came to be known universally as Unicode, its origin forever lost to history.

True story. And you know it’s true because you’ve read it here on the Interwebs, so no need to go look it up and embarrass yourselves. Or, you know, anyone else.

To add a bit more to this history, check out this video. I got the link from @Joe Ranieri's post at https://forum.xojo.com/conversation/post/193214 . The video is a great, short explanation of Unicode and how UTF-8 encoding works.

It was 1948 when Ugenia started to write her internal character representation proposal, which originally would have been known as INCHAR (INternal CHAracter Representation; coincidentally and interestingly, the word “inchar” means “swell” in Portuguese). Sadly, her studies were lost in 1949, when she was enhancing her work during a distracted walk on the Caribbean island of Montserrat and fell into the Soufrière Hills volcano with her papers… after being hit by the Santa Claus flying sled passing by. End of story.

But luckily, in the ’80s, some engineers at the Xerox labs started to talk about the need for this kind of beyond-7-bit character representation, and in 1980 even used proprietary encodings to represent hundreds of characters, including JIS (Japanese) characters, in their Star workstation. In 1986, Lee Collins at Xerox started to work with some other Xerox engineers trying to find relationships between JIS and Chinese characters for a new scheme. Later, in 1987, Mark Davis from Apple joined Joe Becker and Lee Collins from Xerox to start Unicode as a project. Becker coined the word “Unicode”, and the term made its first public appearance in his 1988 paper: http://www.unicode.org/history/unicode88.pdf