Warning: Split() broken for some UTF16 variants in 2018 1.1

Split() is broken for UTF16BE and UTF16LE (and perhaps others I haven’t tested):


There’s also <https://xojo.com/issue/52370> which is in the same general area.

Update: UTF32, UTF32BE, and UTF32LE are also affected. A handful of other encodings are also suspect (Shift-JIS and friends) but it’s hard to say unless you’re a top-drawer text encodings expert.

Have you compared using SplitB() instead of Split()?

No. I’m assuming SplitB works as expected because it’s a lower-level function that basically disregards any text encoding.

SplitB splits on bytes: it treats the string as a pure stream of bytes. One character (code point) takes several bytes, usually 1 to 4 depending on the encoding, so SplitB will just break it even more wrongly.
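To put numbers on the "1 to 4 bytes" point, here is a minimal sketch in Python (not Xojo, and not how Xojo implements anything) that counts the bytes one code point occupies in a few encodings. The sample characters and the helper name `byte_lengths` are my own choices for illustration.

```python
# Sample code points: ASCII, Latin-1 range, BMP symbol, and a Plane 1 emoji.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F697"]
ENCODINGS = ["utf-8", "utf-16-be", "utf-32-be"]


def byte_lengths(ch):
    """Return {encoding: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for ch in SAMPLES:
    # Print the code point in U+XXXX form followed by its byte counts.
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

Any split that works byte by byte will cut through every character that occupies more than one byte in the encoding at hand.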

The character (U+1F697) that was giving me a problem in my case is not, I now realise, in Unicode Plane 0. Rather, it is in Unicode Plane 1. The Text type is supposed to consist of characters as Unicode code points rather than bytes, but I wonder whether the Text type implements all 17 Unicode planes.
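For reference, a code point outside Plane 0 cannot fit in one 16-bit unit, so UTF-16 stores it as a surrogate pair. The sketch below (Python, with a hypothetical helper `utf16_surrogates`) applies the standard UTF-16 arithmetic to U+1F697 and compares the result with Python's own encoder. It says nothing about how the Text type handles this internally.

```python
def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above Plane 0 need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = utf16_surrogates(0x1F697)
print(hex(high), hex(low))

# Cross-check against the built-in UTF-16BE encoder.
print("\U0001F697".encode("utf-16-be").hex())
```

So this single character is two 16-bit code units (four bytes) in UTF-16, which is exactly the kind of input where code that assumes one unit per character goes wrong.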

@Eric Williams - was your case failing on certain characters only?

Nope. My test case consists of the UTF16BE string “Tst”. The Split() command gives you an array of 8 single-character strings; the correct behavior is an array of 4 strings.
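The 8-versus-4 count is what you would expect if the split were working on bytes instead of UTF-16 code units. A rough illustration in Python, using a hypothetical four-character string "Test" (not necessarily the string from my test case) and two made-up helpers:

```python
def split_by_byte(data):
    """Split a byte string into one piece per byte, ignoring the encoding."""
    return [data[i:i + 1] for i in range(len(data))]


def split_by_char(data, encoding):
    """Decode first, then split into one piece per character."""
    return list(data.decode(encoding))


# Hypothetical sample: four characters, two bytes each in UTF-16BE.
data = "Test".encode("utf-16-be")

print(len(split_by_byte(data)))            # one piece per byte
print(split_by_char(data, "utf-16-be"))    # one piece per character
```

This only reproduces the symptom. Whether Split() actually falls back to byte-wise splitting internally is a guess on my part.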

So this splits the string into characters using the empty ("") delimiter, giving you an array of characters, rather than splitting on a delimiter such as “|” or whitespace? In all my efforts I’ve never used that. I’ve always used a MemoryBlock, since I thought of Split as a text function, while I treat anything outside plain ASCII as binary and work on the byte stream in the MemoryBlock. That is a very old technique we used back in the ’80s for non-English text on the Amiga and Atari systems, because they really didn’t have a way to manipulate Unicode (there wasn’t a true definition until 1991-ish).

That explains why I’ve not run into this before - old “successful” habits die hard.

Here’s my case on text.Split failing to yield the correct results: