Warning: Split() broken for some UTF16 variants in 2018 1.1

Eric_Williams · June 26, 2018, 4:54pm

Split() is broken for UTF16BE and UTF16LE (and perhaps others I haven’t tested):

TimStreater · June 26, 2018, 6:42pm

There’s also <https://xojo.com/issue/52370> which is in the same general area.

Eric_Williams · June 26, 2018, 9:33pm

Update: UTF32, UTF32BE, and UTF32LE are also affected. A handful of other encodings are also suspect (Shift-JIS and friends) but it’s hard to say unless you’re a top-drawer text encodings expert.

Tim_Jones · June 26, 2018, 10:48pm

Have you compared using SplitB() instead of Split()?

Eric_Williams · June 26, 2018, 11:40pm

No. I’m assuming SplitB works as expected because it’s a lower-level function that basically disregards any text encoding.

Rick_Araujo · June 27, 2018, 1:57am

SplitB splits on bytes: it treats the string as a pure stream of bytes. A single character (code point) takes a different number of bytes depending on the encoding (usually 1 to 4), so SplitB will just break the text apart even more wrongly.
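To illustrate the "1 to 4 bytes" point, here is a minimal Python sketch (not Xojo). The sample characters, the helper names `byte_lengths` and `split_bytes`, and the choice of encodings are all assumptions picked for illustration. It only shows how many bytes one code point occupies in a few encodings and what a naive byte-wise split produces; it does not claim to reproduce how SplitB is implemented.

```python
# Illustrative only: sample characters and helper names are hypothetical.
ENCODINGS = ["utf-8", "utf-16-be", "utf-32-be"]
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F697"]  # A, e-acute, euro sign, U+1F697


def byte_lengths(ch):
    """Return the encoded size in bytes of one code point, per encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


def split_bytes(text, encoding):
    """Naive byte-wise split: one array element per encoded byte."""
    raw = text.encode(encoding)
    return [raw[i:i + 1] for i in range(len(raw))]


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}", byte_lengths(ch))
    # A single character becomes several meaningless one-byte fragments.
    print(split_bytes("\u20ac", "utf-8"))
```

None of the one-byte fragments from `split_bytes` is a valid character on its own once the code point needs more than one byte, which is the breakage described above.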

TimStreater · June 27, 2018, 8:50pm

The character (U+1F697) that was giving me a problem in my case is not, I now realise, in Unicode Plane 0. Rather, it is in Unicode Plane 1. The Text type is supposed to consist of characters as Unicode code points rather than bytes, but I wonder whether the Text type implements all 17 Unicode planes.
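For reference, a small Python sketch (not Xojo) of why a Plane 1 character is a special case in UTF-16: it cannot fit in one 16-bit code unit and is stored as a surrogate pair. The function names `plane` and `utf16_surrogates` are hypothetical helpers written for this illustration. The arithmetic is the standard UTF-16 surrogate calculation and says nothing about how the Text type handles it internally.

```python
def plane(code_point):
    """Unicode plane number (0 to 16) of a code point."""
    return code_point >> 16


def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a code point above U+FFFF."""
    if code_point < 0x10000:
        raise ValueError("code points in Plane 0 need no surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


if __name__ == "__main__":
    cp = 0x1F697
    high, low = utf16_surrogates(cp)
    print(f"U+{cp:X} is in plane {plane(cp)}")
    print(f"UTF-16 code units: {high:04X} {low:04X}")
```

Any code that assumes one 16-bit unit per character would see two "characters" here instead of one.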

@Eric Williams - was your case failing on certain characters only?

Eric_Williams · June 28, 2018, 8:52pm

Nope. My test case consists of the UTF16BE string “Tst”. The Split() command gives you an array of 8 single-character strings; the correct behavior is an array of 4 strings.
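The 8-versus-4 symptom is what you would expect to see if a UTF-16 string were split per byte instead of per character. Here is a hedged Python sketch (not Xojo) of that arithmetic. The string "Test" is only a hypothetical four-character stand-in, and whether this is what actually happens inside Split() is an assumption, not something confirmed in this thread.

```python
# Hypothetical stand-in for a four-character test string.
text = "Test"
raw = text.encode("utf-16-be")  # two bytes per character for Plane 0 text

# Splitting per byte yields twice as many pieces as there are characters.
per_byte = [raw[i:i + 1] for i in range(len(raw))]

# Splitting per 16-bit code unit recovers the characters.
# Note: this only works for Plane 0 text; surrogate pairs need both units.
per_unit = [raw[i:i + 2].decode("utf-16-be") for i in range(0, len(raw), 2)]

if __name__ == "__main__":
    print(len(per_byte), per_byte)
    print(len(per_unit), per_unit)
```

Half of the byte-wise pieces are zero bytes, which may show up as empty-looking single-character strings in a debugger.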

Tim_Jones · June 28, 2018, 10:39pm

So this is splitting on the empty ("") delimiter, so that you get an array of individual characters rather than a split on a delimiter such as “|” or whitespace? In all my efforts I’ve never used that. I’ve always used a MemoryBlock, since I thought of Split as a text thing, while I treat anything that isn’t plain ASCII as binary and work on the byte stream in the MemoryBlock. That was a very old mechanism that we used back in the 80’s for non-English text on the Amiga and Atari systems, because they really didn’t have a way to manipulate Unicode (there wasn’t a true definition until 1991-ish).

That explains why I’ve not run into this before - old “successful” habits die hard.

Chad_Posner · June 29, 2018, 2:58am

Here’s my case on text.Split failing to yield the correct results:
<https://xojo.com/issue/52512>