TextAreas, Emoji, UnicodeMode, and length counting

Apologies for the long post. This is basically information I wish I’d had when I first started seeing issues due to emoji in TextAreas. Hopefully it helps someone else.

Scenario: your app, which includes a user-editable TextArea as a central feature, is humming along. One day, a user adds an emoji or two to their text while editing. Suddenly, everything goes sideways. Syntax highlighting no longer highlights the right words. Find can’t properly select what it’s found. Select All won’t even select all the text anymore!

This happened to me, and odds are it’ll happen to you if your app does anything with TextArea and .CharacterPosition, .SelectionStart/Length, or .StyledText.

I think this boils down to a counting issue. (And bugs.) The fundamental challenge is that there are several legitimate ways to count the length of an emoji. A good overview of the issue is this piece by Henri Sivonen. I’ll use the same example emoji that he does: a man facepalming. This emoji is composed of 5 codepoints: 0x1f926, 0x1f3fc, 0x200d, 0x2642, 0xfe0f.

Let’s start with the string-like variable types.

dim t as Text = "🤦🏼‍♂️"
t.Length() // 1

dim s as String = "🤦🏼‍♂️"
s.Length() // 5
s.Bytes() // 17
s.Encoding.internetName // "UTF-8"

It looks like Text measures length in extended grapheme clusters (or what most people probably think of as “logical characters”). String measures length in UTF-32 code units, or what Xojo calls codepoints. (If you iterate on t.Codepoints, you’ll get 5 values back.)

There’s one other important measuring system used behind the scenes, at least on Mac (I haven’t done any testing on Windows yet). NSString, which is the basis for most of what the Mac does with text, measures length in UTF-16 code units. In the case of the sample emoji, that gives a length of 7.
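To see where the 7 comes from, here is a small sketch that builds the sample emoji one codepoint at a time. The per-codepoint unit counts in the comments are just the UTF-16 rule (anything above &hFFFF needs a surrogate pair, so it takes two code units), not measured output.

```xojo
// Build the sample emoji from its five codepoints.
Dim t As Text = ""
t = t + Text.FromUnicodeCodepoint(&h1F926) // face palm: above &hFFFF, 2 UTF-16 units
t = t + Text.FromUnicodeCodepoint(&h1F3FC) // skin tone modifier: above &hFFFF, 2 units
t = t + Text.FromUnicodeCodepoint(&h200D)  // zero width joiner: 1 unit
t = t + Text.FromUnicodeCodepoint(&h2642)  // male sign: 1 unit
t = t + Text.FromUnicodeCodepoint(&hFE0F)  // variation selector-16: 1 unit

// 2 + 2 + 1 + 1 + 1 = 7 UTF-16 code units
// 5 codepoints, 1 extended grapheme cluster
```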

If you want to count like NSString, this method works:

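Function countLikeNSString(t as Text) As Integer
  // Count UTF-16 code units, the way NSString measures length.
  dim n as Integer = 0

  for each code as UInt32 in t.Codepoints
    if code > &hFFFF then
      // Outside the Basic Multilingual Plane: needs a surrogate pair.
      n = n + 2
    else
      n = n + 1
    end if
  next

  return n
End Function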

Measures of length are also used when calculating string index values. String.IndexOf and Text.IndexOf may return different results for the same search pattern, if the found instance occurs after an emoji in the source. For example:

dim index as Integer
dim t as Text = "🤦🏼‍♂️foo"
dim s as String = "🤦🏼‍♂️foo"

index = t.IndexOf("foo") // 1
index = s.IndexOf("foo") // 5

That’s fun. Now let’s look at how TextAreas count length.

As of 2020r1, TextAreas have a .UnicodeMode property that is supposed to determine how they count string length. I think the introduction of .UnicodeMode is in part meant to help people that were already getting bitten by thinking the TextArea was counting logical characters when it was really counting something else. The options are:

  • Native (on Mac, same as Codepoints)
  • Characters
  • Codepoints

Counting string length impacts a lot of things within a TextArea. I would expect one consistent counting method to be used for:

  • .SelectionStart and .SelectionLength
  • .CharacterPosition
  • .StyledText offsets and lengths

Ok, so let’s do some testing and see how the different UnicodeModes behave. Native is basically the same as Codepoints, so I’m just going to focus on Characters and Codepoints.

Codepoints

The documentation says: “Each logical character is counted as the number of Unicode Codepoints (bytes) it requires. An emoji, for example, requires two bytes.”

The documentation’s being very imprecise here. First off, codepoints and bytes are not synonymous. Second, emoji are not fixed length. A simple smiley face emoji (0x1f600) may take up two UTF-16 code units (four bytes), but that’s really only one codepoint. Our example emoji from above takes 5 codepoints, and the number of bytes varies depending on how codepoints are being encoded.
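To make the "bytes depend on the encoding" point concrete, here is a rough sketch that re-encodes the sample emoji three ways. The values in the comments are arithmetic on the code unit counts rather than measured output, and I picked the explicit little-endian encodings on the assumption that this sidesteps any question of a byte-order mark being included.

```xojo
Dim s As String = "🤦🏼‍♂️"

// Same 5 codepoints, three encodings, three different byte counts.
// The expected values are arithmetic, not measured output.
Dim utf8Bytes As Integer = s.ConvertEncoding(Encodings.UTF8).Bytes      // 4+4+3+3+3 = 17
Dim utf16Bytes As Integer = s.ConvertEncoding(Encodings.UTF16LE).Bytes  // 7 units * 2 = 14
Dim utf32Bytes As Integer = s.ConvertEncoding(Encodings.UTF32LE).Bytes  // 5 units * 4 = 20
```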

.CharacterPosition, .SelectionStart, and .SelectionLength all return values consistent with how NSString counts. Select the sample emoji, and .SelectionLength will be 7.

.StyledText also seems to use the NSString counting system.

Select All will not actually select all if there are emoji in the text area. I assume this is a bug (62020) whose cause is that Select All is using a different text counting system.

Characters

The documentation says: “Each logical character is counted as a single character regardless of how many bytes are actually required.”

This sounds great! This suggests to me that using UnicodeModes.Characters will count length in the same way that the Text type counts length.

.CharacterPosition currently returns values consistent with how UnicodeModes.Codepoints measures length. I think this is a bug (61659).

.SelectionStart and .SelectionLength are returning values consistent with how String.Length measures text. Select the sample emoji, and .SelectionLength will be 5. Which is… fine? But it doesn’t match the documentation.

.StyledText offset values appear to be consistent with how Text.Length measures text. This does match the documentation.

Select All will not actually select all if there are emoji.

Summary

Based on what I’ve learned, UnicodeModes.Characters sounds great on paper, but is currently horribly inconsistent and uses at least three different counting systems, depending on which API methods you’re using.

UnicodeModes.Codepoints is internally consistent, which is great. I expect most people are using this mode, largely because it’s the legacy mode (well, Native is the legacy mode, but on Mac that is the same as Codepoints).

The main problem I have with Codepoints is that its NSString-based counting system doesn’t line up with either String or Text’s counting system. This impedance mismatch leads to offset issues.

For example, I think most people would expect one or more of the following to be true:

TextArea.SelectionLength == TextArea.SelectedText.Length()
TextArea.SelectionLength == TextArea.SelectedText.ToText().Length()

If the selection is plain ASCII text, both are true. If the sample emoji is selected, neither is true.
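A quick way to check this in your own project is a small diagnostic helper along these lines. It's only a sketch: LogSelectionCounts is a made-up name, and it assumes a desktop TextArea with the API 2.0 property names used in this post.

```xojo
// Hypothetical diagnostic helper: logs the three competing length
// measurements for whatever is currently selected in a TextArea.
Sub LogSelectionCounts(ta As TextArea)
  Dim selected As String = ta.SelectedText

  System.DebugLog("SelectionLength: " + Str(ta.SelectionLength))
  System.DebugLog("String length:   " + Str(selected.Length))
  System.DebugLog("Text length:     " + Str(selected.ToText.Length))
End Sub
```

Going by the measurements above, with the sample emoji selected in Codepoints mode on Mac I'd expect this to log 7, 5, and 1. With a plain ASCII selection all three values should match.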

This is a major issue because a lot of code operating on TextAreas will grab TextArea.Value, stuff it in a String/Text variable, search the variable for various things, and then try to apply the resulting String/Text-based offset values back to the TextArea via .SelectionStart/Length or .StyledText. That breaks down when the TextArea is using a different counting system from String and Text.

If you use Codepoints mode, you will need to explicitly convert String- and Text-based lengths and offsets to NSString-based values.
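Here is a minimal sketch of what that conversion could look like. CodepointOffsetToUTF16 is a hypothetical helper, not something Xojo provides, and it assumes String offsets are measured in codepoints, as in the String examples earlier in this post.

```xojo
// Hypothetical helper: converts a String-based (codepoint) offset into
// the UTF-16 code unit offset that NSString uses.
Function CodepointOffsetToUTF16(s As String, codepointOffset As Integer) As Integer
  Dim units As Integer = 0
  Dim seen As Integer = 0

  For Each code As UInt32 In s.ToText.Codepoints
    // Stop once we have walked past the requested number of codepoints.
    If seen >= codepointOffset Then Exit

    If code > &hFFFF Then
      units = units + 2 // surrogate pair
    Else
      units = units + 1
    End If
    seen = seen + 1
  Next

  Return units
End Function
```

Using it to select a search result might look like this (TextArea1 is a placeholder control name, and the values in the comments follow from the counts earlier in the post rather than from a test run):

```xojo
Dim s As String = "🤦🏼‍♂️foo"
Dim found As Integer = s.IndexOf("foo")                         // 5, per the earlier example
Dim startUnits As Integer = CodepointOffsetToUTF16(s, found)    // 2+2+1+1+1 = 7
Dim endUnits As Integer = CodepointOffsetToUTF16(s, found + 3)  // 7 + 3 = 10

TextArea1.SelectionStart = startUnits
TextArea1.SelectionLength = endUnits - startUnits               // 3
```

Note that each call walks the string from the beginning, so converting many offsets this way gets expensive on long text.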

Hopefully UnicodeMode is going through some birthing pains and these issues get resolved. Here’s what I’d like to see.

  • Internal consistency. A mode should use the same counting system everywhere.
  • Better documentation.
  • Performance, which probably means aligning with how the native OS counts text.
  • Alignment with how either String or Text counts, to avoid impedance mismatches.

Performance and alignment may involve tradeoffs. I think I lean towards alignment, but I’m concerned that might introduce hidden and unavoidable O(N*N) conversions from one counting system to another. Perhaps a solution would be for Xojo to provide a separate set of utility methods to help when converting from one counting system to another.
