Bizarre .selStart issue in TextArea using RTL language

Jonathan_Ashwell · June 27, 2013, 5:03pm

In the TextChange event of a TextArea I have written an autocomplete function that requires me to change the location of me.selStart multiple times. It works well. But I’ve just run into the strangest problem where, if I use Hebrew -QWERTY as the input language, I am unable to assign the value of me.selStart to 2. I can assign 0, 1, 3, etc., but not 2. I cannot reproduce this in a simple project so I can’t file a bug report. But I wonder if there is some known combination of factors that prevents one from assigning a particular .selStart value for text in an RTL language? (I tried changing .selLength to different values, but that had no effect. And the text is 5 characters, 9 bytes, in the example I’ve been using, so me.selStart of 2 should cause no problems. And in English there are no issues).

Jonathan_Ashwell · June 27, 2013, 5:12pm

I should have noted that this is a Cocoa project, and the same code worked in Carbon in RS2012R2.1.

Joe_Ranieri · June 28, 2013, 11:00am

What are the bytes? and what is the encoding?

Jonathan_Ashwell · June 28, 2013, 1:32pm

It’s UTF-8-encoded Hebrew. Interesting! – Xojo tells me it’s 5 characters, 9 bytes. But it’s really 4 characters AFAICT, a Return followed by three Hebrew characters. Here are the bytes.

0AD7 90D6 B4D7 92D7 A2

I can upload screen snaps if that would help (but not sure how, except as a link).

J_Andrew_Lipscomb · June 28, 2013, 2:41pm

אִגע is the string…
The first two Hebrew codepoints are, visually speaking, one character. D790 is U+05D0, aleph; D6B4 is U+05B4, hiriq (the dot underneath). Where it’s not letting you set the selStart is between the two components. (You can get the same effect in French or Spanish if you represent as U+0045 U+0301 (E, combining acute) instead of the more usual U+00C9.)
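To make the byte-to-codepoint mapping concrete, here is a minimal sketch in Python rather than Xojo (Python's standard `unicodedata` module is used purely for illustration). It decodes the nine bytes posted above and lists each code point with its name and whether it is a combining mark. The helper name `describe` is made up for this example.

```python
import unicodedata

# The nine bytes posted earlier in the thread, decoded as UTF-8.
raw = bytes.fromhex("0AD790D6B4D792D7A2")
text = raw.decode("utf-8")

def describe(s):
    """Return (code point, name, is_combining) for each code point in s."""
    rows = []
    for ch in s:
        # Control characters such as LF have no Unicode name, so supply a default.
        name = unicodedata.name(ch, "<unnamed control>")
        # A non-zero canonical combining class marks a combining character.
        rows.append((f"U+{ord(ch):04X}", name, unicodedata.combining(ch) != 0))
    return rows

rows = describe(text)
for row in rows:
    print(row)

# The Latin analogue: E + combining acute composes to one code point under NFC.
decomposed = "\u0045\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
```

The decoded string has five code points in nine bytes, and only U+05B4 is flagged as combining. Unlike the Latin example, aleph plus hiriq has no precomposed form, so normalizing would not reduce the count to four.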

Jonathan_Ashwell · June 28, 2013, 2:52pm

Thanks very much! So it seems Xojo is incorrectly interpreting them as 2 characters instead of one – that’s why it says there are 5 characters instead of 4. That didn’t happen in Carbon. Joe, would you like me to file a bug report?

Joe_Ranieri · June 28, 2013, 4:55pm

It’s not incorrect. It’s a matter of Unicode codepoints (which is what our strings work in) and glyphs (what’s drawn). The relationship between the two is many-to-many, so there will be cases like this.

Jonathan_Ashwell · June 28, 2013, 5:09pm

Not to quibble, but Xojo thinks there are 5 “characters” and there are really 4. And (you must be so sick of hearing this) it worked in Carbon. I understand, however, if this is just too difficult to fix (or if it’s not fixable with the Cocoa APIs) that it won’t be pursued.

Joe_Ranieri · June 28, 2013, 5:15pm

It’s intended behavior though. Codepoints and glyphs are fundamentally different concepts and I don’t think attempting to gloss over that is the right thing to do. The reason it worked under Carbon was that we had a completely custom text field/area (so we tracked the selection ourselves and could report whatever we wanted).

J_Andrew_Lipscomb · June 28, 2013, 5:36pm

I just tested it on a Windows box, and it will allow you to select the two parts separately in code (although not with the arrow keys–those treat the aleph-hiriq as one character).

“Characters and glyphs are fundamentally different concepts and I don’t think attempting to gloss over that is the right thing to do.”

Agreed, although I would say that characters are an ambiguous concept, which overlaps both glyphs and codepoints.

On the other hand if we’re supposed to be able to cope with this in our code, we need some functions that will allow us to find things like combining characters, or given a code point find out if it’s combining…

At the very least, the docs need to clearly explain what’s going on in this context. (I’d be quite willing to help.)
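As a rough idea of what such helper functions could look like, here is a sketch in Python (not Xojo). All three function names are hypothetical, and treating general categories Mn, Mc and Me as "combining" is only an approximation of the full Unicode segmentation rules. It says nothing about what the Cocoa text system itself does with a rejected offset.

```python
import unicodedata

def is_combining(ch):
    """True if ch is a combining mark (general category Mn, Mc, or Me)."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def splits_cluster(text, index):
    """True if a caret at code point offset `index` sits just before a
    combining mark, i.e. between a base character and its mark."""
    return 0 < index < len(text) and is_combining(text[index])

def snap_forward(text, index):
    """Advance `index` past any combining marks so it lands on a boundary."""
    while splits_cluster(text, index):
        index += 1
    return index

# The string from this thread: Return, aleph, hiriq, gimel, ayin.
sample = "\n\u05D0\u05B4\u05D2\u05E2"
for offset in range(len(sample) + 1):
    print(offset, splits_cluster(sample, offset), snap_forward(sample, offset))
```

With this string, offset 2 is the only one reported as splitting a cluster, which matches the one selStart value that could not be assigned. Code that computes selection offsets could run them through something like `snap_forward` before assigning them.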

Joe_Ranieri · June 28, 2013, 6:35pm

You’re right, ‘characters’ was a poor choice of words on my part. Like I mentioned before, our strings work on Unicode codepoints.

There are a few different things:
a.) Unicode codepoints. These certainly should not be called a character and that was, as I mentioned, a poor choice of words on my part.
b.) Grapheme clusters are more or less a complete ‘character’. They can be one or more Unicode codepoints that combine to be more or less a logical ‘character’. This is something like U+D835 U+DC1E being “MATHEMATICAL BOLD SMALL E”.
c.) Glyphs are what’s actually rendered on screen. These are from the font, or fallback font, and are chosen at the whim of the platform’s text rendering system. The text system also decides where to put the insertion point and reports where the selection is.

It’s not entirely trivial, though it usually works how people expect (as long as they’re dealing with English text).
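The different levels can be counted side by side. This is a sketch in Python (not Xojo), and the function name `counts` is invented for the example. The cluster count is only an approximation: it counts every code point that is not a combining mark, which is not the full Unicode grapheme cluster algorithm. One detail worth noting: U+D835 U+DC1E are UTF-16 surrogate code units that together encode the single code point U+1D41E, so that example adds a fourth level (code units) rather than showing a multi-codepoint cluster.

```python
import unicodedata

def counts(s):
    """Return the size of s at several levels of the text stack."""
    # Approximation: one cluster per code point that is not a combining mark.
    clusters = sum(
        1 for ch in s if unicodedata.category(ch) not in ("Mn", "Mc", "Me")
    )
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),
        "approx_clusters": clusters,
    }

# The string from this thread: Return, aleph, hiriq, gimel, ayin.
hebrew = "\n\u05D0\u05B4\u05D2\u05E2"
print(counts(hebrew))

# The surrogate pair D835 DC1E (little-endian bytes) decodes to one code point.
bold_e = bytes.fromhex("35D81EDC").decode("utf-16-le")
print(unicodedata.name(bold_e), counts(bold_e))
```

For the Hebrew string this gives 9 bytes, 5 code points and 4 approximate clusters, which are the numbers quoted in this thread (5 "characters" reported versus 4 expected).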