DrawString and Double-Byte Characters

Denise_Adams · August 31, 2014, 4:48pm

Hi. I’m having difficulty dealing with double-byte characters such as “הַ” which is composed of two separate characters, four bytes in all.

If I use Asc("הַ") it returns 1492 as the UTF8 code point but if I then do g.drawstring Chr(1492) it just draws “ה” without the segment below it…which is obviously the second half of the character combination with code point 1463.

So how can I:

a) Store the correct full UTF8 character code as a single number or do I have to split it up into its two separate parts of 1492 and 1463?
b) use g.drawstring to draw the complete combined entity either from a single code value or from two separate code values?

This is driving me crazy!

Thanks.

Michel_Bujardet · August 31, 2014, 5:28pm

[quote=126013:@Denise Adams]a) Store the correct full UTF8 character code as a single number or do I have to split it up into its two separate parts of 1492 and 1463?
b) use g.drawstring to draw the complete combined entity either from a single code value or from two separate code values?[/quote]

You have two characters : the letter and the non-advancing vowel. You should store and use both to see them. The same applies to a lot of accented scripts. See http://www.unicode.org/charts/PDF/U1DC0.pdf

Beatrix_Willius · August 31, 2014, 5:41pm

Asc is meant to be used on ASCII characters, which your example isn’t.

Unicode characters basically have two forms: precomposed and decomposed. Decomposed is what you have: a base character followed by a combining accent (a + `). Precomposed is character and accent in one code point (à). With the MBS plugin or declares you can convert between one form and the other. You should choose one form and normalize your text.
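
As a rough illustration of the two forms, here is a small sketch in Python rather than Xojo, using the standard unicodedata module only to show the Unicode behavior. The helper name code_points is made up for this example.

```python
import unicodedata

def code_points(text):
    # Hypothetical helper: list the Unicode code point of every character.
    return [ord(ch) for ch in text]

precomposed = "\u00e0"   # "à" stored as a single code point
decomposed = "a\u0300"   # "a" followed by COMBINING GRAVE ACCENT

# NFC composes where a single-code-point form exists; NFD splits it apart.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(code_points(nfc))  # [224]
print(code_points(nfd))  # [97, 768]

# Hebrew He + Patah has no single-code-point form, so NFC leaves two values.
he_patah = unicodedata.normalize("NFC", "\u05d4\u05b7")
print(code_points(he_patah))  # [1492, 1463]
```

Whichever form is chosen, normalizing all text to it first means the stored numbers are comparable.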

Michel_Bujardet · August 31, 2014, 6:19pm

It is strange. Maybe an effect of the forum. I just went to unicode.org and fetched the Canadian Syllabics page at http://www.unicode.org/charts/PDF/U1400.pdf

In your original post, 1492 is Hebrew He (U+05D4) and 1463 is Hebrew Patah (U+05B7), since both values are decimal. They are not at all like CWE and TWAA.

Using ASC is giving you perfectly inadequate values. You should stay with Unicode.

Denise_Adams · August 31, 2014, 6:24pm

Thanks for your replies. Okay, so how can I determine if a single typed key (string value) is precomposed or decomposed and how many characters it is made up of? And assuming I can do that and store the UTF8 code point for each of its parts, how do I then combine them for drawstring()? To be clear: I would like to store the integer values (one or many) for each “single string object” and not the string.

Michel_Bujardet · August 31, 2014, 6:28pm

If you stay with the string values you are safe and benefit from UTF8 support. Why do you absolutely want to go through the integer values ?

Denise_Adams · August 31, 2014, 6:35pm

It’s too complicated to explain but I need to store them as values. Can I somehow place each part into a memory block and retrieve the string value that way?

Michel_Bujardet · August 31, 2014, 6:37pm

That would be the way. Asc should work on bytes when you get the unencoded string back from the MemoryBlock.

But you could also store the numeric value of the code points without having to bother about byte values.
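
To make the difference between byte values and code points concrete, here is a minimal sketch in Python (not Xojo). It assumes the text is encoded as UTF-8, as discussed in this thread, and the variable names are illustrative only.

```python
text = "\u05d4\u05b7"  # Hebrew He followed by Patah

# Byte values: what a raw memory block of the UTF-8 string would hold.
utf8_bytes = list(text.encode("utf-8"))
print(utf8_bytes)  # [215, 148, 214, 183]

# Code points: one number per character, independent of the byte encoding.
points = [ord(ch) for ch in text]
print(points)  # [1492, 1463]

# The bytes only become text again once they are decoded as UTF-8.
restored = bytes(utf8_bytes).decode("utf-8")
print(restored == text)  # True
```

Storing code points needs two integers here, while storing bytes needs four and requires remembering the encoding.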

DaveS · August 31, 2014, 6:41pm

Is this what you are looking for?

msgbox Encodings.UTF8.Chr(1492)

Denise_Adams · August 31, 2014, 6:43pm

Hi Dave. Unfortunately that only handles the first code point. If I have two code points 1492 and 1463 (which are the two decomposed parts of the original character) how can I combine them for the drawstring call?

Michel_Bujardet · August 31, 2014, 6:45pm

Encodings.UTF8.Chr(1492)+Encodings.UTF8.Chr(1463)
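
The same idea generalizes to any number of stored values. Below is a sketch in Python (not Xojo) of the round trip between a string and a list of integers. The function names to_code_points and from_code_points are hypothetical.

```python
def to_code_points(text):
    # Hypothetical helper: turn a string into integers that can be stored.
    return [ord(ch) for ch in text]

def from_code_points(values):
    # Hypothetical helper: rebuild the string by concatenating each character.
    return "".join(chr(v) for v in values)

combined = from_code_points([1492, 1463])
print(len(combined))             # 2
print(to_code_points(combined))  # [1492, 1463]
```

The rebuilt string still contains two characters. It is the text renderer that draws them as one combined shape, so the whole string should be passed to the drawing call at once.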

Denise_Adams · August 31, 2014, 6:50pm

Ahhh! That’s simple. Thanks! Finally, is there an easy/optimal way to determine how many parts a precomposed unicode character is made up of? So I can then split it into separate code points?

Michel_Bujardet · August 31, 2014, 6:55pm

A decomposed character will always appear in a string as two characters: the base and the combining mark. It looks like one to human eyes only. The tricky part is that when the two components are next to each other, the second one is non-advancing. But if you get stringwidth for the second one by itself, it will report a width. So to get an exact stringwidth, you have to use the two characters in the same string.

From what I see in the Hebrew Unicode chart, He+Patah does not exist in single character form, unlike Roman accented characters, which come in both composite and single character forms (as Beatrix explained).

Denise_Adams · August 31, 2014, 7:23pm

I see. So if a user of a double-byte system typed a letter in the KeyDown event of a canvas or textarea would the key string value (if made up of two parts) always be two characters long? I’ve experimented using the Hebrew keyboard and it seems that for this language and keyboard layout you have to first type the base letter and then use shift+another key to add the vowel/diacritic mark etc. But is this always the case? Is it possible for a user to input a precomposed character combination in a single keypress?

Michel_Bujardet · August 31, 2014, 7:41pm

As far as I know, there is no way to enter He+Patah in one key press, like one would be able to under certain conditions for ü. The reason for that is that Patah (the symbol that goes under) is not an accent, but another letter. So in effect, although it appears as one symbol, it represents two letters.

I just tried using the Hebrew qwerty keyboard layout to type ?? and keydown reports ? and ? separately. Same thing for backspace, which erases each character separately. So for all intents and purposes although these characters appear to combine, they are two different entities treated as such by Xojo. Hebrew is a calligraphic script and letterforms may differ according to the placement of the letter in the phrase.

Denise_Adams · August 31, 2014, 7:48pm

Right. Yes that’s what I found too. My problem now is how to know if a typed or pasted string of two characters needs to be combined into a single string character object or treated as two separate string character objects. Is there a way in code to know this by examining the code points or is there a unicode reference that dictates the method/algorithm to use?

Michel_Bujardet · August 31, 2014, 8:05pm

The Unicode chart at http://www.unicode.org/charts/PDF/U0590.pdf shows the non-advancing characters as a dotted circle with the mark next to it (under, over, etc.). So you can assume that when it is one of these characters, it will combine with the preceding one.
Another approach is to measure stringwidth for every pair of characters. If the stringwidth for the two characters equals that of the first one alone, the second one is a combining one.
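
A third option, where a Unicode character database is available, is to ask for the character's general category instead of reading the chart or measuring widths. Here is a minimal sketch in Python (not Xojo) using the standard unicodedata module. The function name is_combining_mark is made up, and treating the Mn, Mc and Me categories as "combining" is a simplifying assumption.

```python
import unicodedata

def is_combining_mark(ch):
    # Assumption: a character "combines" if its general category is a mark:
    # Mn (nonspacing), Mc (spacing combining) or Me (enclosing).
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

print(is_combining_mark(chr(1463)))  # True  (Hebrew Patah)
print(is_combining_mark(chr(1492)))  # False (Hebrew He)
print(is_combining_mark("\u0301"))   # True  (COMBINING ACUTE ACCENT)
```

This covers marks in every script, not only the Hebrew block, without a hand-maintained range list.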

Denise_Adams · August 31, 2014, 9:03pm

Well I see that the “Combining Diacritical Marks” Unicode range is U+0300 to U+036F so will anything within this range (and the Hebrew vowel range etc) always be associated with a preceding base character or is that not always the case and are there other ranges to take into account?

Michel_Bujardet · August 31, 2014, 9:37pm

My knowledge of Hebrew is insufficient to answer that. From a typographical point of view, though, I should think that seeing a combining diacritical mark used by itself is exceptional in everyday use.

Fact is if you experiment with these characters by adding them one by one to a textfield, for instance, the combining one does always combine. Even when it is not supposed to : ??. If the system script behaves that way, there is no reason to treat them otherwise.

Denise_Adams · August 31, 2014, 9:47pm

Hebrew was just an example. There may be other languages. My method loops through each character one by one and calls g.drawstring(char) etc so I need a fool-proof way to determine all unicode character ranges in which a character within each respective range must combine with another base character before calling g.drawstring etc. Using g.stringwidth() to test combined widths (as you suggested in a previous reply) is not optimal and not a viable option.
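
One way to avoid both hard-coded ranges and width measurements is to group each base character with the marks that follow it before drawing. The sketch below is in Python (not Xojo) and relies on the standard unicodedata module. The function name split_into_clusters is hypothetical, and the rule "any character whose general category starts with M attaches to the previous character" is a simplification of the full grapheme cluster rules in Unicode Standard Annex #29.

```python
import unicodedata

def split_into_clusters(text):
    # Hypothetical helper: group each base character with the combining
    # marks (general category Mn, Mc or Me) that immediately follow it.
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster with this character
    return clusters

sample = "\u05d4\u05b7a\u0300b"  # He+Patah, a+grave accent, plain b
clusters = split_into_clusters(sample)
print([[ord(c) for c in cluster] for cluster in clusters])
# [[1492, 1463], [97, 768], [98]]
```

Each cluster can then be passed as one string to the drawing call, so a mark is never drawn without its base. A mark at the very start of the text has nothing to attach to and becomes its own cluster in this sketch.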