DrawString and Double-Byte Characters

Michel_Bujardet · August 31, 2014, 9:58pm

The only foolproof way is to look at what the characters do when typed, because engineers spent countless hours figuring out the behavior of characters and the multiple combinations they can have. Hebrew is bad enough, but when you go into highly combining scripts such as Korean, a single displayed character can be built from three or four. I frankly do not think it is wise to try to second-guess the system input methods the way you seem to want to proceed. Why not employ the regular controls such as TextFields and TextAreas for text entry, and let the system deal with combining characters? Why insist on gathering entry character by character? It may work for Roman script in English, but it already becomes shaky when ligatures come into play in some European languages, such as AE and OE. With calligraphic or ideographic scripts, I would not venture into it.

Will_Shank · September 1, 2014, 2:15am

This is a very interesting discussion, I’ve learned a lot. I don’t know the best way to discover when letters combine, but here’s a brute-force strategy. Basically DrawString is doing some magic and you need a way to know when and where that magic happens. So build the full string letter by letter and measure its width each time to discover when it advances. Again, a lot of this is new to me, and I don’t know if there are singular/non-combining letters that don’t advance, in which case this won’t work. This is my test to check that StringWidth works for this…

[code]Sub Action() //Pushbutton

dim input As String = "ab" + Encodings.UTF8.Chr(1492) + Encodings.UTF8.Chr(1463) + "cd"

dim saSrc() As String = input.Split("") //break into source ‘letters’

log("saSrc.Ubound: " + Str(saSrc.Ubound))

dim p As new Picture(1, 1)
dim g As Graphics = p.Graphics

dim saBuild(), sBuilt As String
for i As integer = 0 to saSrc.Ubound
saBuild.Append saSrc(i) //add a letter from source
sBuilt = Join(saBuild, "") //join into single string
log(Str(i) + ") " + Format(g.StringWidth(sBuilt), "0.000#") + " " + sBuilt) //measure
next

End Sub

Private Sub log(s As String)
ta.AppendText s + EndOfLine
End Sub

//plus have TextArea named ta

result…

saSrc.Ubound: 5
0) 6.627 a
1) 14.1797 ab
2) 22.7227 ab?
3) 22.7227 ab??
4) 28.8691 ab??c
5) 36.4219 ab??cd[/code]

So, since 3 has the same width as 2 you can assume they’ve combined?
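
The same-width idea can be turned into a grouping routine. What follows is only a sketch: SplitByAdvance is a hypothetical helper name, and it assumes that a code point which does not change StringWidth has combined with the one before it. Ligature substitution and kerning can change the width in other ways, so this heuristic will not catch every case.

[code]// Hypothetical helper: groups code points into clusters using only
// Graphics.StringWidth. Assumption: "no advance" means "combined".
Function SplitByAdvance(g As Graphics, source As String) As String()
  Dim letters() As String = source.Split("") // one entry per character
  Dim clusters() As String
  Dim built As String
  Dim lastWidth As Double = 0
  
  For i As Integer = 0 To letters.Ubound
    built = built + letters(i)
    Dim w As Double = g.StringWidth(built)
    If w = lastWidth And clusters.Ubound >= 0 Then
      // No advance: attach this code point to the previous cluster
      clusters(clusters.Ubound) = clusters(clusters.Ubound) + letters(i)
    Else
      clusters.Append letters(i)
    End If
    lastWidth = w
  Next
  
  Return clusters
End Function[/code]

With the widths posted above (entries 2 and 3 both 22.7227), this would return five clusters for the six code points of the test string.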

Michel_Bujardet · September 1, 2014, 9:48am

I offered this suggestion above (without your nice code), but Denise objected:

I believe the same width approach is valid. But ultimately, given the possible complexity of elaborate scripts in Eastern and Asian languages, I would rather tend to trust the numerous years put into script management by system engineers.

I have no idea of the exact project, but the need for precise measurement suggests the use of text in a graphic environment. The character-by-character method works in English, but can be thrown off balance by the automatic replacement of two characters by one ligature in other Roman-script European languages. For instance, AE typed as two characters becomes Æ in some Nordic languages, or oe becomes œ in French. Because each character’s width has a fractional part, adding up individual measurements can create problems even in English, as in the Xojo IDE; see https://forum.xojo.com/13571-caret-issues

The best way to benefit from the system script management would be a word-by-word approach, as I suspect the need to measure width may be related to the need to wrap. Detecting words is fairly easy, as most languages written horizontally use at least space and period as separators, plus possibly others like comma and semicolon. It is far easier to manage an entire word already composed by the system script manager, with all the possible replacements, than to guess what will be non-spaced or recombined. At first glance, it seems possible to use an invisible TextField to buffer the text until the word has been completed, then add it to the string to display, and decide to wrap if needed. In terms of logic, managing entire words or individual characters is equivalent; only the number of characters involved is different. One or many, the entity remains a string.
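
A rough sketch of that word-by-word wrapping, for illustration only: DrawWrapped and its parameters are hypothetical names, words are split on spaces only, and left-to-right layout is assumed. Each line is measured and drawn as a whole string, so the system composes ligatures and combining marks itself.

[code]// Hypothetical helper: draws text word by word, wrapping at maxWidth.
// Assumptions: space is the only separator, text runs left to right.
Sub DrawWrapped(g As Graphics, source As String, x As Integer, y As Integer, maxWidth As Integer)
  Dim words() As String = source.Split(" ")
  Dim lineText As String
  Dim lineY As Integer = y
  
  For Each oneWord As String In words
    Dim candidate As String
    If lineText = "" Then
      candidate = oneWord
    Else
      candidate = lineText + " " + oneWord
    End If
    
    If lineText <> "" And g.StringWidth(candidate) > maxWidth Then
      g.DrawString(lineText, x, lineY) // flush the finished line
      lineY = lineY + g.TextHeight
      lineText = oneWord // start the next line with this word
    Else
      lineText = candidate
    End If
  Next
  
  // Draw whatever remains on the last line
  If lineText <> "" Then g.DrawString(lineText, x, lineY)
End Sub[/code]

Because only complete lines are ever measured, the fractional widths of individual characters never have to be added up.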

Denise_Adams · September 1, 2014, 2:48pm

Thanks everyone. It seems there is no simple solution, but I intend to create a reference to all Unicode non-spacing characters, combined characters, etc. When one appears after a base character, I will just concatenate it onto that character’s string so the combined values are drawn together.

Norman_P · September 1, 2014, 5:40pm

That will be an enormous list.
Why is it you want to draw these as single characters? (I’ve read the thread but maybe missed that.)
And, fun fun fun, there are going to be languages where you have 1, 2, 3, or more “characters” that combine, then don’t, then do, etc.
Some Asian languages do this.

Denise_Adams · September 1, 2014, 6:45pm

Doing it this way has just proved optimal for my purposes. I’ll leave kerning pairs for another day lol!

Michel_Bujardet · September 1, 2014, 7:16pm

[quote=126244:@Norman Palardy]That will be an enormous list.
Why is it you want to draw these as single characters? (I’ve read the thread but maybe missed that.)
And, fun fun fun, there are going to be languages where you have 1, 2, 3, or more “characters” that combine, then don’t, then do, etc.
Some Asian languages do this.[/quote]

Even simply in Hebrew, there is a lot of fun, when character forms change according to their place in the phrase or in the word. The system script manager that takes care of that took years to work perfectly. Forgot to mention it. All the reasons to not reinvent the wheel …

Norman_P · September 1, 2014, 7:21pm

That’s more or less my point.
If you just “drawstring” the entire string rather than going byte by byte or character by character, you will avoid a LOT of grief, as it uses the built-in OS drawing routines and you don’t have to try & reinvent the wheel (and it’s a VERY complex wheel).

Norman_P · September 1, 2014, 7:25pm

Not knowing what/why you even want to go down this path, which is fraught with a lot of ways to do things very wrong & an incredible amount of complexity to do it right, it’s hard to make any suggestions that might save you a LOT of pain & suffering.

Let’s say this: the code editor can handle all these fun combinations & we do not try & draw things byte by byte. We do it as strings, runs of characters that all share common attributes (style, color, etc.). That’s much simpler and easier to get right.
The one issue we do have is positioning the insertion point, but that’s a bug we can fix much more easily than trying to get composed/decomposed characters drawing properly.

Michel_Bujardet · September 1, 2014, 7:38pm

Denise, just for the record, these are not kerning pairs. They are individual glyph characters that are substituted by the script manager when AE and oe are typed. So in effect what happens has nothing to do with kerning or composite characters where you keep two characters. Here, the two characters are replaced by one. So no amount of measurement on each individual character or the same width method (that works for non advancing characters) would work.

I have been doing fonts and non-Roman scripts since 1987, and I tell you: the way Mac OS X and Windows handle complex scripts is not to be underestimated. No character-by-character quick fix can replace what script managers do for you, if only you let them.

Denise_Adams · September 1, 2014, 11:10pm

Yes, I see both your points and they are very valid. But say you want to adjust style on a character-by-character basis and are drawing to a Canvas, not using a TextArea? Using DrawString for an entire word wouldn’t work if the “o” in “word” was bold… right? So how would you get around that?

Norman_P · September 2, 2014, 2:14am

I’d have a peek at “styled text” and how it breaks things into “runs of characters all of a consistent style”.
Nothing says a style run has to be one word. It can be part of a word.
The point is YOU’RE not trying to figure out what runs of Unicode code points make up what character. You JUST drawstring, and if someone sets a style run to start at character 6 and stop at character 7 (meaning that style is one character), you drawstring those characters & are done.

Either way - I wouldn’t do what you’re doing to figure out and draw “o” bold in the middle of a word.
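
As an illustration of drawing by style run, here is a minimal sketch. DrawRuns is a hypothetical helper; it assumes a single left-to-right line and that every run carries a usable Font and Size. Each run goes to DrawString as one string, so the OS still composes the characters inside it.

[code]// Hypothetical helper: draws each style run of a StyledText on one line.
// Assumptions: single line, left-to-right text, runs have Font and Size set.
Sub DrawRuns(g As Graphics, st As StyledText, x As Integer, y As Integer)
  Dim penX As Double = x
  For i As Integer = 0 To st.StyleRunCount - 1
    Dim sr As StyleRun = st.StyleRun(i)
    // Copy the run's attributes onto the graphics context
    g.TextFont = sr.Font
    g.TextSize = sr.Size
    g.Bold = sr.Bold
    g.Italic = sr.Italic
    g.Underline = sr.Underline
    g.ForeColor = sr.TextColor
    g.DrawString(sr.Text, penX, y) // the OS composes the whole run
    penX = penX + g.StringWidth(sr.Text) // advance by the run's width
  Next
End Sub[/code]

A bold “o” in “word” becomes three runs (“w”, “o”, “rd”). Note that a run boundary placed between a base character and its combining mark, or inside a ligature, would still change how the text is shaped.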

Michel_Bujardet · September 2, 2014, 6:07am

Is it really a common issue you are encountering ?

Even so, nothing prevents you from using DrawString for w, then o, then rd. That method would enable writing the French word œil with the initial ligature without having to worry about the script rules.

Michel_Bujardet · September 2, 2014, 6:32am

I realize reading the above that the example of word is not valid. My point is really that it is futile to draw character by character when most of the drawing will involve larger strings. And having a change of style within a word is rare in comparison to the general way text is used.

Whatever …

J_Andrew_Lipscomb · September 2, 2014, 12:59pm

No need to reinvent the wheel:
the Unicode Consortium already created that database.
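
To give an idea of what a lookup against that data might look like, here is a deliberately tiny sketch. IsCombiningMark is a hypothetical helper, and the two hard-coded ranges are only a sample: a real implementation would load the general categories (Mn, Mc, Me) from the consortium’s UnicodeData.txt file rather than list ranges by hand.

[code]// Hypothetical helper: True if the first code point of ch is a
// non-spacing mark. Only two sample ranges are covered here.
Function IsCombiningMark(ch As String) As Boolean
  Dim cp As Integer = Asc(ch) // code point of the first character
  If cp >= &h0300 And cp <= &h036F Then Return True // Combining Diacritical Marks
  If cp >= &h05B0 And cp <= &h05BD Then Return True // Hebrew points
  Return False
End Function[/code]

Code point 1463 (&h05B7) from the earlier test falls in the second range, while 1492 (&h05D4) is a base letter and does not.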

Denise_Adams · September 2, 2014, 7:16pm

Yes thank you. There is data available that can be referenced for my needs. I will look into using StyledText too.

Michel_Bujardet · September 2, 2014, 7:53pm

DrawInto can be a powerful tool as well…