How to split a string into an array of words that ignores text direction

Jonathan_Ashwell · February 28, 2025, 6:25pm

I have a method that splits a string into words and then draws those words, one at a time, in a listbox CellTextPaint event. Like this:

var s as string = "Hello World"
var a() as string = s.split(" ")
//a(0) = "Hello"; a(1) = "World"

When I do the same with a RTL language, such as Hebrew, the “second” word (second from the left) is now at index 0

var s as string = "קךךם רךג"
var a() as string = s.split(" ")
//a(0) = "קךךם"; a(1) = "רךג"

I understand that there is some logic behind this, but it defeats the purpose of the method. I don’t care about the correct lexical order, just the left-to-right order. Unfortunately, I can’t roll my own split() because all string functions, like left, mid, Nthfield, etc., start counting from the right.

I know that I could do a reverse sort on the string array, but that assumes I know that the string is in a RTL language, which I don’t (the text can come from anywhere).

I’ve tried setting the string encoding to ASCII or even nil and performing a SplitBytes(), but that doesn’t give the LTR values, either.

Can someone think of an elegant way to get Xojo (with or without MBS) to return an array of words that uses the LTR order and ignores the lexical text direction?

kevin_g · February 28, 2025, 6:54pm

I don’t think there is anything built-in or available from MBS that would help.

My gut feeling is that you would need something like the ICU Boundary Analysis functions. I’ve logged feature requests over the years to make them available:
https://tracker.xojo.com/xojoinc/xojo/-/issues/77097
https://tracker.xojo.com/xojoinc/xojo/-/issues/51340

Kem_Tekinay · February 28, 2025, 7:15pm

I don’t think your request makes sense. The display order is visual only, but the bytes are unaffected, so Split is giving you the words in the order they appear in the string.

To put it another way, imagine if a Hebrew-speaking programmer asked if there was a way for “Hello World” to be split into a( 0 ) = "World" a( 1 ) = "Hello" because that’s how they are trained to read it. (Although, to them, they would read “dlroW olleH”.)

What you are looking to do is reverse the order of the words while preserving the displayed order of the characters, and you’ll have to do that manually.

Jonathan_Ashwell · February 28, 2025, 7:22pm

I said it makes sense lexically to reverse the order. But it makes it impossible to easily render the words as they appear visually. What I’m doing makes a great deal of sense in context. And I’m quite willing to do it manually. The question is how? As I said, none of the string functions I’m aware of let me treat the text as a “bag of bytes”. None of the avenues I’ve explored (changing encoding, trying to treat the text at the byte level, etc.) have worked. Given that I don’t know in advance what the text will be (RTL, LTR, or a mixture), I need a universal solution that ignores the lexical interpretation of the character order and can work with the order in which the characters are actually displayed. Suggestions welcome!

Jonathan_Ashwell · February 28, 2025, 8:21pm

It seems being able to detect the language of the string would be a major step in the right direction. I can’t find NLLanguageRecognizer in the MBS plugins, but maybe its capabilities are there under a different name. @Christian_Schmitz Is there an equivalent?

Robert_Bednar · February 28, 2025, 8:57pm

I’m probably just not getting what you want to do…but I had wanted to do “something similar” before. For instance…it’s easy to get the second “grouping” of characters split by some arbitrary character (like a space) from the LEFT …but it’s not as easy to code for the second to last grouping (extract count from the right) …because you don’t know how many groups there are. I wanted my own function where I could do group extractions from the LEFT or RIGHT. What I did was reverse the entire string…then perform normal LEFT to RIGHT extraction…and then…reverse the result.

For instance, with Source = “every good boy does fine”, I want the second- and third-to-last words → “boy does”. I didn’t want “does boy”.

So I reversed: “enif seod yob doog yreve” and picked up the second and third words from the LEFT:
“seod yob”, and then reversed the result: “boy does”

Any help?

Jonathan_Ashwell · February 28, 2025, 9:11pm

Thanks, but no. I can reverse the order of the array of words, of course, but at the moment it’s not clear how to get the text direction so I know it’s, say, Arabic (RTL) and not Chinese (LTR). That’s why I wondered in my last post if such a method is available. In MBS or elsewhere.

Christian_Schmitz · February 28, 2025, 9:14pm

You can check the ASC() codes of each character and see if they are in the range for Hebrew characters.

The key thing is that the OS when drawing them detects that they are RTL characters, so they are drawn in the other direction.

So Split() returns the right thing. But you need to find the last Hebrew thing and draw them in back order. (as far as I know)
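
A minimal sketch of that code point check, assuming API 2.0 (String.Characters, String.Asc). The function names are hypothetical, and the ranges are a rough approximation of the blocks used by right-to-left scripts (Hebrew, Arabic and their neighbours), not the full Unicode bidirectional class data:

```xojo
// Hypothetical helper: True if the code point lies in a block used by RTL scripts.
// Approximation only: e.g. Arabic-Indic digits fall in these ranges too,
// although the Bidi algorithm treats them as numbers rather than RTL letters.
Function IsRTLCodePoint(cp As Integer) As Boolean
  If cp >= &h0590 And cp <= &h08FF Then Return True // Hebrew, Arabic, Syriac, Thaana, NKo, ...
  If cp >= &hFB1D And cp <= &hFDFF Then Return True // Hebrew and Arabic presentation forms
  If cp >= &hFE70 And cp <= &hFEFC Then Return True // Arabic presentation forms B
  Return False
End Function

// Hypothetical helper: True as soon as any character of the text is RTL.
Function ContainsRTL(text As String) As Boolean
  For Each ch As String In text.Characters
    If IsRTLCodePoint(ch.Asc) Then Return True
  Next
  Return False
End Function
```

This avoids having to know the language: any script in those blocks is caught, and Latin, Greek, Chinese and so on come back False.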

Jonathan_Ashwell · February 28, 2025, 9:33pm

But it can be any language, I just chose Hebrew as an example. If you can detect the language you should be able to detect the writing direction. I thought it might be possible with NSAttributedStringMBS, but couldn’t get that to work. There is also a NSSpellCheckerMBS.languageForWordRange, but in the examples I tried to create the result was always an empty string.

Jonathan_Ashwell · February 28, 2025, 10:51pm

I think this will do the trick:

var language as string = NSLinguisticTaggerMBS.dominantLanguageForString("سيرة ديكارت منقولةً عن مجلة")
//language = "ar" (Arabic)

language = NSLinguisticTaggerMBS.dominantLanguageForString("Υπάρχουν εφτά μέρες σε μια εβδομάδα")
//language = "el" (Greek)

kevin_g · March 1, 2025, 10:29am

See the Unicode Bidi algorithm:

https://unicode.org/reports/tr9/

Michael_Hußmann · March 1, 2025, 10:46am

But the string functions do treat it that way – the string is treated as an ordered bag of bytes (or rather code points). So if you split a string the first word is always at index 0, the second word at index 1 and so on. Split always works the same, regardless of the language.

The problem is that you want to treat the string not in the logical order of its bytes (or code points), but in the order they are displayed, i.e. left to right for English or right to left for Hebrew or Arabic. But Split doesn’t know how the text will eventually be rendered on the screen, so it only looks at the bytes.

Michael_Hußmann · March 1, 2025, 1:55pm

So you want an array containing the words in left to right order as they get rendered on the screen (or the printed page for that matter). As Xojo (and string processing in general) doesn’t care about the direction in which the characters will eventually be written and only deals with their logical order – the first character is stored at the lowest address and the last character at the highest address – you need to take care of this yourself. You must determine in which direction each part of the text will get printed and reverse it if it will be written right to left.

This can get tricky if you have to deal with mixed languages. For example, in

“The expression “קךךם רךג” is Hebrew for “Thank you”.”

you only want to reverse the order of the two Hebrew words but not the English words.
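
One way this could be sketched in Xojo (API 2.0 assumed): split as usual, then reverse each run of consecutive right-to-left words and leave everything else in place. VisualWordOrder and HasRTLChar are hypothetical names, the code point ranges are only an approximation, and the sketch assumes a left-to-right base direction with words separated by single spaces. It is not an implementation of the full Unicode Bidi algorithm.

```xojo
// Hypothetical helper: True if the word contains a character from a block
// used by RTL scripts (approximate ranges for Hebrew, Arabic and neighbours).
Function HasRTLChar(word As String) As Boolean
  For Each ch As String In word.Characters
    Var cp As Integer = ch.Asc
    If (cp >= &h0590 And cp <= &h08FF) Or (cp >= &hFB1D And cp <= &hFDFF) Or (cp >= &hFE70 And cp <= &hFEFC) Then
      Return True
    End If
  Next
  Return False
End Function

// Returns the words in the order they appear on screen from left to right,
// assuming the surrounding text flows left to right.
Function VisualWordOrder(s As String) As String()
  Var words() As String = s.Split(" ")
  Var result() As String
  Var rtlRun() As String // consecutive RTL words, in logical order

  For Each word As String In words
    If HasRTLChar(word) Then
      rtlRun.Add(word)
    Else
      // An LTR word ends the current RTL run: emit the run reversed.
      For i As Integer = rtlRun.LastIndex DownTo 0
        result.Add(rtlRun(i))
      Next
      rtlRun.RemoveAll
      result.Add(word)
    End If
  Next

  // Emit a run that reaches the end of the string.
  For i As Integer = rtlRun.LastIndex DownTo 0
    result.Add(rtlRun(i))
  Next

  Return result
End Function
```

With this, "Hello World" comes back unchanged, a purely Hebrew or Arabic string comes back with its words reversed, and in the mixed sentence only the two Hebrew words swap places. Punctuation attached to a word simply travels with that word, which is one of the places where a real Bidi implementation would behave differently.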