String Encoding (Multi-Byte Letters)

kevin_g · January 23, 2020, 6:16pm

Thanks.
Note: the conversion to UTF-16 is not required, so the complete code is as follows:

Structure CFRange
  location As Integer
  length As Integer
End Structure

[code]Dim s As String
Dim a(-1) As String

s = "HelloThere"

' Splits the string into Unicode code points
a = Split(s, "")

' Splits the string into grapheme clusters (user-perceived characters)
Declare Function CFStringGetRangeOfComposedCharactersAtIndex Lib "CoreFoundation" (inString As Ptr, index As Integer) As CFRange

Dim cfStringObj As CFStringMBS
Dim count, i As Int32
Dim cfa(-1) As String
Dim theRange As CFRange

cfStringObj = New CFStringMBS(s)

count = cfStringObj.Len - 1
i = 0
While i <= count
  ' Ask CoreFoundation for the full composed-character range that contains index i
  theRange = CFStringGetRangeOfComposedCharactersAtIndex(Ptr(cfStringObj.Handle), i)

  cfa.Append(cfStringObj.Mid(theRange.location, theRange.length))

  ' Jump past the whole cluster, not just one UTF-16 unit
  i = i + theRange.length
Wend

Break[/code]

Martin_T · January 24, 2020, 12:05pm

Thanks Kevin and Scott for your input. Since I don’t use MBS Plugins, it’s not a real alternative for me, but thank you for showing us your way.

@Robin Lauryssen-Mitchell commented on <https://xojo.com/issue/58910> and closed it:

[quote]There is no quick solution for this issue.
The current String library is working as designed.
The best thing to do is raise a Feature Request specifying exactly what functionality you would like the String library to have.[/quote]
So I’ve created a feature request. Maybe someone wants to spend some points on it. Thanks a lot. <https://xojo.com/issue/58933>

TimStreater · January 24, 2020, 12:32pm

I’m a bit late to the party. But surely String is able to handle multi-byte characters? After all, it handles UTF-8, which is a multi-byte encoding. What am I missing here?

kevin_g · January 24, 2020, 12:45pm

Emojis often consist of several code points (each of them multi-byte in UTF-8) joined together by Unicode characters such as the zero width joiner (ZWJ, U+200D). The String datatype doesn’t understand this, as it belongs to a higher level of text processing.
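
To make that concrete, here is a minimal sketch (not from the original posts) that builds a "family" emoji from its individual code points and then looks at what Split returns. The variable names are only illustrative, and it assumes Split(s, "") breaks the string into code points, as described earlier in this thread.

[code]' Sketch: man + ZWJ + woman + ZWJ + girl, built from five code points (18 bytes in UTF-8)
Dim family As String
family = Encodings.UTF8.Chr(&h1F468) + Encodings.UTF8.Chr(&h200D) + _
  Encodings.UTF8.Chr(&h1F469) + Encodings.UTF8.Chr(&h200D) + _
  Encodings.UTF8.Chr(&h1F467)

' Split sees code points, not the single glyph the user perceives
Dim parts() As String = Split(family, "")

System.DebugLog("Array elements: " + Str(UBound(parts) + 1))
System.DebugLog("Bytes: " + Str(LenB(family)))

' List each element's code point in hex
For Each p As String In parts
  System.DebugLog("U+" + Hex(Asc(p)))
Next[/code]

On screen this is one glyph, but the array holds the separate people and the joiners between them.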

kevin_g · January 24, 2020, 12:46pm

Version without MBS - still macOS only though.

Structure CFRange
  location As Integer
  length As Integer
End Structure

[code]Dim s As String
Dim a(-1) As String

s = "HelloThere"

' Splits the string into Unicode code points
a = Split(s, "")

' Splits the string into grapheme clusters (user-perceived characters)
Declare Function CFStringGetLength Lib "CoreFoundation" (theString As CFStringRef) As Integer
Declare Function CFStringGetRangeOfComposedCharactersAtIndex Lib "CoreFoundation" (theString As CFStringRef, theIndex As Integer) As CFRange
Declare Function CFStringCreateWithSubstring Lib "CoreFoundation" (alloc As Ptr, str As CFStringRef, range As CFRange) As CFStringRef

Dim count, i As Int32
Dim cfa(-1) As String
Dim theRange As CFRange

count = CFStringGetLength(s) - 1
i = 0
While i <= count
  ' Ask CoreFoundation for the full composed-character range that contains index i
  theRange = CFStringGetRangeOfComposedCharactersAtIndex(s, i)

  cfa.Append(CFStringCreateWithSubstring(Nil, s, theRange))

  ' Jump past the whole cluster, not just one UTF-16 unit
  i = i + theRange.length
Wend

Break[/code]

Michael_Hußmann · January 24, 2020, 2:20pm

Indeed, strings do handle multi-byte characters just fine, so the title of the thread is misleading. This is actually about glyphs represented by multiple code points (each of which is often multi-byte, but that isn’t the issue here).

anon93744516 · January 24, 2020, 2:40pm

That’s right. In other words the issue appears to be identifying a grapheme cluster (as @Kevin Gale pointed out) and the use of zero width joiner characters (which can also be multi-byte) to come up with a compound character that has a String length of 2 or more.

It seems the Xojo String type isn’t the only place where this is a challenge: search for “grapheme clusters zero width joiner” at DuckDuckGo.

Interesting, eh?

Martin_T · January 25, 2020, 5:46pm

Interesting. Is there a way to reliably check if the current Array index + 1 is such a Zero Width character to simply skip it during processing? Are the sequence characters of a multi-byte character/Emoji in a certain value range that could be excluded? How would you proceed?

kevin_g · January 25, 2020, 6:58pm

You can test for the ZWJ character since it is just another Unicode code point.
For example,

[code]Dim kZWJ As String
Dim s As String
Dim a(-1) As String
Dim count As Int32

kZWJ = Encodings.UTF8.Chr(8205) ' U+200D, zero width joiner

s = "HelloThere"
a = Split(s, "")

' Walk backwards so removing elements does not disturb indexes still to be visited
count = UBound(a) - 1
While count > 0
  If a(count) = kZWJ Then
    ' Merge previous + ZWJ + next into a single element
    a(count - 1) = a(count - 1) + a(count) + a(count + 1)

    a.Remove(count + 1)
    a.Remove(count)
  End If

  count = count - 1
Wend

Break[/code]

Unfortunately, the rules are a lot more complicated than this, and to split correctly you really have to implement parts of the Unicode spec (the grapheme cluster boundary rules in Unicode Standard Annex #29). The Text datatype and the macOS CFString example I posted work at a level above String and understand those rules.
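
As a rough illustration of how the rules pile up, the sketch below extends the ZWJ check to a few more code point ranges that normally attach to the preceding character: combining diacritical marks, variation selectors and emoji skin tone modifiers. IsExtender is a hypothetical helper, not a Xojo function, and the ranges are far from complete (flags built from regional indicators and Hangul jamo, for example, are not handled), so treat it as a starting point rather than a correct grapheme splitter.

[code]' Hypothetical helper: True if the code point normally attaches to the previous one
Function IsExtender(cp As Integer) As Boolean
  If cp >= &h0300 And cp <= &h036F Then Return True   ' combining diacritical marks
  If cp >= &hFE00 And cp <= &hFE0F Then Return True   ' variation selectors
  If cp >= &h1F3FB And cp <= &h1F3FF Then Return True ' emoji skin tone modifiers
  Return False
End Function

' Test string: woman + skin tone + ZWJ + laptop, followed by "!"
Dim s As String
s = Encodings.UTF8.Chr(&h1F469) + Encodings.UTF8.Chr(&h1F3FD) + _
  Encodings.UTF8.Chr(&h200D) + Encodings.UTF8.Chr(&h1F4BB) + "!"

Dim kZWJ As Integer = &h200D
Dim a() As String = Split(s, "")
Dim clusters() As String
Dim joinNext As Boolean = False

For Each cp As String In a
  Dim code As Integer = Asc(cp)
  If clusters.Ubound >= 0 And (joinNext Or code = kZWJ Or IsExtender(code)) Then
    ' Attach to the cluster that is currently being built
    clusters(clusters.Ubound) = clusters(clusters.Ubound) + cp
  Else
    ' Start a new cluster
    clusters.Append(cp)
  End If
  ' The code point following a ZWJ belongs to the same cluster
  joinNext = (code = kZWJ)
Next

System.DebugLog("Clusters: " + Str(clusters.Ubound + 1))[/code]

Assuming Split returns one element per code point, the five code points in the test string should end up in two array elements: the emoji sequence and the exclamation mark.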

Ulrich_Bogun · January 26, 2020, 3:28pm

I was wondering if the additional text encoding properties (base, format, variant) could be helpful, but in contrast to the documentation the compiler says they are read-only. But couldn’t a ConvertEncoding to one of the Unicode Composition Encodings be a way around difficult byte checks?

kevin_g · January 26, 2020, 3:48pm

I’m no expert on the subject, but I don’t think the problem can be solved by encoding. Unicode allows certain code points to be combined to create other characters, and if your text processing doesn’t understand those rules you will just get the individual code points. The same thing occurs with decomposed accented characters and with other writing systems such as Korean.
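
A small sketch of the accented-character and Korean cases, again assuming Split(s, "") returns one element per code point. The same visible character is built once as a single precomposed code point and once as a decomposed sequence.

[code]' "é" as one precomposed code point vs. "e" + combining acute accent
Dim composed As String = Encodings.UTF8.Chr(&hE9)
Dim decomposed As String = "e" + Encodings.UTF8.Chr(&h301)

' Hangul syllable U+D55C vs. its three conjoining jamo
Dim syllable As String = Encodings.UTF8.Chr(&hD55C)
Dim jamo As String = Encodings.UTF8.Chr(&h1112) + Encodings.UTF8.Chr(&h1161) + Encodings.UTF8.Chr(&h11AB)

' Each pair looks the same on screen, but Split reports different element counts
System.DebugLog("composed: " + Str(UBound(Split(composed, "")) + 1))
System.DebugLog("decomposed: " + Str(UBound(Split(decomposed, "")) + 1))
System.DebugLog("syllable: " + Str(UBound(Split(syllable, "")) + 1))
System.DebugLog("jamo: " + Str(UBound(Split(jamo, "")) + 1))[/code]

Normalizing to a composed form would merge cases like these, but there is no precomposed code point for a ZWJ emoji sequence, so it would not help with the emoji problem.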

Martin_T · February 1, 2020, 2:59pm

I will try again today, as the days of the Text type are numbered.
If I have a string, say “Hello <> World”, and use Split("") to get all the single letters, the emoji doesn’t come back as a single letter. Instead I get several elements in the array that only together make up this one emoji. The Text datatype was able to take this into account. What I’m looking for is a reliable way to get the same results as Text.Split, because it does calculations to find the position of a letter within this string. So far, however, the result is distorted as soon as multi-byte emojis/characters are contained in a string.
Does anyone have a suggestion on how to make this work?