Thanks.
NOTE. The conversion to UTF-16 is not required so the entire code would be as follows:
Structure CFRange
location As Integer
length As Integer
End Structure
[code]Dim s As String
Dim a(-1) As String
s = "HelloThere"
'splits the string into Unicode code points
a = Split(s, "")
'splits the string into grapheme clusters (user-perceived characters)
Declare Function CFStringGetRangeOfComposedCharactersAtIndex Lib "CoreFoundation" (inString As Ptr, index As Integer) As CFRange
Dim cfStringObj As CFStringMBS
Dim count, i As Int32
Dim cfa(-1) As String
Dim theRange As CFRange
cfStringObj = New CFStringMBS(s)
count = cfStringObj.Len - 1
i = 0
While i <= count
  'get the range of the whole grapheme cluster that contains index i
  theRange = CFStringGetRangeOfComposedCharactersAtIndex(Ptr(cfStringObj.Handle), i)
  cfa.Append(cfStringObj.Mid(theRange.location, theRange.length))
  'jump past the cluster just collected
  i = i + theRange.length
Wend[/code]
[quote]There is no quick solution for this issue.
The current String library is working as designed.
The best thing to do is raise a Feature Request specifying exactly what functionality you would like the String library to have.[/quote]
So I’ve created a feature request. Maybe someone wants to spend some points on it. Thanks a lot. <https://xojo.com/issue/58933>
I’m a bit late to this thread. But surely String is able to handle multi-byte characters? After all, it handles UTF-8, which is a multi-byte encoding. What am I missing here?
Emojis often consist of several code points (each of which is often multi-byte) joined together using Unicode characters such as the zero width joiner (ZWJ). The String datatype doesn’t understand this, as it belongs to a higher level of text processing.
Structure CFRange
location As Integer
length As Integer
End Structure
[code]Dim s As String
Dim a(-1) As String
s = "HelloThere"
'splits the string into Unicode code points
a = Split(s, "")
'splits the string into grapheme clusters (user-perceived characters)
Declare Function CFStringGetLength Lib "CoreFoundation" (theString As CFStringRef) As Integer
Declare Function CFStringGetRangeOfComposedCharactersAtIndex Lib "CoreFoundation" (theString As CFStringRef, theIndex As Integer) As CFRange
Declare Function CFStringCreateWithSubstring Lib "CoreFoundation" (alloc As Ptr, str As CFStringRef, range As CFRange) As CFStringRef
Dim count, i As Int32
Dim cfa(-1) As String
Dim theRange As CFRange
count = CFStringGetLength(s) - 1
i = 0
While i <= count
  'get the range of the whole grapheme cluster that contains index i
  theRange = CFStringGetRangeOfComposedCharactersAtIndex(s, i)
  cfa.Append(CFStringCreateWithSubstring(Nil, s, theRange))
  'jump past the cluster just collected
  i = i + theRange.length
Wend[/code]
Indeed, strings do handle multi-byte characters just fine, so the title of the thread is misleading. This is actually about glyphs represented by multiple code points (each of which is often multi-byte, but that isn’t the issue here).
That’s right. In other words the issue appears to be identifying a grapheme cluster (as @Kevin Gale pointed out) and the use of zero width joiner characters (which can also be multi-byte) to come up with a compound character that has a String length of 2 or more.
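To make the distinction concrete, here is a minimal sketch that builds one emoji from its code points and counts what Split returns. The emoji chosen (woman technologist) and the variable names are only for illustration, and the sketch assumes that Split(s, "") splits on code points as described earlier in the thread.
[code]Dim s As String
Dim a(-1) As String
'WOMAN (U+1F469) + ZERO WIDTH JOINER (U+200D) + PERSONAL COMPUTER (U+1F4BB)
s = Encodings.UTF8.Chr(&h1F469) + Encodings.UTF8.Chr(&h200D) + Encodings.UTF8.Chr(&h1F4BB)
a = Split(s, "")
'one user-perceived character, but several array elements and even more bytes
System.DebugLog("Elements from Split: " + Str(UBound(a) + 1))
System.DebugLog("Bytes in UTF-8: " + Str(LenB(s)))[/code]
If Split behaves as described, the array should hold one element per code point rather than a single element for the emoji.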
Interesting. Is there a way to reliably check if the current Array index + 1 is such a Zero Width character to simply skip it during processing? Are the sequence characters of a multi-byte character/Emoji in a certain value range that could be excluded? How would you proceed?
You can test for the ZWJ character since it is just another Unicode code point.
For example,
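[code]Dim kZWJ As String
Dim s As String
Dim a(-1) As String
Dim count As Int32
'U+200D ZERO WIDTH JOINER
kZWJ = Encodings.UTF8.Chr(8205)
s = "HelloThere"
a = Split(s, "")
'walk backwards so that removing elements does not shift the indexes still to be visited
count = UBound(a) - 1
While count > 0
  If a(count) = kZWJ Then
    'merge the neighbours on both sides of the ZWJ into one element
    a(count - 1) = a(count - 1) + a(count) + a(count + 1)
    a.Remove(count + 1)
    a.Remove(count)
  End If
  count = count - 1
Wend
Break[/code]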
Unfortunately, the rules are a lot more complicated than this and to split correctly you really have to implement parts of the Unicode spec. The Text datatype and the macOS CFString example I posted are working at a level above String and understand the rules.
I was wondering if the additional text encoding properties (base, format, variant) could be helpful, but in contrast to the documentation the compiler says they are read-only. But couldn’t a ConvertEncoding to one of the Unicode Composition Encodings be a way around difficult byte checks?
I’m no expert on the subject but I don’t think the problem can be solved by encoding. Unicode allows certain code points to be combined to create other characters and if your text processing doesn’t understand those rules you will just get the individual code points. Same thing occurs with decomposed accented characters and other writing systems such as Korean.
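As a rough illustration of the decomposed accent case, the sketch below builds "é" in two ways and counts the elements Split returns for each. The variable names are made up for the example, and it again assumes Split(s, "") splits on code points as described earlier in the thread.
[code]Dim composed, decomposed As String
Dim a(-1), b(-1) As String
'"é" as a single precomposed code point (U+00E9)
composed = Encodings.UTF8.Chr(&hE9)
'"é" as "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)
decomposed = Encodings.UTF8.Chr(&h65) + Encodings.UTF8.Chr(&h301)
a = Split(composed, "")
b = Split(decomposed, "")
System.DebugLog("Composed elements: " + Str(UBound(a) + 1))
System.DebugLog("Decomposed elements: " + Str(UBound(b) + 1))[/code]
Both strings display the same letter, and no ZWJ is involved in the decomposed form, so a check for the ZWJ alone would not rejoin it.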
I will try again today, as the days of the Text type are numbered.
If I have a string, say “Hello <> World”, and use Split("") to get all the single letters, I don’t get the emoji as a single letter. Instead I get several characters in the array, which together make up this one emoji. The Text datatype was able to take this into account. What I’m looking for is a reliable way to get the same results as Text.Split, because I need it for calculations that find the position of a letter within this string. So far, however, the result is distorted as soon as a string contains multi-byte emojis or characters.
Does anyone have a suggestion on how to make this work?