Function UTF8StringToArray(value As String) As String
  ' Converts UTF-8 strings to codepoints array
  Var out() As String
  Var length As Integer = value.Length
  Var uni, h, c As Integer
  For i As Integer = 0 To length - 1
    c = value.Middle(i + 1, 1).Asc
    uni = -1
    h = value.Middle(i, 1).Asc
    If h <= &h7F Then
      uni = h
    Elseif h >= &hC2 Then
      If h <= &hDF And i < length - 1 Then
        uni = Bitwise.ShiftLeft(Bitwise.BitAnd(h, &h1F), 6) Or _
          Bitwise.BitAnd(c, &h3F)
      Elseif h <= &hEF And i < length - 2 Then
        uni = Bitwise.ShiftLeft(Bitwise.BitAnd(h, &h0F), 12) Or _
          Bitwise.ShiftLeft(Bitwise.BitAnd(c, &h3F), 6) Or _
          Bitwise.BitAnd(c, &h3F)
      Elseif h <= &hF4 And i < length - 3 Then
        uni = Bitwise.ShiftLeft(Bitwise.BitAnd(h, &h0F), 18) Or _
          Bitwise.ShiftLeft(Bitwise.BitAnd(c, &h3F), 12) Or _
          Bitwise.ShiftLeft(Bitwise.BitAnd(c, &h3F), 6) Or _
          Bitwise.BitAnd(c, &h3F)
      End If
      If uni >= 0 Then
        out.Add(uni.ToString)
      End If
    End If
  Next
  Return String.FromArray(out, " ")
End Function
Maybe I don’t understand the problem. What are you trying to do, and where is the issue you encounter? I’ve taken your function, tossed it out, and used Xojo’s Split function to entirely replace it. More info?
For FPDF there is an extension that lets the library handle UTF-8 strings. To do this, three methods were added to the PHP library, this one among them, and I am trying to port it to Xojo so that RSPDF also supports UTF-8.
Looks to me like the method wants to go through the bytes, check that the byte plus following ones make a UTF-8 character, and then store that in the array. At the end you join these characters together with each one separated by a space.
But while PHP deals with bytes in a string, Xojo deals with characters, so I’m not sure you can directly rewrite this PHP code as Xojo. And as @Anthony_G_Cyphers says, this can be done with split().
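To make the byte-level reading described above concrete, here is a minimal sketch in Python (not Xojo, and not the FPDF source). The name `utf8_to_codepoints` is made up for illustration. It works on raw bytes rather than characters, reads each continuation byte separately, and advances the index past the bytes it consumed; like the posted function, it does not verify that continuation bytes are well formed.

```python
def utf8_to_codepoints(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into code points, skipping bytes that cannot start a sequence."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        h = data[i]          # lead byte
        uni = -1
        if h <= 0x7F:        # 1-byte sequence (ASCII)
            uni = h
        elif h >= 0xC2:
            if h <= 0xDF and i < n - 1:      # 2-byte sequence
                uni = (h & 0x1F) << 6 | (data[i + 1] & 0x3F)
                i += 1
            elif h <= 0xEF and i < n - 2:    # 3-byte sequence
                uni = ((h & 0x0F) << 12
                       | (data[i + 1] & 0x3F) << 6
                       | (data[i + 2] & 0x3F))
                i += 2
            elif h <= 0xF4 and i < n - 3:    # 4-byte sequence
                uni = ((h & 0x0F) << 18
                       | (data[i + 1] & 0x3F) << 12
                       | (data[i + 2] & 0x3F) << 6
                       | (data[i + 3] & 0x3F))
                i += 3
        if uni >= 0:         # applies to ASCII and multibyte results alike
            out.append(uni)
        i += 1
    return out


# Joined with spaces, as the posted function returns its result.
joined = " ".join(str(cp) for cp in utf8_to_codepoints("aé€😀".encode("utf-8")))
```

In a character-based language the same loop has to run over the string's bytes, not its characters, or the lead-byte tests never see the values they expect.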
I have a method which validates that a string is UTF-8; the first thing I do is use SplitBytes() on it.
As far as I understand, PDF raw data always consists only of characters in the ASCII range. So it is a matter of converting Unicode strings into PDF-readable ASCII encodings. But maybe @Javier_Menendez knows more about this!
Aha, but then this information already helps. Then it would probably be smart to pass the string to a MemoryBlock and then use the MemoryBlock to cycle through the bytes, right?
You could do that. As I said, in my method I use SplitBytes and produce a String array from that. Then I treat each element of this array as a byte for masking to ensure it’s within range.
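For anyone wondering what such a byte-masking check might look like, here is a rough Python sketch. It is an assumption about the general idea, not the method mentioned above, and `looks_like_utf8` is a hypothetical name. It only checks lead-byte ranges and the `10xxxxxx` pattern of continuation bytes, so it is looser than a full validator (it does not reject overlong forms or surrogates, for example).

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural check: valid lead bytes followed by the right number of continuation bytes."""
    i, n = 0, len(data)
    while i < n:
        h = data[i]
        if h <= 0x7F:
            need = 0                 # ASCII, no continuation bytes
        elif 0xC2 <= h <= 0xDF:
            need = 1
        elif 0xE0 <= h <= 0xEF:
            need = 2
        elif 0xF0 <= h <= 0xF4:
            need = 3
        else:
            return False             # byte cannot start a sequence
        tail = data[i + 1 : i + 1 + need]
        if len(tail) != need:
            return False             # sequence is truncated
        if any((b & 0xC0) != 0x80 for b in tail):
            return False             # continuation byte is not 10xxxxxx
        i += 1 + need
    return True
```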
If I’m reading it correctly, your PHP function interprets the bytes of a UTF-8-encoded string and returns an array of the Unicode code points it represents.
Xojo, being a Unicode-aware language, makes it unnecessary to jump through these hoops, so your function is as simple as this:
Function ToCodePoints(Extends s As String) As Integer()
  Var arr() As Integer
  For Each char As String In s.Split("")
    arr.Add(char.Asc)
  Next
  Return arr
End Function
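The same idea in another Unicode-aware language, as a small Python illustration (the name `to_code_points` is only for this example): once the runtime already treats a string as a sequence of characters, getting code points is a per-character lookup and no bit masking is needed.

```python
def to_code_points(s: str) -> list[int]:
    """Return the Unicode code point of every character in s."""
    return [ord(ch) for ch in s]
```

Characters outside ASCII come back as single values here, because the string is iterated by character, not by byte.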
Humph. Perhaps the asc() function needs a new name, such as ToUnicodeValue(). I wouldn’t have automatically turned to asc() to give me the Unicode code point value for a character.
So, I did some more research on this. It looks like the code only works for characters between 0..255, but not for multibyte characters. This seems to be due to the PDF standard, which requires you to generate CMap font definitions for multibyte characters (Japanese, Thai, emojis, etc.). All very, very complicated. I very much hope that @Xojo can offer us something like this for PDFDocument very soon.
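A quick way to see which characters in a given text fall outside that 0..255 range is sketched below in Python. The helper name `chars_outside_single_byte` is hypothetical, and it only reports the characters; it does nothing about CMap generation.

```python
def chars_outside_single_byte(text: str) -> list[str]:
    """Return the distinct characters with a code point above 255, in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 0xFF and ch not in seen:
            seen.append(ch)
    return seen
```

An empty result means every character in the text has a code point of 255 or below.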