I have string data that contains UTF-8 characters that are 2 or 3 bytes in size.
For example, I might get a byte sequence of 0xE2 0x84 0xA2.
When I check LenB(s) I get 3, so I know which characters I need to deal with,
but how can I convert those 3 bytes to the proper Unicode code point,
which in this case is 0x2122?
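(Working it through by hand as a check: 0xE2 = 1110 0010, 0x84 = 10 000100, 0xA2 = 10 100010; dropping the UTF-8 marker bits leaves 0010, 000100, and 100010, which regroup to 0010 0001 0010 0010 = 0x2122.)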
ok… I found the equation… but I'm not sure how to extract the AscB value from within the char.
Here is what I have:
For i = Len(s) DownTo 1
  c = Mid(s, i, 1)
  Select Case LenB(c)
  Case 1
    Continue
  Case 2 // 110yyyyy 10xxxxxx -> 00000yyy yyxxxxxx
    u = ((AscB(Left(c, 1)) And &h1F) * 64) + (AscB(Right(c, 1)) And &h3F)
  Case 3 // 1110zzzz 10yyyyyy 10xxxxxx -> zzzzyyyy yyxxxxxx
    u = ((AscB(Left(c, 1)) And &h0F) * 4096) + ((AscB(Mid(c, 2, 1)) And &h3F) * 64) + (AscB(Right(c, 1)) And &h3F)
  End Select
  MsgBox Str(LenB(c)) + "=" + Hex(u) + ":" + s
Next i
but Left, Right, and Mid don't work on the bytes within a single multibyte Unicode character,
so Len(c) = 1 while LenB(c) = 2 (or 3).
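One way around that might be to walk the raw bytes with the byte-wise variant MidB, which indexes by bytes rather than characters. A rough, untested sketch of the same arithmetic:

dim i as integer = 1
dim u as integer
while i <= LenB(s)
  dim b as integer = AscB(MidB(s, i, 1))
  if b < &h80 then // single-byte (ASCII) character
    i = i + 1
  elseif (b And &hE0) = &hC0 then // lead byte of a 2-byte sequence
    u = ((b And &h1F) * 64) + (AscB(MidB(s, i + 1, 1)) And &h3F)
    i = i + 2
  elseif (b And &hF0) = &hE0 then // lead byte of a 3-byte sequence
    u = ((b And &h0F) * 4096) + ((AscB(MidB(s, i + 1, 1)) And &h3F) * 64) + (AscB(MidB(s, i + 2, 1)) And &h3F)
    i = i + 3
  end if
wend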
DefineEncoding is not supposed to change the bytes of the string, nor should it in this case. The bytes, as you describe them, are valid UTF-8, so DefineEncoding just tells the Xojo framework to use that encoding to interpret them. Once the encoding is known, the framework will interpret the bytes properly.
To put this into code:
dim s as string
s = ChrB( &hE2 ) + ChrB( &h84 ) + ChrB( &hA2 )
// s has a nil encoding, Xojo thinks these are just bytes
s = s.DefineEncoding( Encodings.UTF8 )
// s's encoding is now properly defined
MsgBox s // displays ™
Or store the ™ symbol in a string and examine its bytes in the debugger. You’ll find they match.
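For example, a quick (untested) way to dump the bytes in code:

dim t as string = "™" // Xojo string literals are UTF-8
dim out as string
for i as integer = 1 to LenB(t)
  out = out + Hex(AscB(MidB(t, i, 1))) + " "
next
MsgBox out // E2 84 A2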
you are missing the intent… I NEED the code point to place in another file (that will NOT be processed by Xojo)
s = ChrB( &hE2 ) + ChrB( &h84 ) + ChrB( &hA2 )
is NOT acceptable to the destination,
but
s = "\20442" (the code point &h2122 written out as its OCTAL value)
IS acceptable.
So I have to translate from the UTF-8 bytes to the Unicode code point.
It is not clear (to me) what you start with: a string like "&h41 &h42 &h43" or "ABC".
If it is the latter (as I think), do as Kem has advised and use DefineEncoding. Then you could use ToText and then Codepoints from the new framework.
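Something along these lines (a sketch, not tested):

dim s as string = ChrB(&hE2) + ChrB(&h84) + ChrB(&hA2)
s = s.DefineEncoding(Encodings.UTF8)
dim t as text = s.ToText
for each cp as uint32 in t.Codepoints
  MsgBox Hex(cp) // 2122
next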
A clarification here: "Asc" is actually a misnomer dating back to the days before Unicode. These days, Asc returns the Unicode code point of the first character of a string, as determined by that string's encoding, while AscB returns the value of the first byte of the string.
Edit: Actually, if the string has a single-byte encoding, I think Asc and AscB will return the same value. Only with a UTF-encoded string will Asc return the code point, but I haven't tested this and am too tired to go check the docs.
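For a UTF-8 string the difference would look like this (a quick sketch, assuming the literal is stored as UTF-8):

dim s as string = "™" // bytes E2 84 A2
MsgBox Str(Asc(s))  // 8482, i.e. &h2122, the code point
MsgBox Str(AscB(s)) // 226, i.e. &hE2, the first byte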
s = s.DefineEncoding(Encodings.UTF8)
dim t as string
for i as integer = 1 to Len(s)
  dim c as string = Mid(s, i, 1)
  if LenB(c) = 1 then // i.e. c.Asc <= &h7F
    t = t + c
  else
    t = t + "\" + Oct(c.Asc)
  end if
next i
return t
yes… I understand the "B" variants…
Len is the length of a string in CHARACTERS, LenB is the length in BYTES… they may or may not be the same.
Similar for Chr/ChrB and Asc/AscB.
So in my code… LenB(c) should be 1 for each "|" and 2 for most of the rest of the characters (3 in the case of ™),
but it is indicating that it thinks the "|" (0x7C) is 2 bytes, and it is not.
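If that happens, one thing worth checking (just a guess) is whether s still carries the encoding you expect at that point:

if s.Encoding is nil then
  MsgBox "s has a nil encoding, so the character-based functions can't interpret the bytes"
end if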
s = s.DefineEncoding( Encodings.UTF8 )
dim chars() as string = s.Split( "" ) // split into individual characters
for i as integer = 0 to chars.Ubound
  dim cp as integer = chars( i ).Asc // the Unicode code point, not the first byte
  if cp > &h7F then
    chars( i ) = "\" + Oct( cp ) // replace non-ASCII characters with their octal escape
  end if
next i
dim t as string = Join( chars, "" )
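For example, with an input of "A™B" (just an illustration):

dim s as string = "A™B"
// run the loop above
// t is now "A\20442B", since Oct(&h2122) = 20442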