Convert UTF-8 Hex value to UTF-8 code point

DaveS · October 17, 2016, 1:18am

I have string data that contains UTF-8 characters that are 2 or 3 bytes in size.

For example, I might get a byte sequence of 0xE2 0x84 0xA2.
When I check LENB(s) I get 3, so I know which characters I need to deal with,
but how can I convert those 3 bytes to the proper Unicode code point,
which in this case is 0x2122?

http://www.fileformat.info/info/unicode/char/2122/index.htm

Note that this character is just an example… I have no idea what the incoming text might contain.

DaveS · October 17, 2016, 2:07am

OK… I found the equation, but I'm not sure how to extract the AscB value of each byte within the char.

here is what I have

  For i=Len(s) DownTo 1
    c=Mid(s,i,1)
    Select Case LenB(c)
    Case 1
      Continue
    Case 2 // 110yyyyy 10xxxxxx -> 00000yyy yyxxxxxx
      u=((AscB(Left(c,1)) And &h1F)*64)+(AscB(Right(c,1)) And &h3F)
    Case 3 // 1110zzzz 10yyyyyy 10xxxxxx  -> zzzzyyyy yyxxxxxx  
      u=((AscB(Left(c,1)) And &h0F)*4096)+((AscB(Mid(c,2,1)) And &H3f)*64)+(AscB(Right(c,1)) And &H3f)
     
    End Select
    
     MsgBox Str(LenB(c))+"="+Hex(u)+":"+s
  Next i

But LEFT, RIGHT and MID don’t work on the bytes of a single multibyte Unicode character: they count characters, not bytes.
So LEN( c ) = 1 and LENB( c ) = 2 (or 3).
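
For reference, one way the formula above could be made to work is to read the raw bytes with MidB, which indexes by byte rather than by character. This is only a sketch: the function name CodePointFromUTF8 is hypothetical, it assumes c holds exactly one well-formed UTF-8 character, and it does not handle 4-byte sequences.

```xojo
Function CodePointFromUTF8(c As String) As Integer
  // Hypothetical helper: decode one UTF-8 character (1 to 3 bytes).
  // MidB and AscB work on bytes, regardless of the string's encoding.
  Dim b1 As Integer = AscB(MidB(c, 1, 1))
  Select Case LenB(c)
  Case 1 // 0xxxxxxx
    Return b1
  Case 2 // 110yyyyy 10xxxxxx
    Return ((b1 And &h1F) * 64) + (AscB(MidB(c, 2, 1)) And &h3F)
  Case 3 // 1110zzzz 10yyyyyy 10xxxxxx
    Return ((b1 And &h0F) * 4096) + ((AscB(MidB(c, 2, 1)) And &h3F) * 64) + (AscB(MidB(c, 3, 1)) And &h3F)
  End Select
  Return -1 // longer sequences are not handled in this sketch
End Function
```

For the bytes 0xE2 0x84 0xA2 the arithmetic is (2 * 4096) + (4 * 64) + 34 = 8482, which is 0x2122.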

Kem_Tekinay · October 17, 2016, 3:50am

I might be missing something, but why not just define the encoding of the string as UTF-8 and let Xojo tell you the character’s code?

DaveS · October 17, 2016, 4:14am

That was the first thing I thought to do, and the first thing I did
Just prior to the above snippet is

s=s.DefineEncoding(Encodings.UTF8)

and it did not change the character bytes in the string

Kem_Tekinay · October 17, 2016, 4:41am

DefineEncoding is not supposed to change the bytes of the string, nor should it in this case. The bytes, as you describe them, are valid UTF-8 encoding so DefineEncoding just tells the Xojo framework to use that encoding to interpret them. Once the encoding is known, the framework will interpret the bytes properly.

To put this into code:

  dim s as string
  
  s = ChrB( &hE2 ) + ChrB( &h84 ) + ChrB( &hA2 )
  // s has a nil encoding, Xojo thinks these are just bytes
  s = s.DefineEncoding( Encodings.UTF8 )
  // s's encoding is now properly defined
  MsgBox s // shows ™

Or store the ™ symbol in a string and examine its bytes in the debugger. You’ll find they match.

DaveS · October 17, 2016, 4:58am

you are missing the intent… I NEED the code point to place in another file (that will NOT be processed by Xojo)

s = ChrB( &hE2 ) + ChrB( &h84 ) + ChrB( &hA2 )
is NOT acceptable to the destination, but
s=ChrB(&h21)+ChrB(&h22), or actually the string value “\20442” (the OCTAL value of the code point),
IS acceptable.

So I have to translate from the UTF-8 bytes to the Unicode code point.

Kem_Tekinay · October 17, 2016, 5:02am

I don’t think I’ve missed the point.

s = YourRawBytes
s = s.DefineEncoding( Encodings.UTF8 )
dim codePoint as integer = s.Asc
dim octValue as string = oct( codePoint )

Eli_Ott · October 17, 2016, 5:04am

It is not clear (to me) what you start with: a string like “&h41 &h42 &h43” or “ABC”.

If it is the latter (as I think), do as Kem has advised and use DefineEncoding. Then you could use ToText and then Codepoints from the new framework.

Kem_Tekinay · October 17, 2016, 5:06am

A clarification here: “Asc” is actually a misnomer dating back to the days before Unicode. These days, Asc will return the Unicode codepoint of the first character of a string as determined by that string’s encoding. AscB will return the value of the first byte of the string.

Edit: Actually, if it’s a single-byte encoding, I think Asc and AscB will return the same value. Only with a UTF-encoded string will it return the codepoint, but I haven’t tested this and am too tired to go check the docs.
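
A small sketch of the difference for the ™ bytes from earlier in the thread, assuming the string's encoding has been defined as UTF-8 (the variable names are only for illustration):

```xojo
Dim s As String = ChrB(&hE2) + ChrB(&h84) + ChrB(&hA2)
s = s.DefineEncoding(Encodings.UTF8)

Dim firstByte As Integer = AscB(s) // value of the first byte, &hE2
Dim codePoint As Integer = Asc(s)  // code point of the first character, &h2122

// Show both values in hex, plus the octal form of the code point
MsgBox Hex(firstByte) + " / " + Hex(codePoint) + " / " + Oct(codePoint)
```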

Eli_Ott · October 17, 2016, 5:09am

  Dim s As String = "™"
  
  Dim t As Text = s.ToText
  
  For Each cp As Integer In t.Codepoints
    // cp is 8482
    Dim hex As Text = "0x" + cp.ToHex
    Break
  Next

Kem_Tekinay · October 17, 2016, 5:23am

I think we cross-posted, Eli. Converting to Text is not needed in this case, as described.

DaveS · October 17, 2016, 5:55am

then I am missing something

  s=s.DefineEncoding(Encodings.UTF8)
  For i=1 to Len(s)
    c=Mid(s,i,1)
    if lenb( c ) =1 then //c.asc<&H7f then 
      t=t+c
    else
      t=t+"\"+oct(c.asc)
    end if
  next i
  return t

s="|¡|™|£|¢|∞|§|¶|•|ª|º|–|·|ª|"

comes out as:

\174|\20442\174\174|\21036\174\174|\20042\174â

DaveS · October 17, 2016, 6:00am

Yes… I understand the “B” variants…
LEN is the length of the string in CHARACTERS, LENB is the length in BYTES… they may or may not be the same.
The same goes for CHR/CHRB and ASC/ASCB.

So in my code, LENB( c ) should be 1 for each “|” and 2 for “most” of the rest of the characters (3 in the case of ™),
but the output indicates it thinks the “|” (0x7c) is 2 bytes, and it is not.

Kem_Tekinay · October 17, 2016, 12:45pm

We have some disconnect. I ran your exact code and got this:

|\241|\20442|\243|\242|\21036|\247|\266|\20042|\252|\272|\20023|\267|\252|

Does s have some additional bytes that are not UTF-8?
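
One way to check would be to dump the raw bytes of s in hex before DefineEncoding is applied and compare them with what UTF-8 should contain. A rough sketch, where dump is just an illustrative name:

```xojo
// Collect each byte of s as a two-digit hex value
Dim dump() As String
For i As Integer = 1 To LenB(s)
  dump.Append Right("0" + Hex(AscB(MidB(s, i, 1))), 2)
Next i
MsgBox Join(dump, " ")
```

If s is clean UTF-8, "|" should appear as the single byte 7C, "¡" as C2 A1 and "™" as E2 84 A2. Anything else would suggest the data arrived in a different encoding or was converted somewhere along the way.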

Kem_Tekinay · October 17, 2016, 1:01pm

Also, this would be significantly faster:

  s = s.DefineEncoding( Encodings.UTF8 )
  dim chars() as string = s.Split( "" )
  for i as integer = 0 to chars.Ubound
    dim cp as integer = chars( i ).Asc
    if cp > &h7f then
      chars( i ) = "\" + oct( cp )
    end if
  next i
  
  dim t as string = join( chars, "" )