Conversion Check

Hi,

Sadly, I’m not familiar with PHP. I tried to translate this code to Xojo, but I don’t get any results. Where is the error in my code?

Call

MessageBox(UTF8StringToArray("ประเทศไทย")) ' Thai: "Thailand"

PHP (Original)

// Converts UTF-8 strings to codepoints array
protected function UTF8StringToArray($str) {
    $out = array();
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $uni = -1;
        $h = ord($str[$i]);
        if ( $h <= 0x7F )
            $uni = $h;
        elseif ( $h >= 0xC2 ) {
            if ( ($h <= 0xDF) && ($i < $len -1) )
                $uni = ($h & 0x1F) << 6 | (ord($str[++$i]) & 0x3F);
            elseif ( ($h <= 0xEF) && ($i < $len -2) )
                $uni = ($h & 0x0F) << 12 | (ord($str[++$i]) & 0x3F) << 6
                                         | (ord($str[++$i]) & 0x3F);
            elseif ( ($h <= 0xF4) && ($i < $len -3) )
                $uni = ($h & 0x0F) << 18 | (ord($str[++$i]) & 0x3F) << 12
                                         | (ord($str[++$i]) & 0x3F) << 6
                                         | (ord($str[++$i]) & 0x3F);
        }
        if ($uni >= 0) {
            $out[] = $uni;
        }
    }
    return $out;
}

Xojo

Function UTF8StringToArray(value As String) As String
  ' Converts UTF-8 strings to codepoints array

  Var out() As String
  Var length As Integer = value.Length
  Var uni, h, c As Integer

  For i As Integer = 0 To length - 1
  
    c = value.Middle(i + 1, 1).Asc
    uni = -1
    h = value.Middle(i, 1).Asc
  
    If h <= &h7F Then  
      uni = h
    
    Elseif h >= &hC2 Then
    
      If h <= &hDF And i < length - 1 Then      
        uni = Bitwise.ShiftLeft(Bitwise.BitAnd(h, &h1F), 6) Or _
            Bitwise.BitAnd(c, &h3F)
      
      Elseif h <= &hEF And i < length - 2 Then  
        uni = Bitwise.ShiftLeft(Bitwise.BitAnd(h, &h0F), 12) Or _
            Bitwise.ShiftLeft(Bitwise.BitAnd(c, &h3F), 6) Or _
            Bitwise.BitAnd(c, &h3F)
      
      Elseif h <= &hF4 And  i < length - 3 Then
        uni = Bitwise.ShiftLeft(Bitwise.BitAnd(h, &h0F), 18) Or _
            Bitwise.ShiftLeft(Bitwise.BitAnd(c, &h3F), 12) Or _
            Bitwise.ShiftLeft(Bitwise.BitAnd(c, &h3F), 6) Or _
            Bitwise.BitAnd(c, &h3F)
      
      End If
    
      If uni >= 0 Then
        out.Add(uni.ToString)     
      End If
    
    End If
  
  Next

  Return String.FromArray(out, " ")
End Function

Maybe @Kem_Tekinay, @MVP, or @Xojo can help?

How about…

var s as String = "ประเทศไทย"
var chars() as String = s.Split( "" )

What does that change about the algorithm? Little, I think, since the algorithm obviously also reads the bytes at i + 2 and i + 3 within a character.

Maybe I don’t understand the problem. What are you trying to do, and where is the issue you encounter? I’ve taken your function, tossed it out, and used Xojo’s Split function to entirely replace it. More info?

Except the return, perhaps.

It is about converting a UTF-8 string into valid PDF code.

There is the FPDF (http://www.fpdf.org) port RSPDF here: https://github.com/roblthegreat/rsfpdf

For FPDF there is an extension that lets the library handle UTF-8 strings as well. For this, three methods were added to the PHP library, namely the following, and I am trying to port them to Xojo so that RSPDF also supports UTF-8.

protected function _UTF8toUTF16($s)
{
	// Convert UTF-8 to UTF-16BE with BOM
	$res = "\xFE\xFF";
	$nb = strlen($s);
	$i = 0;
	while($i<$nb)
	{
		$c1 = ord($s[$i++]);
		if($c1>=224)
		{
			// 3-byte character
			$c2 = ord($s[$i++]);
			$c3 = ord($s[$i++]);
			$res .= chr((($c1 & 0x0F)<<4) + (($c2 & 0x3C)>>2));
			$res .= chr((($c2 & 0x03)<<6) + ($c3 & 0x3F));
		}
		elseif($c1>=192)
		{
			// 2-byte character
			$c2 = ord($s[$i++]);
			$res .= chr(($c1 & 0x1C)>>2);
			$res .= chr((($c1 & 0x03)<<6) + ($c2 & 0x3F));
		}
		else
		{
			// Single-byte character
			$res .= "\0".chr($c1);
		}
	}
	return $res;
}

// ********* NEW FUNCTIONS *********
// Converts UTF-8 strings to UTF16-BE.
protected function UTF8ToUTF16BE($str, $setbom=true) {
	$outstr = "";
	if ($setbom) {
		$outstr .= "\xFE\xFF"; // Byte Order Mark (BOM)
	}
	// mb_convert_encoding = ConvertEncoding(Encodings.UTF16BE)
	$outstr .= mb_convert_encoding($str, 'UTF-16BE', 'UTF-8');
	return $outstr;
}

// Converts UTF-8 strings to codepoints array
protected function UTF8StringToArray($str) {
    $out = array();
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $uni = -1;
        $h = ord($str[$i]);
        if ( $h <= 0x7F )
            $uni = $h;
        elseif ( $h >= 0xC2 ) {
            if ( ($h <= 0xDF) && ($i < $len -1) )
                $uni = ($h & 0x1F) << 6 | (ord($str[++$i]) & 0x3F);
            elseif ( ($h <= 0xEF) && ($i < $len -2) )
                $uni = ($h & 0x0F) << 12 | (ord($str[++$i]) & 0x3F) << 6
                                         | (ord($str[++$i]) & 0x3F);
            elseif ( ($h <= 0xF4) && ($i < $len -3) )
                $uni = ($h & 0x0F) << 18 | (ord($str[++$i]) & 0x3F) << 12
                                         | (ord($str[++$i]) & 0x3F) << 6
                                         | (ord($str[++$i]) & 0x3F);
        }
        if ($uni >= 0) {
            $out[] = $uni;
        }
    }
    return $out;
}

What is a “valid PDF code”?

Looks to me like the method wants to go through the bytes, check that the byte plus following ones make a UTF-8 character, and then store that in the array. At the end you join these characters together with each one separated by a space.

But while PHP deals with bytes in a string, Xojo deals with characters, so I’m not sure you can directly rewrite this PHP code as Xojo. And as @Anthony_G_Cyphers says, this can be done with split().

I have a method which validates that a string is UTF-8; the first thing I do is use SplitBytes() on it.
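To make the byte-versus-character difference concrete, here is a small sketch in Python (chosen only because it is easy to run; it is not Xojo code, and the variable names are made up for illustration). It shows that the sample string has far more UTF-8 bytes than characters, and that the first character's code point is much larger than the first byte. If `.Asc` in the Xojo port returns the code point rather than a byte (an assumption, not verified here), the value for a Thai character is well above &hF4, so none of the lead-byte branches would match, which would be consistent with getting no results.

```python
text = "ประเทศไทย"  # the sample string from the question

characters = list(text)           # what a character-oriented string API iterates over
raw_bytes = text.encode("utf-8")  # what PHP's strlen() and ord($str[$i]) operate on

print(len(characters))  # number of code points
print(len(raw_bytes))   # number of UTF-8 bytes (each Thai character takes three)

first_code_point = ord(characters[0])  # a value in the Thai block, far above 0xF4
first_byte = raw_bytes[0]              # a 3-byte lead byte in the 0xE0..0xEF range
print(hex(first_code_point), hex(first_byte))
```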

As far as I understand, PDF raw data consists only of characters in the ASCII range, so it is a matter of converting Unicode strings into PDF-readable ASCII encodings. But maybe @Javier_Menendez knows more about this!
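As one illustration of an ASCII-only representation: the PDF specification allows a string to be written as hexadecimal digits between angle brackets, and text strings such as document metadata may be UTF-16BE with a leading FE FF byte order mark. The Python sketch below combines the two; the function name `pdf_hex_text_string` is hypothetical, and whether this is sufficient for text drawn on a page with a particular font is a separate question that the sketch does not address.

```python
def pdf_hex_text_string(text: str) -> str:
    """Return text as a PDF hex string holding UTF-16BE with a BOM (illustrative only)."""
    data = b"\xfe\xff" + text.encode("utf-16-be")  # BOM followed by big-endian UTF-16
    return "<" + data.hex().upper() + ">"          # hex digits keep the output pure ASCII


print(pdf_hex_text_string("A"))
print(pdf_hex_text_string("ประเทศไทย"))
```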

Aha, that information already helps. Then it would probably be smart to put the string into a MemoryBlock and use the MemoryBlock to loop through the bytes, right?

You could do that. As I said, in my method I use SplitBytes and produce a String array from that. Then I treat each element of this array as a byte for masking to ensure it’s within range.

If I’m reading it correctly, your PHP function is interpreting the bytes of a UTF-8-encoded string to return an array of the Unicode code points it represents.

Xojo, being a Unicode-aware language, makes it unnecessary to jump through these hoops, so your function is as simple as this:

Function ToCodePoints (Extends s As String) As Integer()
  var arr() as integer
  for each char as string in s.Split( "" )
    arr.Add char.Asc
  next
  return arr
End Function

(Untested.)
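For comparison, the byte-level branches of the PHP function can be sketched in Python and checked against the language's own code point lookup. This is only a sketch under the assumption that the input is well-formed UTF-8; the name `utf8_bytes_to_code_points` is made up here, and like the PHP original it does not validate continuation bytes.

```python
def utf8_bytes_to_code_points(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into code points, following the PHP function's branches."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        h = data[i]   # lead byte
        uni = -1
        if h <= 0x7F:                        # 1-byte (ASCII)
            uni = h
        elif h >= 0xC2:
            if h <= 0xDF and i < n - 1:      # 2-byte sequence
                uni = (h & 0x1F) << 6 | (data[i + 1] & 0x3F)
                i += 1
            elif h <= 0xEF and i < n - 2:    # 3-byte sequence
                uni = ((h & 0x0F) << 12 | (data[i + 1] & 0x3F) << 6
                       | (data[i + 2] & 0x3F))
                i += 2
            elif h <= 0xF4 and i < n - 3:    # 4-byte sequence
                uni = ((h & 0x0F) << 18 | (data[i + 1] & 0x3F) << 12
                       | (data[i + 2] & 0x3F) << 6 | (data[i + 3] & 0x3F))
                i += 3
        if uni >= 0:
            out.append(uni)
        i += 1        # move past the last byte consumed
    return out


sample = "ประเทศไทย"
print(utf8_bytes_to_code_points(sample.encode("utf-8")) == [ord(c) for c in sample])
```

The point of the comparison is that the manual decoding and the per-character lookup should agree, which is why a character-aware language can skip the bit masking entirely.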

I may be missing something, but it seems this whole exercise is to convert a UTF8 string to a UTF16BE string?

Why not use ConvertEncoding? http://documentation.xojo.com/api/text/encoding_text/convertencoding.html
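As a rough illustration of what a built-in conversion does, here is a Python sketch that mirrors the PHP `UTF8ToUTF16BE($str, $setbom)` helper. The function name `utf8_to_utf16be` is made up for this example, and Python's `str.encode` stands in for `mb_convert_encoding` or Xojo's ConvertEncoding. One difference worth noting: the hand-written `_UTF8toUTF16` in the PHP listing only has branches for sequences of up to three bytes, whereas a library conversion also produces surrogate pairs for characters outside the Basic Multilingual Plane.

```python
def utf8_to_utf16be(text: str, set_bom: bool = True) -> bytes:
    """Encode text as UTF-16BE, optionally prefixed with the FE FF byte order mark."""
    bom = b"\xfe\xff" if set_bom else b""
    return bom + text.encode("utf-16-be")


print(utf8_to_utf16be("A").hex())
print(utf8_to_utf16be("ประเทศไทย", set_bom=False).hex())
```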

I just ran the PHP function and it does as I described, returns an array of Unicode code points.

Humph. Perhaps the Asc() function needs a new name, such as ToUnicodeValue(). I wouldn’t have automatically turned to Asc() to give me the Unicode code point value for a character.

So, I did some more research on this. It looks like the code only works for characters in the range 0..255, but not for multibyte characters. This seems to be due to the PDF standard, which requires you to generate CMap font definitions for multibyte characters (Japanese, Thai, emojis, etc.). All very, very complicated. I very much hope that @Xojo can offer us something like this for PDFDocument soon.