How convert string to Unicode?

TimStreater · March 11, 2022, 7:08pm

Where do your text strings come from?

Denis_DUBOIS · March 11, 2022, 7:13pm

File : com.apple.screencapture.plist

Write the file :
Shell1.Execute ("defaults write com.apple.screencapture name " + TextFieldName.Text)
Read the file :
Shell1.Execute (“defaults read com.apple.screencapture name”)

French accents are processed in unicode codes (double slash+U+ unicode) in the name key (name of the file created).
I want to be able to read the string in clear and write it with these unicode codes.

Michel_Bujardet · March 11, 2022, 7:14pm

We already had a discussion in the French channel, got nowhere. Apparently Denis does not want to explain what he is doing, and how he gets to that data. If he posted his code, that would be a great help to understand what he is after.

Denis_DUBOIS · March 11, 2022, 7:15pm

It’s done!

Rick_Araujo · March 11, 2022, 7:22pm

Now I get. He’s reading it and getting the unicode C escaped from the “defaults” app and want it re-encoded to UTF-8 to handle it in Xojo.

You need to write your “recomposer”. Something that reads your string, find the escapes “\UHHHH” and substitute them by the proper &uHHHH in Xojo (I mean, the proper codepoint).

Rick_Araujo · March 11, 2022, 9:07pm

This can help

//  EscapedUnicodeToUTF8("d\U2019e\U0301cran") -> "d’écran"
Private Function EscapedUnicodeToUTF8(escapedString As String) As String

  Var UtfString  As String = escapedString
  
  Var re As new RegEx
  Var match As RegExMatch
  
  re.SearchPattern = "\\[Uu][0-9a-fA-F]{4,4}"
  
  match = re.Search(UtfString)
  
  Do until match = Nil
    Var found, code As String
    found = match.SubExpressionString(0)
    code = Text.FromUnicodeCodepoint(Integer.FromHex(found.Middle(2)))
    UtfString = UtfString.Replace(found, code)
    match = re.Search(UtfString)
  Loop
  
  Return UtfString
  
End Function

Kem_Tekinay · March 11, 2022, 9:20pm

Or this:

Public Function ConvertUnicodeEncoded(source As String) As String
  if source.Bytes < 6 then
    //
    // Not big enough
    //
    return source
  end if
  
  var sourceChars() as string = source.Split( "" )
  var destChars() as string
  
  for i as integer = 0 to sourceChars.LastIndex
    var sourceChar as string = sourceChars( i )
    var destChar as string = sourceChar
    
    if sourceChar = "\" and ( sourceChars.LastIndex - i ) >= 6 then
      var nextChar as string = sourceChars( i + 1 )
      
      if StrComp( nextChar, "U", 0 ) = 0 then
        var isGood as boolean = true
        
        var hexArr() as string
        for hexIndex as integer = i + 2 to i + 5
          var hexChar as string = sourceChars( hexIndex )
          if hexChar >= "a" and hexChar <= "f" or hexChar >= "0" and hexChar <= "9" then
            hexArr.Add hexChar
          else
            isGood = false
            exit for hexIndex
          end if
        next
        
        if isGood then
          var hexString as string = "&h" + String.FromArray( hexArr, "" )
          var codeValue as integer = hexString.ToInteger
          destChar = chr( codeValue )
          i = i + 5
        end if
      end if
    end if
    
    destChars.Add destChar
  next
  
  return String.FromArray( destChars, "" )
  
End Function

Kem_Tekinay · March 11, 2022, 9:26pm

But actually, both solutions suffer from the same problem: escaped slashes.

Consider the original string: “Use \U301 for the accent”. Encoded that would (probably) be “Use \U301 for the accent”. Both solutions would mess that up.

(Or maybe the slash would be converted to “\U005C” and it won’t be an issue.)

Rick_Araujo · March 11, 2022, 10:31pm

It will be escaped as \\U0301, and a solution can be

Private Function EscapedUnicodeToUTF8(escapedString As String) As String

  Const escapeDoubleBackslashes As String = &uFFF9+"\"+&uFFFb
  
  Var UtfString  As String = escapedString.ReplaceAll("\\", escapeDoubleBackslashes)
  
  Var re As new RegEx
  Var match As RegExMatch
  
  re.SearchPattern = "\\[Uu][0-9a-fA-F]{4,4}"
  
  match = re.Search(UtfString)
  
  Do until match = Nil
    Var found, code As String
    found = match.SubExpressionString(0)
    code = Text.FromUnicodeCodepoint(Integer.FromHex(found.Middle(2)))
    UtfString = UtfString.Replace(found, code)
    match = re.Search(UtfString)
  Loop
  
  Return UtfString.ReplaceAll(escapeDoubleBackslashes, "\")
  
End Function

Arnaud_N · March 12, 2022, 8:41am

Beside being “right” (probably just because they are commonly used), wouldn’t it be time to remove one of these two ways to simplify things?

TimStreater · March 12, 2022, 8:53am

How do you propose to do that?

It could be that composing characters are essential in some non-Western languages - I don’t know enough about it. Apple seems to use decomposed characters, so that for example é is always made with the letter e followed by U+0301 (so, 65 CC 81 in UTF8). The composed é is U+00E9 or C3 A9 in UTF8.

You note that the decomposed é is three bytes in UTF8, and the composed é is two bytes. I don’t know why Apple does this but it’s perfectly valid UTF8 and all rendering engines should show the result as é. If they don’t then they are buggy. Personally I wish they hadn’t, but perhaps there are reasons I’m unaware of.

Arnaud_N · March 12, 2022, 9:08am

I’d say my propositions, if any, wouldn’t be smart. It’s up to the ones having the patent to make the changes. But discussing that is good.

I don’t know either, but since all composing characters can be made of decomposed ones, the OS could fairly well always use the decomposed version.

On my keyboard (Swiss-French), I can make “é” using both ways. There’s a straight “é” key but I can also use option+key-next-to-backspace (´) followed by “e” (é). I’ve never checked whether both ways lead to the same character. Could be interesting, I’ll check when I’ve time.

That’s often what we call “for compatibility purposes” (possibly). Both ways have existed since the beginning and it’s now too late to remove one easily; who knows whether that was a mistake in the first place and would have been better if the chose only the decomposed way?

Michel_Bujardet · March 12, 2022, 9:25am

Could you post your present code ?

I just created a field with accented characters in XCode:

Then I fed the saved plist into an Hex editor:

Amazingly enough, no backslash escape character there. Instead,
√©√®√ß√†√π√¢√™√Æ√¥√ª√§√´√Ø√∂√º

That is why it is important to know the method used to write to the plist.

Without knowing that, this discussion may get nowhere.

Denis_DUBOIS · March 12, 2022, 10:04am

I try it but it just removes an antislash but does not convert unicodes characters of the accents !

s = EscapedUnicodeToUTF8(TextFieldName.Text)
MessageBox(s)

Denis_DUBOIS · March 12, 2022, 10:12am

Finalement Michel, ca me semble trop complexe. Je me demande si je ne vais pas simplement “parser” la chaine caractère par caractère (dans un grand Case) et substituer les quelques caractères accentués par leur équivalent unicode. Et faire le contraire dans l’autre sens (moins évident).
du type : if machaine contient “é” : alors machaine = machaine + “\U0301”, etc.

Denis_DUBOIS · March 12, 2022, 10:28am

We are not far away but it is not the right conversion. We have the apostrophe but not the emphasis on the “é”

Denis_DUBOIS · March 12, 2022, 10:30am

for info, result must be : “Capture d’écran”

TimStreater · March 12, 2022, 10:57am

Your hex editor is not correctly rendering the characters. Did you tell it that the content is UTF8? You see after the bytes for <string>, there is C3 A9, C3 A8 which are the composed UTF8 versions of é and è. But your hex editor is displaying such as √@ etc instead.

Michel_Bujardet · March 12, 2022, 11:40am

You are right. But in UTF8 No character is displayed. Probably the Hex editor does not support composited.

Rick_Araujo · March 12, 2022, 12:43pm

Why are you adding a second backslash? The function is treating correctly what you wrote.

EscapedUnicodeToUTF8("d\U2019e\U0301cran") -> "d’écran"

And correctly also

EscapedUnicodeToUTF8("d\\U2019e\\U0301cran") -> "d\U2019e\U0301cran"

Function code: How convert string to Unicode? - #29 by Rick_A