How to convert a string to Unicode?

TimStreater · March 12, 2022, 12:51pm

Ha, I bet that is Hex Fiend you are using. I just tried it too.

I typed in 3E C3 A9 C3 A8 C3 A7 65 CC 81 - and it displays like yours. But if you then select that hex and choose UTF-8 from the little menu at the bottom of the window, it displays it as UTF-8. I dunno why the right-hand column ignores all attempts to change the encoding it uses to display the hex.
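For illustration, here is a minimal Python sketch (Python rather than Xojo, only because it makes a quick check easy) that decodes the ten bytes quoted above in two ways. The Latin-1 decoding is an assumption used for contrast; nothing in this thread establishes which encoding the hex editor's text column actually uses.

```python
# The ten bytes typed into the hex editor, as quoted above.
raw = bytes.fromhex("3E C3 A9 C3 A8 C3 A7 65 CC 81")

# Decoded as UTF-8: multi-byte sequences collapse into single code points.
as_utf8 = raw.decode("utf-8")

# Decoded as Latin-1 (assumed here for contrast): every byte becomes one character.
as_latin1 = raw.decode("latin-1")

print(as_utf8)
print(len(as_utf8))    # 6 code points: the last two are "e" + U+0301 COMBINING ACUTE ACCENT
print(len(as_latin1))  # 10 characters, one per byte
```

Note that the last two code points are a decomposed "é" (a plain "e" followed by a combining accent), while the first "é" in the sequence is the precomposed form.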

Kem_Tekinay · March 12, 2022, 12:55pm

Your input there has double slashes which, as mentioned, my code does not handle. It’s an easy fix though, if you want to tackle it.

TimStreater · March 12, 2022, 12:57pm

This may be of interest:

In article <j935p4F3jpeU1@mid.individual.net>, TimS <tim@example.com> wrote:

Anyone here know why Apple seems to use decomposed UTF8 when making characters with accents?

I don’t know what Apple’s policy is, but there’s no good single solution to this.

One problem is that there are a vast number of possible combinations of letters and accents (more generally, diacritics), and only some of them exist as precomposed characters in Unicode. This means that to produce the composed normal form, a program (or library) has to know which ones exist as composed characters. On the other hand, there can be characters with more than one diacritic. With decomposed characters, in which order should the accents appear? Some pairs of diacritics have a visible order: in some languages a character may have both a circumflex and an acute accent, and it’s necessary to specify which way round they go. In other cases they don’t, such as a circumflex and a cedilla, which by their nature go on the top and bottom (I doubt that particular example really occurs). Unicode therefore defines combining classes and a canonical ordering for combining characters to make it easier to compare two strings.
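The points about composition and canonical ordering can be checked with a short Python sketch using the standard `unicodedata` module (Python is used here only for illustration; the variable names are made up for this example).

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# NFC composes where a precomposed character exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Cedilla (below) and circumflex (above) have different combining classes,
# so canonical ordering puts them in one fixed order regardless of input order.
below_then_above = unicodedata.normalize("NFD", "c\u0327\u0302")
above_then_below = unicodedata.normalize("NFD", "c\u0302\u0327")

# Circumflex and acute share a combining class, so their order is significant
# and normalization keeps the two sequences distinct.
circ_acute = unicodedata.normalize("NFD", "a\u0302\u0301")
acute_circ = unicodedata.normalize("NFD", "a\u0301\u0302")

print(nfc == composed, nfd == decomposed)
print(below_then_above == above_then_below)
print(circ_acute == acute_circ)
```

Comparing two strings after normalizing both to the same form is what makes the composed and decomposed spellings compare equal.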

Rick_Araujo · March 12, 2022, 1:03pm

The ICU lib, that Xojo uses, can handle all those. Composition, decomposition, compare order (collation)…

Denis_DUBOIS · March 12, 2022, 2:15pm

Thank you, I will test.
In the .plist file the Unicode escapes are each written with two backslashes (but only one is displayed in the forum, I don’t know why! Sorry). Can you make the change, please?

Denis_DUBOIS · March 12, 2022, 2:33pm

@Rick_Araujo The winner! (Just add support for two backslashes, please.)

Thank you very very much.

Rick_Araujo · March 12, 2022, 2:36pm

This is not a competition. I do those things (sometimes) because I can, not for collecting likes.

But, yes, I’ll modify it to your needs.

Rick_Araujo · March 12, 2022, 2:59pm

Private Function UnicodeEscapedStringToUTF8(escapedString As String, useDoubleBackslashes As Boolean = True) As String
  
  // Convert the output of a macOS "defaults read whatever" to proper UTF-8.
  // Unicode chars are encoded as "\\Uxxxx", where xxxx is the hex code point.
  // Pass useDoubleBackslashes = False to process the C-style "\Uxxxx" form instead.
  
  // Placeholder that protects escaped backslashes while the \U sequences are replaced
  Const escapeDoubleBackslashes As String = &uFFF9+"\"+&uFFFb
  
  Var UtfString As String = escapedString.ReplaceAll( _
    If(useDoubleBackslashes, "\\\\", "\\"), escapeDoubleBackslashes _
    )
  
  Var re As New RegEx
  Var match As RegExMatch
  
  // Backslash(es), then U or u, then exactly four hex digits
  re.SearchPattern = If(useDoubleBackslashes, "\\\\", "\\") + "[Uu][0-9a-fA-F]{4,4}"
  
  match = re.Search(UtfString)
  
  // Replace one escape at a time, searching again after each replacement
  Do Until match = Nil
    Var found, code As String
    found = match.SubExpressionString(0)
    code = Text.FromUnicodeCodepoint(Integer.FromHex(found.Right(4)))
    UtfString = UtfString.Replace(found, code)
    match = re.Search(UtfString)
  Loop
  
  // Restore the protected backslashes as single literal backslashes
  Return UtfString.ReplaceAll(escapeDoubleBackslashes, "\")
  
End Function
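For comparison, a minimal Python sketch of the double-backslash case follows. It is not a translation of the Xojo function: instead of a placeholder string it uses a single regular-expression pass with alternation, and the function name is made up for this example. Like the function above, it treats each four-digit escape on its own and does not join surrogate pairs.

```python
import re

# Either four literal backslashes (an escaped backslash in this format),
# or two literal backslashes, "U" or "u", and exactly four hex digits.
_ESCAPE = re.compile(r"(\\\\\\\\)|\\\\[Uu]([0-9a-fA-F]{4})")


def unescape_double_backslash_u(text: str) -> str:
    """Replace \\\\Uxxxx escapes with the characters they name."""

    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            return "\\"  # escaped backslash becomes one literal backslash
        return chr(int(match.group(2), 16))

    return _ESCAPE.sub(replace, text)


sample = r"Capture d'\\U00e9cran \\U00e0 Denis"
print(unescape_double_backslash_u(sample))
```

Because the alternation consumes an escaped backslash before the `U` branch can see it, no placeholder character is needed.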

Denis_DUBOIS · March 12, 2022, 3:54pm

Thank you very much !!!

Denis_DUBOIS · March 13, 2022, 9:14am

Very sorry, I have a little problem! The variable “ur” does not return a string identical to the plist!

Shell1.Execute ("defaults read com.apple.screencapture name")
//ud is CFString not a String
ud = Shell1.Result
//MessageBox(ud)
TextFieldName.Text = EscapedUnicodeToUTF8(ud, true)

TextFieldName.Text = « Capture d’écran à Denis »
.plist ‘name key’ = “Capture d’\U00e9cran \U00e0 Denis”; <—this has been converted
variable ud returns = Capture d’\351cran \340 Denis !!! <----- But these are the escapes that need to be translated!

Rick_Araujo · March 13, 2022, 11:15am

This isn’t Unicode; this is the C octal escape format for a byte value.

Is it using a single backslash this time?

You said last time that what you presented as “Capture d’\U00e9cran \U00e0 Denis” was really “Capture d’\\U00e9cran \\U00e0 Denis”.

Denis_DUBOIS · March 13, 2022, 11:23am

Yes. Just a single backslash.

Rick_Araujo · March 13, 2022, 11:28am

So you have these 2 kinds of encodings: \\Uhhhh and \ooo correct?

Double backslash for unicode, single backslash for octal bytes, correct?

Denis_DUBOIS · March 13, 2022, 11:40am

No, only 1 type \ooo :

Capture d’\351cran \340 Denis >> “Capture d’écran à Denis”

Just decode the ASCII values after a single backslash.
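A minimal Python sketch of that idea, for illustration only (Python rather than Xojo; the function name is made up). It assumes each `\ooo` escape is exactly three octal digits no higher than `\377` and that the value is the character's code point, which matches the sample in this thread but is not confirmed for every possible output.

```python
import re

# A backslash followed by exactly three octal digits, limited to \000..\377.
_OCTAL = re.compile(r"\\([0-3][0-7]{2})")


def decode_octal_escapes(text: str) -> str:
    """Replace each \\ooo escape with the character whose code is that octal value."""
    return _OCTAL.sub(lambda m: chr(int(m.group(1), 8)), text)


print(decode_octal_escapes(r"Capture d’\351cran \340 Denis"))
```

Here `\351` is octal for 233 (é) and `\340` is octal for 224 (à).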

Rick_Araujo · March 13, 2022, 11:44am

How is it that you told us from the beginning that you had the encoding \Uhhhh, and later \\Uhhhh, and now you say that was a mistake and it’s only \ooo?

If you are having a hard time expressing yourself in English, write in French.

Rick_Araujo · March 13, 2022, 11:58am

I’ll tailor the function to just the 2 use cases you reported: \\Uhhhh and \ooo

Rick_Araujo · March 13, 2022, 12:10pm

I’m reading about the real results of this command, and they differ from what you said: it outputs the standard C encoding.

% defaults write com.example.encoding room -string "Baño😀" 
% defaults read com.example.encoding room                  
Ba\361o\ud83d\ude00

Which are \ooo \uhhhh and \Uhhhhhhhh

So I’ll adapt my code for the standard, and you will adapt your code to follow it.
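A minimal Python sketch of a decoder for those three forms follows, for illustration only (Python rather than Xojo; the function name is made up, and this is not the adapted code mentioned above). It assumes `\ooo` is limited to `\377`, that `\uhhhh` may arrive as a UTF-16 surrogate pair as in the sample output, and that `\Uhhhhhhhh` holds a valid code point. Escaped backslashes are not handled.

```python
import re

# \ooo (three octal digits), \uhhhh (four hex digits) or \Uhhhhhhhh (eight hex digits).
_ESCAPE = re.compile(r"\\(?:([0-3][0-7]{2})|u([0-9a-fA-F]{4})|U([0-9a-fA-F]{8}))")


def decode_c_escapes(text: str) -> str:
    """Decode \\ooo, \\uhhhh and \\Uhhhhhhhh escapes into the characters they name."""

    def replace(match: re.Match) -> str:
        octal, short_hex, long_hex = match.groups()
        if octal is not None:
            return chr(int(octal, 8))
        return chr(int(short_hex or long_hex, 16))

    decoded = _ESCAPE.sub(replace, text)
    # \ud83d\ude00 produces two separate surrogates; a UTF-16 round trip
    # joins each valid pair into the single character it represents.
    return decoded.encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass")


print(decode_c_escapes(r"Ba\361o\ud83d\ude00"))
```

The round trip through UTF-16 is one way to combine the surrogate halves; computing the code point arithmetically from the two halves would work equally well.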

Denis_DUBOIS · March 13, 2022, 12:23pm

Yes, I will adapt my code to follow it. Apparently there is no “\U” in the results (only \ASCII codes).
MessageBox(ur) gives “Capture \351cran \340 Denis”. Just this.

Rick_Araujo · March 13, 2022, 12:28pm

When a Unicode code point is higher than \377, it should be encoded as \uhhhh (lowercase “u”), even though three octal digits could reach \777. The characters in your example are not above \377, so they used the short \ooo form instead.

Denis_DUBOIS · March 13, 2022, 12:35pm

I think you just have to detect the backslash in the (ur) string and then convert each ASCII value after it (chr(value)) to get the accented letter, etc., working through the whole string.
Look :