How to convert a string to Unicode?

Denis_DUBOIS · March 11, 2022, 3:41pm

Hi,
I must write a string to a .plist file, but the French accents are encoded in Unicode v1.1.0.
https://www.fileformat.info/info/unicode/version/1.1/index.htm
For example, the French string “Capture d’écran” becomes “Capture d\U2019e\U0301cran”.
How do I convert my string? Does anyone have a conversion routine?
I saw the ‘GetTextConverter’ and ‘GetTextEncoding’ commands…
Thanks

Michel_Bujardet · March 11, 2022, 4:20pm

Are you trying to add to the plist, or to read from it ?

And by the way, U2019 and U0301 are NOT Unicode.

TimStreater · March 11, 2022, 5:59pm

Where does your string come from? Do you read it in from a file? And when you say that the first string “becomes” the second, how is this transformation happening? Once you explain this it may be possible to suggest some code.

Of course they are Unicode:

U+2019	’	e2 80 99	RIGHT SINGLE QUOTATION MARK
U+0301 	́	cc 81	    COMBINING ACUTE ACCENT

Why do you keep saying they aren’t ?

Denis_DUBOIS · March 11, 2022, 6:38pm

Of course they are Unicode:

Is there a command to retrieve the Unicode code point of a character (the equivalent of Asc(char) for ASCII)?
It may be the beginning of a solution…

Kem_Tekinay · March 11, 2022, 6:45pm

Asc(char) will return the Unicode code point for char. It’s just badly named.

TimStreater · March 11, 2022, 6:50pm

According to the doc, only for ASCII characters.

Ian_Kennedy · March 11, 2022, 6:52pm

Assuming your source has a given known encoding.

Var cSource as String = ????
Var cUnicode as String

cSource = cSource.DefineEncoding( Encodings.WhatEverItIs )
cUnicode = cSource.ConvertEncoding( Encodings.UTF8 )

https://documentation.xojo.com/api/text/encoding_text/defineencoding.html

https://documentation.xojo.com/api/text/encoding_text/convertencoding.html

Kem_Tekinay · March 11, 2022, 6:52pm

From the language reference:

The Asc function returns the code point for the first character in the passed String in the characters encoding. Characters 0 through 127 are the standard ASCII set, which are the same on practically every encoding.

Denis_DUBOIS · March 11, 2022, 6:52pm

MessageBox(Asc("é").ToString) shows 233, not U+0301!
UTF-16 encoding and Unicode v1.1.0 should be taken into account.

Ian_Kennedy · March 11, 2022, 6:54pm

If you want UTF16:

cUnicode = cSource.ConvertEncoding( Encodings.UTF16 )

Don’t try to convert the string yourself.

Kem_Tekinay · March 11, 2022, 6:57pm

Right, Unicode is funny that way. There are two ways to represent that character (and others). One is a single code point, and the other uses two (or more) code points. The former is “composed”, the latter “decomposed”. Both are right, but they are a headache for those of us who have to deal with these things behind the scenes.

I put out a Unicode Normalization package, if you’re interested. That will convert composed strings to decomposed and vice versa.
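
As a minimal sketch of the two forms, assuming UTF-8 strings and using only Encodings.UTF8.Chr and String.Bytes (the variable names are just for illustration):

```xojo
// Composed form: the single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
Var composed As String = Encodings.UTF8.Chr(&hE9)

// Decomposed form: a plain "e" followed by U+0301 (COMBINING ACUTE ACCENT)
Var decomposed As String = "e" + Encodings.UTF8.Chr(&h0301)

// Both display as "é", but the underlying UTF-8 bytes differ
Var composedBytes As Integer = composed.Bytes     // 2 bytes: c3 a9
Var decomposedBytes As Integer = decomposed.Bytes // 3 bytes: 65 cc 81
```

The plist sample in the first post ("e\U0301") is the decomposed form, while a typed "é" is usually the composed one, which would explain why Asc reports 233 instead of U+0301.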

Rick_Araujo · March 11, 2022, 6:57pm

Var u As String = "é" // Utf-8 bytes &hC3 + &hA9
Var l As Integer = u.Bytes // 2
Var n As Integer = Asc(u) // &hC3

Kem_Tekinay · March 11, 2022, 7:00pm

I get &hE9 with that code, which is right.

Denis_DUBOIS · March 11, 2022, 7:00pm

Thanks.
I have some data to help you:

’ : U+2019 : RIGHT SINGLE QUOTATION MARK (UTF-16: 0x2019)
é : U+0301 : COMBINING ACUTE ACCENT (UTF-16: 0x0301)
ê : U+0302 : COMBINING CIRCUMFLEX ACCENT

TimStreater · March 11, 2022, 7:01pm

Yes it would be:

U+00E9 é c3 a9 LATIN SMALL LETTER E WITH ACUTE

E9 is 233 in decimal.

TimStreater · March 11, 2022, 7:02pm

Thanks, I’d overlooked that. I think that doc item could be improved.

Rick_Araujo · March 11, 2022, 7:02pm

I don’t need help. I was just sending some info.

To be fair, I just don’t get what your problem is. What do you want to achieve?

Michel_Bujardet · March 11, 2022, 7:05pm

Denis, again: what is it you are trying to do ?

Are you trying to read from a plist ? Could you post your code ?

Denis_DUBOIS · March 11, 2022, 7:05pm

Oh yes, not bad! Which command converts 233 (decimal) to E9 (hexadecimal)?

TimStreater · March 11, 2022, 7:07pm

This is a good question which I also asked upthread (where does the data come from?) but @Denis_DUBOIS hasn’t addressed it yet. I suspect the answer will involve what @Ian_Kennedy suggested upthread - a mixture of DefineEncoding and ConvertEncoding.
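
For the decimal-to-hex part, Hex(233) returns "E9". Building on that, here is a rough sketch of how the \U escapes might be produced, assuming a recent API 2.0 release (for String.Characters) and a source string whose accented letters are precomposed. EscapeNonASCII is a hypothetical helper name, not a Xojo function, and whether the plist consumer accepts these escapes is also an assumption:

```xojo
// Hypothetical helper (not part of Xojo): writes every non-ASCII
// code point as a \Uxxxx escape and leaves ASCII characters untouched.
Function EscapeNonASCII(source As String) As String
  // Work on a UTF-8 copy so that Asc returns Unicode code points
  Var s As String = source.ConvertEncoding(Encodings.UTF8)

  Var result As String
  For Each ch As String In s.Characters
    Var code As Integer = Asc(ch)
    If code < 128 Then
      result = result + ch
    Else
      // Hex(233) returns "E9"; pad to at least four digits
      Var h As String = Hex(code)
      While h.Length < 4
        h = "0" + h
      Wend
      result = result + "\U" + h
    End If
  Next
  Return result
End Function
```

With a precomposed "é" this would emit \U00E9 rather than the e\U0301 pair shown in the first post. Getting the decomposed form would need a normalization step first, such as the package @Kem_Tekinay mentioned, since the sketch does not decompose anything itself.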