How to convert a string to Unicode?

Hi,
I need to write a string to a .plist file, but the French accents are encoded in Unicode v1.1.0.
https://www.fileformat.info/info/unicode/version/1.1/index.htm
For example, the French string “Capture d’écran” becomes “Capture d\U2019e\U0301cran”.
How do I convert my string? Does anyone have a conversion routine?
I saw the ‘GetTextConverter’ and ‘GetTextEncoding’ commands…
Thanks

Are you trying to add to the plist, or to read from it ?

And by the way, U2019 and U0301 are NOT Unicode.

Where does your string come from? Do you read it in from a file? And when you say that the first string “becomes” the second, how is this transformation happening? Once you explain this it may be possible to suggest some code.

Of course they are Unicode:

U+2019	’	e2 80 99	RIGHT SINGLE QUOTATION MARK
U+0301 	́	cc 81	    COMBINING ACUTE ACCENT

Why do you keep saying they aren’t ?
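
For anyone who wants to check the two rows above, here is a minimal sketch. It is written in Python rather than Xojo, purely to illustrate the Unicode facts; `describe` is a hypothetical helper name, not something from this thread.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return code point, UTF-8 bytes (hex) and official Unicode name of one character."""
    code_point = f"U+{ord(ch):04X}"
    utf8_hex = ch.encode("utf-8").hex(" ")  # bytes.hex(sep) needs Python 3.8+
    return f"{code_point} {utf8_hex} {unicodedata.name(ch)}"

print(describe("\u2019"))  # U+2019 e2 80 99 RIGHT SINGLE QUOTATION MARK
print(describe("\u0301"))  # U+0301 cc 81 COMBINING ACUTE ACCENT
```

Both values are ordinary Unicode code points; the `\U2019` form in the plist is just an escaped way of writing them.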

Is there a command to retrieve the Unicode code point of a character (the equivalent of Asc(char) for ASCII)?
It may be the beginning of a solution…

Asc(char) will return the Unicode code point for char. It’s just badly named.

According to the doc, only for ASCII characters.

Assuming your source has a given known encoding.

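Var cSource As String = ???? // the text as read from your source
Var cUnicode As String

// Tell Xojo which encoding the bytes are already in (this does not change the bytes)
cSource = cSource.DefineEncoding( Encodings.WhatEverItIs )
// Re-encode the text as UTF-8
cUnicode = cSource.ConvertEncoding( Encodings.UTF8 )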

https://documentation.xojo.com/api/text/encoding_text/defineencoding.html

https://documentation.xojo.com/api/text/encoding_text/convertencoding.html

From the language reference:

The Asc function returns the code point for the first character in the passed String in the character’s encoding. Characters 0 through 127 are the standard ASCII set, which are the same on practically every encoding.

MessageBox(Asc("é").ToString) shows 233, not U+0301!
UTF-16 encoding and Unicode v1.1.0 should be considered.

If you want UTF16:

cUnicode = cSource.ConvertEncoding( Encodings.UTF16 )

Don’t try and convert the string yourself.

Right, Unicode is funny that way. There are two ways to represent that character (and others). One uses a single code point, the other uses two or more. The former is “composed”, the latter “decomposed”. Both are valid, but they are a headache for those of us who have to deal with these things behind the scenes.

I put out a Unicode Normalization package, if you’re interested. That will convert composed strings to decomposed and vice versa.
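
To make the composed/decomposed distinction concrete, here is a rough sketch using Python's standard `unicodedata` module (not Xojo and not the normalization package mentioned above). `code_points` and `plist_escape` are hypothetical helpers; in particular, `plist_escape` only assumes that the plist writer emits non-ASCII characters as `\Uxxxx`, based on the example in the first post.

```python
import unicodedata

def code_points(s: str) -> list:
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]

def plist_escape(s: str) -> str:
    """Assumed escape style: ASCII kept as-is, everything else as \\Uxxxx (4 hex digits, BMP only)."""
    return "".join(c if ord(c) < 128 else f"\\U{ord(c):04X}" for c in s)

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# NFD splits composed characters apart; NFC puts them back together.
print(code_points(unicodedata.normalize("NFD", composed)))    # ['U+0065', 'U+0301']
print(code_points(unicodedata.normalize("NFC", decomposed)))  # ['U+00E9']

# The string from the first post, typed with a composed é, then decomposed and escaped.
original = "Capture d\u2019\u00e9cran"
print(plist_escape(unicodedata.normalize("NFD", original)))   # Capture d\U2019e\U0301cran
```

If that assumption holds, the plist text is simply the decomposed form of the same string, and normalizing to the composed form (NFC) after reading it back should restore the single-code-point é.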

Var u As String = "é" // Utf-8 bytes &hC3 + &hA9
Var l As Integer = u.Bytes // 2
Var n As Integer = Asc(u) // &hC3

I get &hE9 with that code, which is right.

Thanks.
Here is some data to help you:

’ : U+2019 : RIGHT SINGLE QUOTATION MARK (UTF-16: 0x2019)
é : U+0301 : COMBINING ACUTE ACCENT (UTF-16: 0x0301)
ê : U+0302 : COMBINING CIRCUMFLEX ACCENT

Yes it would be:

U+00E9 é c3 a9 LATIN SMALL LETTER E WITH ACUTE

E9 is 233 in decimal.

Thanks, I’d overlooked that. I think that doc item could be improved.

I don’t need help. I was just sending some info.

To be fair, I just don’t get what your problem is. What do you want to achieve?

Denis, again: what is it you are trying to do ?

Are you trying to read from a plist ? Could you post your code ?

Oh yes, not bad! Which command converts 233 (decimal) to E9 (hexadecimal)?
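
The decimal-to-hexadecimal step is just a number formatting question. Here is a small sketch in Python to show the arithmetic; as far as I know, Xojo's built-in Hex function is the closest equivalent, but treat that as an assumption and check the language reference.

```python
value = 233                    # code point of é, in decimal

hex_text = format(value, "X")  # uppercase hexadecimal, no prefix
back = int(hex_text, 16)       # parse the hex text back into a number

print(hex_text)    # E9
print(back)        # 233
print(chr(value))  # é
```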

This is a good question which I also asked upthread (where does the data come from?) but @Denis_DUBOIS hasn’t addressed it yet. I suspect the answer will involve what @Ian_Kennedy suggested upthread - a mixture of DefineEncoding and ConvertEncoding.