ASCII Encoding question.

Michel_Bujardet · July 10, 2014, 10:01pm

[quote=110256:@Richard Summers]I had a reasonable understanding, and then you informed me that extended ASCII does not really exist.

That then confused me, as everywhere you seem to look - it says that standard ASCII uses (0-127), and that extended ASCII uses 128 upwards.

I therefore presumed that the symbol would be standardised if extended ASCII was used - but now there is NO extended ASCII??
:([/quote]

To make things even worse, “extended ASCII” is an obsolete notion from when fonts had only 256 characters. Today fonts can have thousands or even tens of thousands of characters (for Asian languages, for instance). The only encoding that is both universal and truly standardized is Unicode, which governs how the system fetches a character independently of its position in the font.

For instance, a system font encoded for Windows and one encoded for Macintosh will place the Euro symbol (as well as most accented characters) at different positions in the font. Yet both will display extended characters correctly on a Mac or on a Windows machine. I have fonts coded for Windows that work perfectly on Mac OS X, and others created for Mac that work fine on Windows, thanks to Unicode. So the Euro will display fine as well: it is encoded as &u20AC, so wherever it has been placed in the font, the system can find it by its Unicode value. It is indeed standardized, but not as any “extended ASCII”.

Sometimes, old free fonts found on the Internet comply only with old-fashioned encodings and do not use Unicode. Such fonts often display accented characters in a very weird way on modern systems.

Richard_Summers · July 10, 2014, 10:28pm

Woweeeee - that was some serious information!
Maybe I’ll come back in 20 years or so and try again

Thank you all for the information - much appreciated.

Tim_Hare · July 10, 2014, 10:42pm

So what was your reason for wanting to display the extended ascii value? And would displaying the unicode value work just as well? That can be done with

// DISPLAY THE UTF8 CODE POINT
  Dim ConvertedToUTF8Encoding as String = ConvertEncoding(inStringToConvert,Encodings.UTF8)
  Dim ConvertedCodePoint as Integer = Asc(ConvertedToUTF8Encoding)
  AsciiField.AppendText Str(ConvertedCodePoint)

Norman_P · July 10, 2014, 10:43pm

Text Encodings are not unlike anything else on a computer.
You have a bunch of “bytes”
If I told you it was a computer program - that’s minimally helpful.
You’d need to know for what CPU, OS etc in order to know what they really mean and maybe, if you were so inclined, to figure out what the program did. How to disassemble it. Or what “encoding” it was.

Text is kind of like that.
If you just have a bunch of bytes (which is all you ever really have) unless you know what “encoding” they are in they are just a bunch of bytes.
Once you know what encoding they are supposed to be in, you can go through them and turn them into the right “glyphs” on screen. There’s not a one-to-one match, as some encodings like UTF-8 may use several bytes to represent one “code point”, and some “code points” can be combined to make one “character” - so things like e and ´ can be combined to give you é.
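
A rough sketch of that byte / code point / character distinction, assuming the classic Xojo string functions (LenB) and that &u literals produce UTF-8 strings. The variable names are made up for illustration:

```xojo
// Two ways to spell the same accented letter:
// one precomposed code point, or a plain "e" followed by a combining accent.
Dim composed As String = &u00E9          // U+00E9, e with acute
Dim decomposed As String = "e" + &u0301  // U+0065 followed by U+0301, combining acute

// LenB counts bytes. In UTF-8, U+00E9 takes 2 bytes (C3 A9),
// while "e" takes 1 byte and U+0301 takes 2 bytes (CC 81).
MsgBox "composed: " + Str(LenB(composed)) + " bytes, decomposed: " + Str(LenB(decomposed)) + " bytes"
```

Both strings are meant to render as the same accented letter, even though the bytes behind them differ.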

It’s all horribly more complex than it really should be, BUT here’s the upside: you mostly don’t have to know HOW it all works.
99.9% of this is handled for you.
There are times when you have to worry about it - when you get data from outside of your application; say from a file, database, serial port, socket etc. In those cases you get “bytes” and you have to tell Xojo what encoding that bunch of bytes should be treated as - this is the notion of defining the encoding (DefineEncoding). That literally says to Xojo - that bunch of bytes should be treated as if it is in this encoding.

For something entirely IN Xojo you can use the code I posted - that one liner WILL give you the right Euro symbol.
But as Michel stated that does not mean it WILL show on screen.
A font may not have the glyph (the thing you and I see on screen or on paper) to display the character.
That can happen, but switching fonts may show it, as another font may have the glyph.

So the business of “bunch of bytes” to “string in my app that has a known encoding” to “thing the user sees on screen” involves a lot of software - most of which you can blissfully ignore.

Richard_Summers · July 10, 2014, 11:05pm

I was simply going to do a test:

When I clicked inside textfield1 and then pressed a key (any key) on my keyboard - I wanted textfield2 to display the ASCII value.
My original code worked fine for normal numbers and letters etc. (0-127).

However, when I pressed ALT-2 for example (€), it displayed a value of 226, which did not match the value on a website which claimed the Extended ASCII value was 128. link to the website

Therefore, I thought that maybe my code was only using the standard ASCII values, thus the false value being given.
However, thanks to you guys - I now realise that extended ASCII is nothing but a varied myth

You have all now cleared this up for me, and I will not try to ascertain the ASCII value of a depressed key, because above 127, it will be inconsistent.

Phew - hope that made my original objective clearer

Tim_Hare · July 10, 2014, 11:19pm

And that ConvertEncoding bit requires that you have previously defined the encoding correctly. And it is mostly not required. But to be absolutely certain, you can keep it in. Most of the time the strings in your program will be UTF8 already. You might get a string from outside the program that is in a different encoding. You would have to first Define the encoding and then Convert it to UTF8.
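
A minimal sketch of that define-then-convert sequence. The byte value and the choice of Encodings.WindowsANSI are assumptions for illustration only; in real code the bytes would come from a file, socket or database, and you would need to know which encoding they are actually in:

```xojo
// Pretend this single byte arrived from outside the program.
// ChrB builds a one-byte string with no encoding attached.
Dim rawBytes As String = ChrB(128)

// Step 1: tell Xojo how to interpret the bytes. The bytes themselves do not change.
Dim tagged As String = DefineEncoding(rawBytes, Encodings.WindowsANSI)

// Step 2: convert to UTF-8. The bytes are rewritten, the text stays the same.
Dim asUTF8 As String = ConvertEncoding(tagged, Encodings.UTF8)

// In the Windows code page, byte 128 is the Euro, so this comparison should succeed.
If asUTF8 = &u20AC Then
  MsgBox "Got a Euro sign: " + asUTF8
End If
```

Skipping step 1 leaves ConvertEncoding with nothing to convert from, which is why the encoding has to be defined first.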

Michel_Bujardet · July 10, 2014, 11:25pm

Since it was added to fonts in the 90s, the Euro symbol’s only certain value is its Unicode code point, 20AC. The site you refer to describes the 8-bit position (128) chosen at the time by Microsoft for Windows. At the same time, Apple chose position 219 (which was previously the “currency” sign ¤, which Microsoft places at 164). In early-90s fonts, then, there is no Euro symbol and you will probably get a rectangle. In modern fonts, thanks to the magic of Unicode, you will get the Euro symbol no matter whether it is at position 128 or 219.
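
To see those legacy positions next to the Unicode value, here is a sketch that assumes Encodings.WindowsANSI and Encodings.MacRoman map the Euro as described in this post. I have not checked it on every platform:

```xojo
// Start from the Unicode character and convert it to the two legacy encodings.
Dim euro As String = &u20AC
Dim asWindows As String = ConvertEncoding(euro, Encodings.WindowsANSI)
Dim asMac As String = ConvertEncoding(euro, Encodings.MacRoman)

// AscB returns the value of the first byte, whatever the encoding.
// Asc returns the code point of the first character in the string's encoding.
MsgBox "Windows byte: " + Str(AscB(asWindows)) + EndOfLine + _
  "Mac byte: " + Str(AscB(asMac)) + EndOfLine + _
  "Unicode code point: " + Str(Asc(euro))
```

If the positions quoted in this post are right, this should report 128, 219 and 8364 (hex 20AC).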

Actually, the value of 226 may be right as well, but you do not have to worry about it. Let the system deal with Unicode points transparently.

Encoding is like health : as long as it is good, nobody thinks about it. Only when health fails, or when an app starts displaying weirdos like black lozenges with a question mark over them, you think about aspirin and encoding

Richard_Summers · July 10, 2014, 11:56pm

I really appreciate all the info
However, I am not parsing any text from a file of any kind.

Example:
The code below will check for the Return key being pressed:

if asc(key)=13 then
  MsgBox("hello")
end if

All good so far.
However, supposing I wanted to check if the user entered the € symbol - according to my original code, I would need to check for the value of 226:

if asc(key)=226 then
  MsgBox("hello")
end if

However, that would not work, as above 127 - everything can be varied.
I’m presuming therefore, that because Xojo uses UTF-8 by default, that I only need to check the UTF-8 value of whatever key the user pressed???

Tim_Hare · July 11, 2014, 4:55am

Btw, the bytes that make up the € symbol are hex E2 82 AC, or in decimal 226 130 172. So you’re only getting the value of the first byte for some reason. That would be consistent with converting the encoding to ASCII first. If you leave the encoding alone, you should get 8364, or hex 20AC.
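
A small sketch of the difference between the first byte and the code point, assuming the classic AscB, MidB, LenB and Hex functions. Reading only the first byte (as AscB does) is one way to end up with 226; I can’t tell whether that is what the original code did:

```xojo
Dim euro As String = &u20AC
Dim hexBytes As String

// Walk the string one byte at a time and collect each byte as hex.
For i As Integer = 1 To LenB(euro)
  hexBytes = hexBytes + Hex(AscB(MidB(euro, i, 1))) + " "
Next

// AscB looks at the first byte only; Asc decodes the whole first character.
MsgBox "Bytes: " + hexBytes + EndOfLine + _
  "First byte (AscB): " + Str(AscB(euro)) + EndOfLine + _
  "Code point (Asc): " + Str(Asc(euro))
```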

And yes, in KeyDown, you should only get UTF8 strings, so you don’t have to worry about encodings or anything but UTF8 code point values.

Michel_Bujardet · July 11, 2014, 8:51am

[quote=110353:@Richard Summers]if asc(key)=226 then
  MsgBox("hello")
end if[/quote]

Why convert to ASCII ? Take advantage of Unicode and forget what is under the hood

Function KeyDown(Key As String) As Boolean
  if key = &u20AC then
    msgbox key
  end if
End Function

Michael_Hußmann · July 11, 2014, 8:53am

[quote=110353:@Richard Summers]However, supposing I wanted to check if the user entered the € symbol - according to my original code, I would need to check for the value of 226:

if asc(key)=226 then
  MsgBox("hello")
end if

However, that would not work, as above 127 - everything can be varied.
I’m presuming therefore, that because Xojo uses UTF-8 by default, that I only need to check the UTF-8 value of whatever key the user pressed???[/quote]

Actually you don’t need to worry about code points at all; just test for the Euro symbol:

if key="€" then
  MsgBox("hello")
end if

This will work regardless of the encoding, provided the encoding is known.

Michael_Hußmann · July 11, 2014, 9:17am

Put differently: if you want to know whether some character is the Euro symbol, then compare it against the Euro symbol. Simple as that. What your code does translates to: “Is the code point of the character equal to 226, which happens to be the code point of the Euro symbol in the encoding I assume the string to be in (although I didn’t actually check)?” That is needlessly complex and brittle (as it relies on an assumption).

Richard_Summers · July 11, 2014, 9:18am

Thanks Michael, and everyone else.

So if I am checking for the return key, spacebar, tab key etc. - I need to check for the UTF-8 values (13, 32, 9 - because Xojo uses UTF-8 by default), but if checking for certain keys (A, B, C, etc.), I can check directly.

Is that correct?
I was just unsure if I needed to check ONLY for UTF-8 values when checking for keys such as tab and return etc.

Tim_Hare · July 11, 2014, 5:01pm

You need to check the ASC value, because you cannot easily type those characters (except space) into your code. That is the only reason to use ASC - to check for non-printable characters, like return and tab. Any printable character, such as €, you can check directly.
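
Putting the two kinds of check side by side, as a sketch of a TextField KeyDown handler. The message texts are placeholders:

```xojo
Function KeyDown(Key As String) As Boolean
  // Non-printable keys: compare code points, since they cannot be typed as literals.
  Select Case Asc(Key)
  Case 13
    MsgBox "Return"
  Case 9
    MsgBox "Tab"
  End Select

  // Printable characters: compare against the character itself.
  If Key = "€" Then
    MsgBox "Euro"
  End If

  // False lets the keystroke through to the field as usual.
  Return False
End Function
```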

This has drifted a long way from the original question.

Richard_Summers · July 11, 2014, 6:41pm

Ok,
so when checking for non-printable characters, why do I have to check for ASC values and not UTF-8, if UTF-8 is what Xojo uses by default?

It has drifted, but still along the same lines.

Michel_Bujardet · July 11, 2014, 7:05pm

As Wikipedia puts it so well in its UTF-8 article:

[quote]UTF-8 (UCS Transformation Format, 8-bit) is a variable-width encoding that can represent every character in the Unicode character set.[/quote] Non-printable characters cannot be represented (displayed), hence they have no Unicode value.

Steve_Wilson · July 11, 2014, 7:19pm

[quote=110529:@Richard Summers]Ok,
so when checking for non-printable characters, why do I have to check for ASC values and not UTF-8, if UTF-8 is what Xojo uses by default?[/quote]Stick with UTF8, I think Tim inadvertently confused you.

UTF8 does include non-printable characters.

[quote=110353:@Richard Summers]Example:
The code below will check for the Return key being pressed:

if asc(key)=13 then
  MsgBox("hello")
end if

All good so far.
However, supposing I wanted to check if the user entered the € symbol - according to my original code, I would need to check for the value of 226:

if asc(key)=226 then
  MsgBox("hello")
end if

However, that would not work, as above 127 - everything can be varied.
I’m presuming therefore, that because Xojo uses UTF-8 by default, that I only need to check the UTF-8 value of whatever key the user pressed???[/quote]

You could turn these around if it helps:

// Chr takes a code point: 13 is Return, and the Euro is &h20AC (8364), not 226
If key = Encodings.UTF8.Chr(13) Then MsgBox "tis but a scratch."
If key = Encodings.UTF8.Chr(&h20AC) Then MsgBox "A scratch? Your arm's off."

Tim_Hare · July 11, 2014, 8:05pm

ASC <> ASCII

Syed_Hassan · July 11, 2014, 10:08pm

The “Variables” view during breakpoints shows that the string variable Key passed to the KeyDown event has US-ASCII encoding for ASCII control characters and UTF-8 encoding for the rest of the characters (note that not all characters were tested with a breakpoint).

Character £ is encoded in UTF-8 having two octets C2 A3. Backspace is encoded in US-ASCII having single octet 08. Letter A is encoded in UTF-8 having single octet 41.
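
Those octets can be reproduced with a short sketch, assuming TextEncoding.Chr, Asc, AscB and LenB behave as documented. It builds all three characters explicitly as UTF-8, control character included, since Unicode assigns code points U+0000 to U+001F to the control characters and UTF-8 stores every code point below 128 as the same single byte as ASCII:

```xojo
Dim backspace As String = Encodings.UTF8.Chr(8)   // U+0008, a control character
Dim letterA As String = Encodings.UTF8.Chr(65)    // U+0041
Dim pound As String = Encodings.UTF8.Chr(163)     // U+00A3

// Below 128: one byte, and the code point (Asc) equals the byte value (AscB).
MsgBox "Backspace: " + Str(LenB(backspace)) + " byte, Asc = " + Str(Asc(backspace)) + ", AscB = " + Str(AscB(backspace))
MsgBox "A: " + Str(LenB(letterA)) + " byte, Asc = " + Str(Asc(letterA)) + ", AscB = " + Str(AscB(letterA))

// From 128 up: more than one byte, so code point and first byte differ.
MsgBox "£: " + Str(LenB(pound)) + " bytes, Asc = " + Str(Asc(pound)) + ", AscB = " + Str(AscB(pound))
```

So for keys below 128 it makes no practical difference whether the debugger labels the string US-ASCII or UTF-8: the byte and the Asc value are the same either way.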

Michel_Bujardet · July 11, 2014, 11:02pm

UTF-8 is meant to display Unicode glyphs. By definition a non-displayable character has no Unicode code point. So control characters cannot be UTF-8.

The only valid reference for Unicode points is provided by the Unicode Consortium, at http://www.unicode.org and a table for basic Latin is at http://www.unicode.org/charts/PDF/U0000.pdf which shows control characters for convenience, but without any glyph representation. In a font, control characters have simply no code point at all, especially not the same as their ASCII value …

So checking for control keys has to be done with ASC, whereas it is recommended to check displayable characters by glyph (UTF-8 string).