Format() and Str() return strings in ASCII encoding, not UTF8

I was surprised to discover that the Format and Str functions return strings encoded as US-ASCII, not UTF8. Is there a good reason for this?

For example:

[code]Dim Astr As String
Dim Bstr As String

Astr = Format(45, "#######")
Bstr = Str(45)[/code]

Both Astr and Bstr are US-ASCII encoded.

I’m just curious why this is an exception to the default UTF8 encoding.

Well, if there are no non-ASCII characters, why not mark it as ASCII?

[code]Sub Action() Handles Action
Dim enc As TextEncoding

Dim s As String
MsgBox s.Encoding.internetName // US-ASCII

s = "Whatever"
MsgBox s.Encoding.internetName // UTF-8

s = Str( "Whatever" )
MsgBox s.Encoding.internetName // UTF-8

s = Format( 45, "####" )
MsgBox s.Encoding.internetName // US-ASCII
End Sub[/code]

But guys, he’s only asking why this is.
I’m curious too.

Me too. Seems illogical.

ASCII is a subset of UTF-8. UTF-8 includes all of US-ASCII as the first 128 code points.
If you concatenate it with a string that uses code points outside the US-ASCII set, you get a UTF-8 string:

[code]Dim s As String
s = Format( 45, "####" )
MsgBox s.Encoding.internetName // US-ASCII
s = s + &u255
MsgBox s.Encoding.internetName // UTF-8[/code]

@Jason: that doesn’t exactly answer the question…

One byte instead of two?

Because it literally doesn’t matter at all that it is US-ASCII.

What I’m curious about is that the documentation states Strings default to UTF-8, so why does Str() return one that’s ASCII?

UTF-8 is a variable-length encoding, and for code points up to 127 it uses a single byte, exactly the same as US-ASCII - that’s part of the definition of UTF-8.

The only difference is that one says “US-ASCII” and the other says “UTF-8”. Functionally the bytes are identical - if you concatenate strings, they do the right thing.

That’s the beauty of UTF-8: code points below 128 are a single byte and identical to ASCII. That makes it much easier for programmers (like me) who used to think of strings as sequences of single bytes. Up to 127, they still are, and that’s why UTF-8 as the default encoding in Xojo was such a good choice.
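
To see that in action, here’s a minimal sketch (classic Xojo syntax; the Format pattern is just an example): dump the bytes of Format’s ASCII-labelled result alongside the same string converted to UTF-8 and they come out identical.

[code]// Sketch: the ASCII-labelled result of Format and the same string
// converted to UTF-8 contain exactly the same bytes.
Dim s As String = Format(45, "####")                  // labelled US-ASCII
Dim u As String = ConvertEncoding(s, Encodings.UTF8)  // labelled UTF-8

Dim i As Integer
Dim hexS, hexU As String
For i = 1 To LenB(s)
  hexS = hexS + Hex(AscB(MidB(s, i, 1))) + " "
Next
For i = 1 To LenB(u)
  hexU = hexU + Hex(AscB(MidB(u, i, 1))) + " "
Next

// Both dumps show the same byte values; only the encoding label differs.
MsgBox s.Encoding.internetName + ": " + hexS
MsgBox u.Encoding.internetName + ": " + hexU[/code]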

Not so fast.
I know of a couple of apps that expect a text file and choke when they get a UTF-8 file, because they expect an ANSI one.
They look the same in Notepad and don’t have any double-byte chars,
but the actual file begins with a couple of bytes BEFORE the first letter which you cannot see by eye.
Off topic for the OP’s question, but just to note… the same text ABCDEF in a UTF-8 formatted file is not the same file as the ANSI version of the same. Single bytes or no.

Those first few bytes are known as a BOM (byte order mark) and are not part of the UTF-8 data itself. Whatever software is creating these files is putting it in there, and other software may use it to know how the data is written.

see: https://en.wikipedia.org/wiki/Byte_order_mark
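
If you need to cope with such files in Xojo, here’s a minimal sketch (the function name is mine, not a built-in): it strips a leading UTF-8 BOM and labels the remaining data as UTF-8.

[code]// Sketch, not a built-in: remove a leading UTF-8 BOM (bytes EF BB BF)
// from data read out of a file and label the rest as UTF-8.
Function StripUTF8BOM(data As String) As String
  Dim bom As String = ChrB(&hEF) + ChrB(&hBB) + ChrB(&hBF)
  If StrComp(LeftB(data, 3), bom, 0) = 0 Then
    Return DefineEncoding(MidB(data, 4), Encodings.UTF8)
  End If
  Return data
End Function[/code]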

[quote=375356:@Jeff Tullin]Not so fast.
I know of a couple of apps that expect a text file and choke when they get a UTF-8 file, because they expect an ANSI one.
They look the same in Notepad and don’t have any double-byte chars,
but the actual file begins with a couple of bytes BEFORE the first letter which you cannot see by eye.
Off topic for the OP’s question, but just to note… the same text ABCDEF in a UTF-8 formatted file is not the same file as the ANSI version of the same. Single bytes or no.[/quote]
This has nothing to do with Xojo. If your Xojo program writes the file, it will not contain the BOM unless you explicitly write it. As Jason points out, whatever program you’re using to create the file is explicitly writing a BOM marker into it.
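
And if one of those picky readers actually requires a BOM, you write it yourself. A minimal sketch (the file name and location are just placeholders):

[code]// Sketch with a placeholder file: TextOutputStream writes no BOM on its
// own, so prepend the three BOM bytes explicitly if a reader requires them.
Dim f As FolderItem = SpecialFolder.Desktop.Child("out.txt") // placeholder target
Dim t As TextOutputStream = TextOutputStream.Create(f)
t.Write(ChrB(&hEF) + ChrB(&hBB) + ChrB(&hBF)) // explicit UTF-8 BOM
t.Write("ABCDEF")                             // the text itself, UTF-8
t.Close[/code]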

In fact you’re asking why the encoding is ASCII instead of UTF8. Seems to me that’s a bug. Of course the content of the bytes is the same.

Except Jason (a Xojo engineer) just said it’s not a bug.

Can you think of a case where this statement would be wrong?

To put it another way, if you write a string encoded as ASCII and later read it as UTF-8, it would be 100% correct. If you concatenate an ASCII-encoded string to UTF-8, you’d get UTF-8 and it too would be 100% correct. Below code point 128, ASCII and UTF-8 are interchangeable, and ASCII-encoded strings are always below 128, so what does it matter?

(The assumption is that you are working with strings that Xojo encoded for you internally. Defining improper encodings in code is problematic no matter the encodings involved.)
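
A minimal sketch of that point, relabelling the same bytes with DefineEncoding:

[code]// Sketch: the same bytes labelled US-ASCII and then relabelled UTF-8
// compare as the same text, so reading ASCII output as UTF-8 loses nothing.
Dim a As String = Format(45, "####")                 // labelled US-ASCII
Dim b As String = DefineEncoding(a, Encodings.UTF8)  // same bytes, relabelled UTF-8
If a = b Then
  MsgBox "Identical text either way"
End If[/code]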

[quote=375746:@Kem Tekinay]Except Jason (a Xojo engineer) just said it’s not a bug.

Can you think of a case where this statement would be wrong?
[/quote]
Whether it would hurt or not I don’t know. At best it’s inconsistent; perhaps the docs should be changed.

Because it is inconsistent, contradicts the documentation, and confuses people (as demonstrated by this thread).

[Tim beat me by a few seconds :wink: ]