Format() and Str() return strings in ASCII encoding, not UTF8

I was surprised to discover that the Format and Str functions return strings encoded as US-ASCII, not UTF8. Is there a good reason for this?

For example:

[code]Dim Astr As String
Dim Bstr As String

Astr = Format(45, "#######")
Bstr = Str(45)[/code]

Both Astr and Bstr are US-ASCII encoded.

I’m just curious why this is an exception to the default UTF8 encoding.

Well, if there are no non-ASCII characters, why not mark it as ASCII?

[code]Sub Action() Handles Action
Dim enc As TextEncoding

Dim s As String
MsgBox s.Encoding.internetName // US-ASCII

s = "Whatever"
MsgBox s.Encoding.internetName // UTF-8

s = Str( "Whatever" )
MsgBox s.Encoding.internetName // UTF-8

s = Format( 45, "####" )
MsgBox s.Encoding.internetName // US-ASCII
End Sub[/code]

But guys, he’s only asking why this is.
I’m curious too.

Me too. Seems illogical.

ASCII is a subset of UTF-8. UTF-8 includes all of US-ASCII as the first 128 code points.
If you concatenate it with a string that uses code points outside the US-ASCII set, you get a UTF-8 string:

[code]Dim s As String
s = Format( 45, "####" )
MsgBox s.Encoding.internetName // US-ASCII
s = s + &u255
MsgBox s.Encoding.internetName // UTF-8[/code]

@Jason: that doesn’t exactly answer the question…

One byte instead of two?

Because it literally doesn’t matter at all that it is US-ASCII.

What I’m curious about is that the documentation states Strings default to UTF-8, so why does Str() return one that’s ASCII?

UTF-8 is a variable-length encoding, and for code points up to 127 it uses a single byte, exactly the same as US-ASCII - that’s part of the definition of UTF-8.

The only difference is that one says “US-ASCII” and the other says “UTF-8”. Functionally the bytes are identical - if you concatenate strings, they do the right thing.

That’s the beauty of UTF-8: code points below 128 are a single byte and identical to ASCII. That makes it much easier for programmers (like me) who used to think of strings as sequences of single bytes. Up to 127, they still are, and that’s why UTF-8 as the default encoding in Xojo was such a good choice.
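
To see that in action, here’s a minimal sketch (classic Xojo syntax; the Format pattern is just an example): dump the bytes of Format’s ASCII-labelled result alongside the same string converted to UTF-8 and they come out identical.

[code]// Sketch: the ASCII-labelled result of Format and the same string
// converted to UTF-8 contain exactly the same bytes.
Dim s As String = Format(45, "####")                  // labelled US-ASCII
Dim u As String = ConvertEncoding(s, Encodings.UTF8)  // labelled UTF-8

Dim i As Integer
Dim hexS, hexU As String
For i = 1 To LenB(s)
  hexS = hexS + Hex(AscB(MidB(s, i, 1))) + " "
Next
For i = 1 To LenB(u)
  hexU = hexU + Hex(AscB(MidB(u, i, 1))) + " "
Next

// Both dumps show the same byte values; only the encoding label differs.
MsgBox s.Encoding.internetName + ": " + hexS
MsgBox u.Encoding.internetName + ": " + hexU[/code]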

Not so fast.
I know of a couple of apps that expect a text file and choke when they get a UTF-8 file, because they expect an ANSI one.
They look the same in Notepad and don’t have any double-byte chars,
but the actual file begins with a couple of bytes BEFORE the first letter which you cannot see by eye.
Off topic for the OP’s question, but just to note… the same text ABCDEF in a UTF-8 formatted file is not the same file as the ANSI version of the same. Single bytes or no.

Those first few bytes are known as a BOM (byte order mark) and are not part of the UTF-8 data itself. Whatever software is creating these files is putting it in there, and other software may use it to know how the data is written.

see: https://en.wikipedia.org/wiki/Byte_order_mark
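
If you need to cope with such files in Xojo, here’s a minimal sketch (the function name is mine, not a built-in): it strips a leading UTF-8 BOM and labels the remaining data as UTF-8.

[code]// Sketch, not a built-in: remove a leading UTF-8 BOM (bytes EF BB BF)
// from data read out of a file and label the rest as UTF-8.
Function StripUTF8BOM(data As String) As String
  Dim bom As String = ChrB(&hEF) + ChrB(&hBB) + ChrB(&hBF)
  If StrComp(LeftB(data, 3), bom, 0) = 0 Then
    Return DefineEncoding(MidB(data, 4), Encodings.UTF8)
  End If
  Return data
End Function[/code]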

[quote=375356:@Jeff Tullin]Not so fast.
I know of a couple of apps that expect a text file and choke when they get a UTF-8 file, because they expect an ANSI one.
They look the same in Notepad and don’t have any double-byte chars,
but the actual file begins with a couple of bytes BEFORE the first letter which you cannot see by eye.
Off topic for the OP’s question, but just to note… the same text ABCDEF in a UTF-8 formatted file is not the same file as the ANSI version of the same. Single bytes or no.[/quote]
This has nothing to do with Xojo. If your Xojo program writes the file, it will not contain the BOM unless you explicitly write it. As Jason points out, whatever program you’re using to create the file is explicitly writing a BOM marker into it.
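
And if one of those picky readers actually requires a BOM, you write it yourself. A minimal sketch (the file name and location are just placeholders):

[code]// Sketch with a placeholder file: TextOutputStream writes no BOM on its
// own, so prepend the three BOM bytes explicitly if a reader requires them.
Dim f As FolderItem = SpecialFolder.Desktop.Child("out.txt") // placeholder target
Dim t As TextOutputStream = TextOutputStream.Create(f)
t.Write(ChrB(&hEF) + ChrB(&hBB) + ChrB(&hBF)) // explicit UTF-8 BOM
t.Write("ABCDEF")                             // the text itself, UTF-8
t.Close[/code]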

In fact you’re asking why the encoding is ASCII instead of UTF8. Seems to me that’s a bug. Of course the content of the bytes is the same.

Except Jason (a Xojo engineer) just said it’s not a bug.

Can you think of a case where this statement would be wrong?

To put it another way, if you write a string encoded as ASCII and later read it as UTF-8, it would be 100% correct. If you concatenate an ASCII-encoded string to UTF-8, you’d get UTF-8 and it too would be 100% correct. Below code point 128, ASCII and UTF-8 are interchangeable, and ASCII-encoded strings are always below 128, so what does it matter?

(The assumption is that you are working with strings that Xojo encoded for you internally. Defining improper encodings in code is problematic no matter the encodings involved.)
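
A minimal sketch of that point, relabelling the same bytes with DefineEncoding:

[code]// Sketch: the same bytes labelled US-ASCII and then relabelled UTF-8
// compare as the same text, so reading ASCII output as UTF-8 loses nothing.
Dim a As String = Format(45, "####")                 // labelled US-ASCII
Dim b As String = DefineEncoding(a, Encodings.UTF8)  // same bytes, relabelled UTF-8
If a = b Then
  MsgBox "Identical text either way"
End If[/code]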

[quote=375746:@Kem Tekinay]Except Jason (a Xojo engineer) just said it’s not a bug.

Can you think of a case where this statement would be wrong?
[/quote]
Whether it would hurt or not I don’t know. At best it’s inconsistent; perhaps the docs should be changed.

Because it is inconsistent, contradicts the documentation, and confuses people (as demonstrated by this thread).

[Tim beat me by a few seconds :wink: ]