ASCII is a subset of UTF-8. UTF-8 includes all of US-ASCII as the first 128 code points.
If you concatenate a US-ASCII string with a string that uses code points outside the US-ASCII set, you get a UTF-8 string:
Dim s As String
s = Format( 45, "####" )
MsgBox s.Encoding.internetName // US-ASCII
s = s + &u255
MsgBox s.Encoding.internetName // UTF-8
UTF-8 is a variable-length encoding, and for code points up to 127 it uses a single byte, exactly the same as US-ASCII. That’s part of the definition of UTF-8.
The only difference is that one says US-ASCII and the other says UTF-8. The bytes are functionally identical: if you concatenate the strings, they do the right thing.
That’s the beauty of UTF-8: code points below 128 are single bytes and the same as ASCII. That makes it much easier for programmers (like me) who used to think of strings as a sequence of single bytes. Up to 127, they still are, and that’s why UTF-8 as the default encoding in Xojo was such a good choice.
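To check this at the byte level, here is a minimal sketch. It is in Python rather than Xojo (an assumption made only so the snippet can be run anywhere), and the variable names are illustrative. It mirrors the example above: an ASCII-only string has the same bytes under both encodings, and appending U+0255 leaves those bytes untouched.

```python
# Illustrative only: Python stands in for Xojo here; all names are hypothetical.
ascii_text = "45"  # stand-in for the string built with Format in the Xojo example

ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")

# Code points below 128 encode to the same single byte in both encodings.
same_bytes = ascii_bytes == utf8_bytes

# Appending U+0255 (the &u255 literal in the Xojo example) requires UTF-8.
combined = ascii_text + "\u0255"
combined_bytes = combined.encode("utf-8")

# The ASCII prefix is unchanged; only the new code point is multi-byte.
prefix_unchanged = combined_bytes.startswith(ascii_bytes)
extra_byte_count = len(combined_bytes) - len(ascii_bytes)

print(same_bytes, prefix_unchanged, extra_byte_count)
```

The ASCII part of the string never has to be re-encoded; the result simply gets the wider label.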
Not so fast.
I know of a couple of apps that expect a text file and choke when they get a UTF-8 file, because they expect an ANSI one.
They look the same in Notepad, and don’t have any double-byte chars.
But the actual file begins with a couple of bytes BEFORE the first letter, which you cannot see by eye.
Off topic for the OP’s question, but just to note… the same text ABCDEF in a UTF-8 formatted file is not the same file as the ANSI version of the same. Single bytes or no.
Those first few bytes are known as a BOM (byte order mark; for UTF-8 it is the three bytes EF BB BF) and are not part of the UTF-8 text itself. Whatever software is creating these files is putting it in there, and other software may use it to detect how the data is encoded.
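As a rough illustration of the BOM described above, the sketch below compares the bytes of ABCDEF with and without it. It uses Python rather than Xojo (an assumption, so that it can be run as-is), and `strip_utf8_bom` is a hypothetical helper, not part of any library.

```python
import codecs

text = "ABCDEF"

plain = text.encode("utf-8")         # no BOM: byte-for-byte the same as ASCII
with_bom = text.encode("utf-8-sig")  # Python codec that prepends the UTF-8 BOM

bom = codecs.BOM_UTF8                # the three bytes EF BB BF
has_bom = with_bom.startswith(bom)
size_difference = len(with_bom) - len(plain)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


cleaned = strip_utf8_bom(with_bom)
print(has_bom, size_difference, cleaned == plain)
```

A program that chokes on the file is reacting to those leading bytes, not to the encoding of the letters that follow them.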
[quote=375356:@Jeff Tullin]Not so fast.
I know of a couple of apps that expect a text file and choke when they get a UTF-8 file, because they expect an ANSI one.
They look the same in Notepad, and don’t have any double-byte chars.
But the actual file begins with a couple of bytes BEFORE the first letter, which you cannot see by eye.
Off topic for the OP’s question, but just to note… the same text ABCDEF in a UTF-8 formatted file is not the same file as the ANSI version of the same. Single bytes or no.[/quote]
This has nothing to do with Xojo. If your Xojo program writes the file, it will not contain the BOM unless you explicitly write it. As Jason points out, whatever program you’re using to create the file is explicitly writing a BOM into it.
Except Jason (a Xojo engineer) just said it’s not a bug.
Can you think of a case where this statement would be wrong?
To put it another way, if you write a string encoded as ASCII and later read it as UTF-8, it would be 100% correct. If you concatenate an ASCII-encoded string to UTF-8, you’d get UTF-8 and it too would be 100% correct. Below code point 128, ASCII and UTF-8 are interchangeable, and ASCII-encoded strings are always below 128, so what does it matter?
(The assumption is that you are working with strings that Xojo encoded for you internally. Defining improper encodings in code is problematic no matter the encodings involved.)
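A small sketch of both halves of that argument, in Python rather than Xojo (an assumption, for runnability). The sample strings and the choice of Latin-1 as the mislabelled encoding are illustrative, not taken from the thread.

```python
# Illustrative only: sample text and variable names are hypothetical.

# Bytes written as ASCII are always valid UTF-8 and decode to the same text.
ascii_bytes = "Hello, 123".encode("ascii")
round_trip_ok = ascii_bytes.decode("utf-8") == "Hello, 123"

# Bytes that are NOT ASCII: in Latin-1, the last letter of this word is the
# single byte 0xE9.
latin1_bytes = "caf\u00e9".encode("latin-1")

# Treating those bytes as UTF-8 is an improper encoding definition: 0xE9
# announces a multi-byte sequence, but no continuation bytes follow it.
try:
    latin1_bytes.decode("utf-8")
    mislabel_failed = False
except UnicodeDecodeError:
    mislabel_failed = True

print(round_trip_ok, mislabel_failed)
```

The trouble in the second case comes from the wrong label on non-ASCII bytes, which is a separate problem from the ASCII versus UTF-8 question.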
[quote=375746:@Kem Tekinay]Except Jason (a Xojo engineer) just said it’s not a bug.
Can you think of a case where this statement would be wrong?
[/quote]
Whether it would hurt or not, I don’t know. At best it’s inconsistent; perhaps the documentation should be changed.