TextInputStream Encoding


I’m having some trouble with the language reference. I have a text file encoded as ANSI. If I read it using TextInputStream and specify WindowsLatin1 (ANSI) as the encoding, and get the string with e.g.

Dim s As String = TextStream.ReadAll

does that string have the ANSI encoding, or is it automatically converted to the UTF-8 encoding that Xojo uses internally and that I want? Or do I have to convert anything myself?

ReadAll returns a string in the encoding you specify. It’s the same as using DefineEncoding on the string. It doesn’t verify that the string is valid in that encoding; it just blindly sets the encoding value. Note that if you don’t set any encoding, the string will have a Nil encoding, not UTF-8. If you want UTF-8, you have to explicitly set it.
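To make the difference between labeling and converting concrete, here is a minimal sketch using the classic global functions. It assumes a single byte, &hE9, which is "é" in WindowsLatin1; the variable names are made up for illustration.

```xojo
// Sketch only: one raw byte, &hE9, which is "é" in WindowsLatin1.
Dim raw As String = ChrB(&hE9)

// DefineEncoding only attaches a label; the byte itself is untouched.
Dim latin As String = DefineEncoding(raw, Encodings.WindowsLatin1)

// ConvertEncoding rewrites the bytes for the target encoding.
Dim utf8 As String = ConvertEncoding(latin, Encodings.UTF8)

// Compare the byte counts of the two strings.
System.DebugLog("WindowsLatin1 bytes: " + Str(LenB(latin)))
System.DebugLog("UTF-8 bytes: " + Str(LenB(utf8)))
```

Since "é" takes one byte in WindowsLatin1 and two in UTF-8, the two counts should differ even though both strings display the same character.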

Thanks, Tim, for your feedback. What do I need to modify in this small sample to get a true UTF-8 string?

[code]Dim s As String
Dim input As TextInputStream

input = TextInputStream.Open(f)
input.Encoding = Encodings.WindowsLatin1 // ANSI
s = input.ReadAll // DefineEncoding or ConvertEncoding?[/code]

When you say “encoded as ANSI” you mean that the string you read in is intended to be considered as a WindowsLatin1 string? Given that you set the encoding of the input to that, then I would guess you could do:

Dim txt As Text = s.ToText // txt now contains a set of Unicode 16-bit code points
Dim st As String // Has UTF-8 encoding by default
st = txt // Converts back to a UTF-8 string

The file I have is encoded in WindowsLatin1. As @Tim Hare has already written, I have to convert the data to UTF-8 myself, otherwise the default UTF-8 string encoding that any string property declaration has will be overwritten. That’s how I understood it. Am I wrong? And since Xojo has deprecated the Text type, I don’t want to use ToText.

That is what my sequence above is intended to do.

My understanding is that the encoding is just a label applied to a string telling everyone how that set of bytes is supposed to be interpreted. And that by doing this:

input.Encoding = Encodings.WindowsLatin1
s = input.ReadAll

what you have ended up with is string s containing the data, and with the encoding label set to WindowsLatin1. So far so good. But you want a UTF-8 string, so by converting to text (using .ToText) you create a Unicode thing (a series of 16-bit items, one per character), which is a universal representation. That in turn is then converted to a UTF-8 string by the final step, because the second string has the UTF-8 encoding before some data is assigned to it. That encoding tells Xojo how to make that string.

The point is that changing the encoding just changes the label, the bytes stay the same. To have the bytes changed too (otherwise your string will be wrongly interpreted) you have to tell Xojo how to do it.

Perhaps a shorter way is to do:

input.Encoding = Encodings.WindowsLatin1
s = ConvertEncoding(input.ReadAll, Encodings.UTF8)

But I’m not sure.

Perhaps Tim Hare will add something more.

Sorry - not sure what you mean here.

Do you really need to convert the string to UTF8? Xojo will handle your WindowsLatin1 string just fine.

The answer to the original question is: Yes, the string will be WindowsLatin1, not UTF8.

Once you have read the string in with the proper encoding, if you do want to change the encoding, then use ConvertEncoding. But let’s not try to complicate matters by insisting on using UTF8. The main thing is to get a string with the correct encoding. It doesn’t really matter what that encoding is, as long as it is correct.
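Put together, the two steps might look like the sketch below. It assumes f is a valid FolderItem pointing at the WindowsLatin1 file, and it leaves out error handling.

```xojo
// Sketch only: f is assumed to be a valid FolderItem for the file.
Dim input As TextInputStream = TextInputStream.Open(f)

// Step 1: tell the stream how the bytes on disk are to be interpreted.
input.Encoding = Encodings.WindowsLatin1
Dim s As String = input.ReadAll // s is now labeled WindowsLatin1
input.Close

// Step 2 (only if needed): re-encode the correctly labeled string.
Dim utf8 As String = ConvertEncoding(s, Encodings.UTF8)
```

The order matters: ConvertEncoding can only produce correct UTF-8 bytes if the source string already carries the right encoding label.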

Thank you for your comments, Tim. The only thing that matters to me is that the data ends up in UTF-8, because it is written to an SQLite database after it has been read, and the database is encoded in UTF-8. I now use

s = input.ReadAll.ConvertEncoding(Encodings.UTF8)

Thank you both, Tims. ;)

Yes, thanks to Tim Hare too. It took me a long time to get to a point where I believe I understand the String/Text difference.