TextInputStream Encoding

  1. Martin T

    Sep 12 Pre-Release Testers Germany

    Hello,

    I'm having some trouble understanding the language reference. I have a text file encoded as ANSI. If I read it using TextInputStream and specify WindowsLatin1 (ANSI) as the encoding, what encoding does the string have that I get with, e.g.,

    Dim s As String = TextStream.ReadAll

    Does it keep the ANSI encoding, or is it automatically converted to UTF-8, which Xojo uses internally and which is what I want? Or do I have to convert anything myself?

  2. Tim H

    Sep 12 Pre-Release Testers Portland, OR USA

    ReadAll returns a string in the encoding you specify. It's the same as using DefineEncoding on the string. It doesn't verify that the string is valid in that encoding; it just blindly sets the encoding value. Note that if you don't set any encoding, the string will have a Nil encoding, not UTF-8. If you want UTF8, you have to explicitly set it.
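
    A small sketch for checking which label a string actually carries after reading. It assumes input is an already opened TextInputStream, as in the question; String.Encoding returns Nil when no encoding is defined for the string:

    Dim s As String = input.ReadAll
    Dim enc As TextEncoding = s.Encoding
    If enc Is Nil Then
      // No label attached: the bytes have no defined interpretation yet
      System.DebugLog("s has no encoding defined")
    Else
      // InternetName is the registered name of the encoding label
      System.DebugLog("s is labelled as " + enc.InternetName)
    End If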

  3. Martin T

    Sep 13 Pre-Release Testers Germany

    Thanks for your feedback, Tim. What do I need to change in this small sample to get a true UTF-8 string?

    Dim s As String
    Dim input As TextInputStream
    
    input = TextInputStream.Open(f)
    input.Encoding = Encodings.WindowsLatin1 // Ansi
    s = input.ReadAll // DefineEncoding or ConvertEncoding ?
    input.Close

  4. Tim S

    Sep 13 Canterbury, UK

    When you say "encoded as ANSI", do you mean that the string you read in should be interpreted as a WindowsLatin1 string? Given that you set the encoding of the input to that, I would guess you could do:

    Dim t As Text = s.ToText          // t now holds Unicode code points
    Dim st As String                  // Has UTF-8 encoding by default
    st = t                            // Converts back to a UTF-8 string.

  5. Martin T

    Sep 13 Pre-Release Testers Germany

    The file I have is encoded in WindowsLatin1. As @Tim Hare has already written, I have to convert the data to UTF-8 myself, otherwise the default UTF-8 string encoding that any string property declaration has will be overwritten. That's how I understood it. Am I wrong? Since Xojo has deprecated the Text type, I don't want to use ToText.

  6. Tim S

    Sep 13 Canterbury, UK

    @MartinTrippensee The file I have is encoded in WindowsLatin1. As @Tim Hare has already written, I have to convert the data to UTF-8 myself

    That is what my sequence above is intended to do.

    otherwise the default UTF-8 string encoding that any string property declaration has will be overwritten.

    My understanding is that the encoding is just a label applied to a string telling everyone how that set of bytes is supposed to be interpreted. And that by doing this:

    input.Encoding = Encodings.WindowsLatin1
    s = input.ReadAll

    what you end up with is a string s containing the data, with its encoding label set to WindowsLatin1. So far so good. But you want a UTF-8 string, so by converting to Text (using .ToText) you create a Unicode value (a series of Unicode code points), which is a universal representation. That in turn is converted to a UTF-8 string by the final step, because the second string has the UTF-8 encoding before any data is assigned to it. That encoding tells Xojo how to build the string.

    The point is that changing the encoding only changes the label; the bytes stay the same. To have the bytes changed too (otherwise your string will be misinterpreted), you have to tell Xojo how to do it.

    Perhaps a shorter way is to do:

    input.Encoding = Encodings.WindowsLatin1
    s = ConvertEncoding(input.ReadAll, Encodings.UTF8)

    But I'm not sure.

    That's how I understood it. Am I wrong?

    Perhaps Tim Hare will add something more.

    Since Xojo has deprecated the Text type ...

    Sorry - not sure what you mean here.
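
    The label-versus-bytes point above can be illustrated with a single character. This is only a sketch: the variable names are made up for illustration, and it assumes a Xojo version that has the String extension methods DefineEncoding, ConvertEncoding and Bytes:

    Dim raw As String = ChrB(&hE4)     // one raw byte, E4, which is "ä" in WindowsLatin1
    Dim labelled As String = raw.DefineEncoding(Encodings.WindowsLatin1)
    Dim converted As String = labelled.ConvertEncoding(Encodings.UTF8)

    // labelled.Bytes is 1: DefineEncoding only attached a label, the byte is still E4
    // converted.Bytes is 2: ConvertEncoding rewrote the data, "ä" is C3 A4 in UTF-8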

  7. Tim H

    Sep 13 Pre-Release Testers Portland, OR USA

    Do you really need to convert the string to UTF8? Xojo will handle your WindowsLatin1 string just fine.

    The answer to the original question is: Yes, the string will be WindowsLatin1, not UTF8.

    Once you have read the string in with the proper encoding, if you do want to change the encoding, then use ConvertEncoding. But let's not try to complicate matters by insisting on using UTF8. The main thing is to get a string with the correct encoding. It doesn't really matter what that encoding is, as long as it is correct.

  8. Martin T

    Sep 13 Pre-Release Testers Germany

    Thank you for your comments, Tim. It's only important to me that the data ends up in UTF-8, because it is written to an SQLite database after it has been read, and the database uses UTF-8. I now use

    s = input.ReadAll.ConvertEncoding(Encodings.UTF8)

    Thank you both, Tims. ;)
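
    Put together, the approach from this thread might look like the following sketch. ReadLatin1AsUTF8 is a hypothetical helper name, and returning an empty string for a missing file is an assumption made only to keep the example short:

    Function ReadLatin1AsUTF8(f As FolderItem) As String
      // Hypothetical helper: returns "" if there is no file to read
      If f Is Nil Or Not f.Exists Then Return ""

      Dim input As TextInputStream = TextInputStream.Open(f)
      input.Encoding = Encodings.WindowsLatin1   // how the bytes on disk are to be interpreted
      // ReadAll returns a WindowsLatin1-labelled string; ConvertEncoding re-encodes it as UTF-8
      Dim s As String = input.ReadAll.ConvertEncoding(Encodings.UTF8)
      input.Close
      Return s
    End Function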

  9. Tim S

    Sep 13 Canterbury, UK

    Yes, thanks to Tim Hare too. It took me a long time to get to a point where I believe I understand the String/Text difference.
