Encoding confusion

DaveS · June 18, 2016, 4:39pm

I have a text file coming in from an external source…
If I load this into a commercial Text Editor… it says it is a UTF-8 file and displays the contents exactly as expected.

HOWEVER, when I read it (TextInputStream) into an XOJO app, and split things up (based on SPACES and QUOTES)
The resulting file still loads into the same TextEditor as UTF-8, but now things look different

for example :
BEFORE I would see “±”
AFTER I see “Â±”

Now I do have code that loops over a line of text one character at a time

Just an example… “t=t+c” has more decision logic involved

    for i=1 to len(s)
      c=mid(s,i,1)
      t=t+c
    next i

BUT, “C” should be a CHARACTER… not a “BYTE”… (and yes I’m using STRINGS, not TEXT)
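
To make the character vs. byte distinction concrete outside Xojo, here is a minimal Python sketch (not Xojo code; the sample string is made up for illustration). Walking the string by character sees “±” once, while walking its UTF-8 bytes sees two separate values for it.

```python
# Made-up sample line containing one non-ASCII character.
s = "t=±5"

# The same text as UTF-8 bytes, as it would sit in a file.
data = s.encode("utf-8")

# Character-wise view: "±" is a single element.
chars = list(s)

# Byte-wise view: "±" occupies two bytes (0xC2 0xB1) in UTF-8.
byte_values = list(data)

print(len(chars), chars)
print(len(byte_values), [hex(b) for b in byte_values])
```

If a per-character loop ever behaves like the second view, the string is probably being treated as raw bytes rather than as encoded text.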

Norman_P · June 18, 2016, 4:51pm

Did you define the encoding anywhere after you read the data in and before you manipulate it ?

DaveS · June 18, 2016, 4:54pm

No…
I tried adding

s=ConvertEncoding(s,Encodings.UTF8)

immediately AFTER reading it (s=txt.READALL)
but it didn’t change the results
I also tried DEFINEENCODING… but it did nothing to affect the result either

    textIN=TextInputStream.open(filePath)
    s=textIN.ReadAll
    textIN.close
    //
    s=ReplaceAll(s,EndOfLine,EndOfLine.UNIX)
    s=ReplaceAll(s,chr(&h0b)," ")
    s=ReplaceAll(s,chr(&h0c)," ")
    //s=ReplaceAll(s,chr(&h00)," ")
    s=ReplaceAll(s,chr(&h09),"   ") // 3 spaces
    //s=ConvertEncoding(s,Encodings.UTF8)  // with or without this line (or as DefineEncoding) makes no difference
    v=Split(s,EndOfLine.UNIX)

I then analyze each element in V

Norman_P · June 18, 2016, 5:11pm

ConvertEncoding is not correct unless the string already has an encoding defined
Start off with the data in a defined encoding with something like

  textIN=TextInputStream.open(filePath)
  textIn.Encoding = Encodings.UTF8
  s=textIN.ReadAll
  textIN.close

which is roughly the equivalent of

  textIN=TextInputStream.open(filePath)
  s=textIN.ReadAll
  textIN.close
  s = DefineEncoding(s, Encodings.UTF8)

Converting or defining the encoding AFTER changing the data with replacealls will give you grief
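
The same idea, stating the encoding at the moment the data is read, can be sketched in Python as an analogy (this is not Xojo code; the temporary file and its contents are made up for the demonstration). Reading in binary mode gives bytes with no encoding attached, while passing an explicit encoding gives properly decoded text.

```python
import os
import tempfile

# Create a small UTF-8 file to stand in for the external source (made-up data).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write("5 ± 1\n".encode("utf-8"))
    path = f.name

# Binary read: raw bytes, nothing says how to interpret them.
with open(path, "rb") as f:
    raw = f.read()

# Text read with the encoding declared up front.
with open(path, encoding="utf-8") as f:
    text = f.read()

os.remove(path)

print(len(raw), len(text))
```

The byte count is one higher than the character count because “±” takes two bytes in UTF-8.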

DaveS · June 18, 2016, 5:20pm

Thanks… but that too had no effect… but one thing I just noticed…
The INCOMING data (per 3rd party texteditor) is UTF-8
HOWEVER the OUTPUT file is NOT… it is ISO something

the output file is TEXTOUTPUTSTREAM using WRITE commands

Isn’t everything supposed to be UTF-8 unless specifically told otherwise?

Norman_P · June 18, 2016, 5:23pm

not when it comes from an outside source like a TCP socket, database, file, etc
this isn’t new

DaveS · June 18, 2016, 5:43pm

Norman_P wrote: “not when it comes from an outside source like a TCP socket, database, file, etc. this isn’t new”

Sorry… that I knew… I mean if it becomes UTF8 it stays UTF8 unless told otherwise.
And a new string inherits its encoding (or lack thereof) from its source, right?

dim t as string=aUTF16string  // t is also going to be UTF16?

Norman_P · June 18, 2016, 5:49pm

DaveS wrote: “Sorry… that I knew… I mean if it becomes UTF8 it stays UTF8 unless told otherwise.
And a new string inherits its encoding (or lack thereof) from its source, right

dim t as string=aUTF16string // t is also going to be UTF16?”

You can break that
A string with an encoding + a “string” with nil encoding -> nil encoded result

Fundamentally these sort of issues are what prompted “text”

DaveS · June 18, 2016, 5:50pm

Ok… Sorry Norm… your suggestion DID work… Thank you
The output I was creating was an HTML file, and I didn’t tell IT to use UTF-8

<meta charset="utf-8"/>

After adding that line, I now see what I expected, and all is well

Tim_Hare · June 18, 2016, 6:40pm

No. UTF-8 is not the default on any OS. It is more correct to say that everything (external to Xojo) is not UTF-8 unless specifically told so. As you saw, the UTF-8 that Xojo wrote out was interpreted as something else (the OS default) until you explicitly included the meta tag.
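
The “±” turning into “Â±” seen earlier in the thread is consistent with this. A minimal Python sketch (not Xojo code) shows the effect: the bytes stay the same, only the reader's assumption changes. The choice of cp1252 as the fallback is an assumption for illustration; the thread only says the viewer picked “ISO something”.

```python
# A fragment of the HTML output (made-up sample content).
page = "<p>5 ± 1</p>"

# What ends up on disk when the text is written as UTF-8.
data = page.encode("utf-8")

# Reader that is told the file is UTF-8 (e.g. via a meta charset tag).
as_declared = data.decode("utf-8")

# Reader that is told nothing and falls back to a single-byte encoding
# (cp1252 is an assumed fallback, not confirmed by the thread).
as_guessed = data.decode("cp1252")

print(as_declared)
print(as_guessed)
```

The second result shows “Â±” because the two UTF-8 bytes of “±” (0xC2 0xB1) are each read as a separate single-byte character.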