ASCII to UTF8 Follies

John_McKernon · September 15, 2021, 8:55pm

Once again, I am wrestling with converting encodings. I thought it was working well, thanks to Joe Strout’s GuessEncoding function, but today a customer sent me a file that upends everything. I’m working in macOS.

The issue is the degree symbol, which is generated on Windows by pressing Alt+176 within Excel (or any other app, according to the user). When I read a file containing this, the content correctly shows a degree symbol when viewed in Xojo’s inspector with ASCII encoding. If I view the content as UTF8, Xojo (on macOS) shows a diamond with a question mark inside it. When the content is printed, an infinity symbol is drawn instead of the degree symbol.

The same thing happens when I use .ConvertEncoding to convert the original ASCII string to UTF8. The degree symbol prints and displays onscreen as infinity. It does not convert to the Unicode degree symbol, which is what I was expecting.

Here’s the relevant code:

'Fi is the FolderItem passed to the method

Dim TextInput As TextInputStream
Dim FileChunk As String
Dim ResultStr As String

TextInput = TextInputStream.Open(Fi)    'Open the file
TextInput.Encoding = Encodings.ASCII    'Because I know the file is ASCII
FileChunk = TextInput.ReadAll    'Get the file content

'Look at FileChunk in the Inspector, it shows the degree symbol and says encoding is ASCII.

ResultStr=FileChunk.ConvertEncoding(Encodings.UTF8)

Look at ResultStr in the Inspector, it says it’s UTF8, and has a black diamond instead of the degree symbol. Printing ResultStr or drawing it in a graphics instance shows it as an infinity symbol.

I also tried using a TextConverter:

Dim tc As TextConverter
tc = GetTextConverter(GetTextEncoding(&h0600),GetTextEncoding(&h0600))
Dim ResultStr As String
ResultStr = tc.convert(FileChunk)

ResultStr still shows as the black diamond and prints infinity, not the degree symbol. Clearly I don’t understand what ConvertEncoding and TextConverter are designed to do.

Thoughts? Suggestions?

John

Tim_Hare · September 15, 2021, 9:10pm

It’s not ASCII, it’s WindowsLatin1.

TimStreater · September 15, 2021, 9:22pm

This is not an ASCII character, so it’s unsurprising that it doesn’t convert. See the upper table at:

Actual ASCII, BTW, is already UTF8.
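
To make that concrete, here is a small Xojo sketch. The degree sign in the file is the single byte &hB0 (176), which is outside the 0 to 127 range that ASCII defines. The variable names are made up for illustration. Tying this to the infinity symbol in the original post is an assumption on my part: &hB0 is the infinity sign in MacRoman, which would be consistent with what gets drawn on macOS.

```xojo
' One raw byte with no encoding attached (names here are illustrative only)
Dim raw As String = ChrB(&hB0)

' The same byte means different things depending on the encoding you declare
Dim asWin As String = raw.DefineEncoding(Encodings.WindowsLatin1)   ' degree sign
Dim asMac As String = raw.DefineEncoding(Encodings.MacRoman)        ' infinity sign

' On its own, &hB0 is not a valid UTF8 sequence, hence the black diamond
Dim validUTF8 As Boolean = Encodings.UTF8.IsValidData(raw)          ' False
```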

Brandon_Warlick · September 15, 2021, 9:38pm

This reminds me… I ran across a bug in GuessEncoding. I think I got my copy of the method from @Kem_Tekinay’s website.

Look for this line:

elseif b0=&hEF and b1=&hBB and b1=&hBF then

It should be:

elseif b0=&hEF and b1=&hBB and b2=&hBF then
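
Why the typo matters: b1 cannot equal both &hBB and &hBF, so the first version of the condition is never true and a UTF8 byte order mark is never detected. Below is a minimal sketch of the corrected check as a standalone function. HasUTF8BOM is a hypothetical name for illustration, not part of the actual GuessEncoding source.

```xojo
' Hypothetical helper: returns True if s starts with the UTF8 byte order mark EF BB BF
Function HasUTF8BOM(s As String) As Boolean
  ' A BOM needs at least three bytes
  If s.Bytes < 3 Then Return False

  ' Look at the raw bytes regardless of the string's declared encoding
  Dim mb As MemoryBlock = s
  Return mb.Byte(0) = &hEF And mb.Byte(1) = &hBB And mb.Byte(2) = &hBF
End Function
```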

John_McKernon · September 15, 2021, 9:57pm

Umm, those two lines look the same…?

Brandon_Warlick · September 15, 2021, 9:59pm

Look closer

John_McKernon · September 15, 2021, 10:00pm

OMG, you’re right. I’m having eye problems these days, that one was tricky. Thanks!

Tim_Hare · September 15, 2021, 11:32pm

In case I was too terse earlier, set the encoding to Encodings.WindowsLatin1 when you read it in (instead of Encodings.ASCII) and it should convert just fine.
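
As a sketch, the code from the first post with only the encoding changed would look roughly like this. It assumes Fi is the FolderItem passed to the method and that the file really is WindowsLatin1.

```xojo
' Fi is the FolderItem passed to the method, as in the original post
Dim TextInput As TextInputStream
Dim FileChunk As String
Dim ResultStr As String

TextInput = TextInputStream.Open(Fi)             ' Open the file
TextInput.Encoding = Encodings.WindowsLatin1     ' Declare how the bytes on disk are encoded
FileChunk = TextInput.ReadAll                    ' Get the file content
TextInput.Close

' Now the conversion knows what byte &hB0 means and emits the UTF8 bytes &hC2 &hB0
ResultStr = FileChunk.ConvertEncoding(Encodings.UTF8)
```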

John_McKernon · September 15, 2021, 11:46pm

Having fixed that, unfortunately FNGuessEncoding doesn’t recognize the text as WindowsLatin1, which it turns out is the actual encoding. So I can either reinstate a user pref from several years ago where the user manually selected the encoding or figure out

Kem_Tekinay · September 16, 2021, 3:35am

Unfortunately, there is no reliable way to tell one single-byte encoding from another from the bytes alone.
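
UTF8 is the one case that can be checked, because not every byte sequence is valid UTF8. One possible approach, sketched below, is to accept the data as UTF8 when it validates and otherwise fall back to an encoding chosen by the user. DecodeToUTF8 and userChoice are hypothetical names, and the sketch assumes userChoice is not Nil.

```xojo
' Hypothetical helper: raw holds the file's bytes, userChoice is the fallback encoding
Function DecodeToUTF8(raw As String, userChoice As TextEncoding) As String
  ' Valid UTF8 (which includes plain ASCII) only needs its encoding declared
  If Encodings.UTF8.IsValidData(raw) Then
    Return raw.DefineEncoding(Encodings.UTF8)
  End If

  ' Otherwise trust the user's choice, then convert the text to UTF8
  Return raw.DefineEncoding(userChoice).ConvertEncoding(Encodings.UTF8)
End Function
```

This cannot tell WindowsLatin1 from MacRoman or any other single-byte encoding; it only separates "valid UTF8" from "everything else".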

John_McKernon · September 16, 2021, 4:18am

Yeah, I decided to offer them a popup menu to choose an encoding and they can take it from there. Not as user friendly as I’d like, but it’ll get the data.

Thanks!