ConvertEncoding eats text

Beatrix_Willius · October 11, 2015, 5:42am

My app parses mail: it gets the ContentTransferEncoding and ContentType, decodes the mail, applies the encoding and converts to UTF8. This mostly works. Now I’ve got a Hebrew HTML mail where ConvertEncoding produces an empty string. This happens in Xojo 2014r2 and 2015r2 on Mac OS 10.10.5.

Code:

dim f as FolderItem = GetOpenFolderItem("")
dim b as BinaryStream = BinaryStream.Open(f)
dim s as String = b.Read(b.Length)
s = DecodeQuotedPrintable(s)
dim theEncoding as TextEncoding = GetInternetTextEncoding("iso-8859-8")
s = DefineEncoding(s, theEncoding) 'string shows okay here as hebrew
s = ConvertEncoding(s, encodings.UTF8) 'string empty

Html:

[code]

=F7. =F9=E9=F8=E5=FA = =EE=F1. SC5015010146 =EC=EC=F7=E5=E7

[/code]

Does anyone have an idea what could cause this? Any idea for a workaround? I need the UTF8 because mails could have multiple parts that need to be added and then the data is written to Valentina.

Tim_Hare · October 11, 2015, 6:11am

Are you sure that’s the right encoding?

Eli_Ott · October 11, 2015, 6:12am

I think you should use:

dim theEncoding as TextEncoding = Encodings.ISOLatinHebrew

Eli_Ott · October 11, 2015, 6:12am

ISO-8859-8-I and ISO-8859-8 are not the same.

Beatrix_Willius · October 11, 2015, 6:36am

The encoding is correct and comes from the mail as

Content-Type: text/html; charset="iso-8859-8"

Encodings.ISOLatinHebrew at least doesn’t eat my text. But what is the difference between ISO-8859-8-I and ISO-8859-8? Google wasn’t able to give me an answer.

Eli_Ott · October 11, 2015, 7:51am

The first two links in Google give you ISO-8859-8-I and ISO-8859-8. I just read them quickly (so I’m not really certain I understand them correctly), but ISO-8859-8 seems to be in logical order (left-to-right) and ISO-8859-8-I in visual order (right-to-left).

Beatrix_Willius · October 11, 2015, 8:24am

Overlooked it: ISO-8859-8-I… The characters are in logical order. … ISO-8859-8 is sometimes in logical order (HTML, XML), and sometimes in visual (left-to-right) order (plain text without any markup).

Still doesn’t explain the empty string after ConvertEncoding.

Tim_Hare · October 11, 2015, 9:23am

ISO-8859-8-I includes additional codes. If one of those were present in your text, ConvertEncoding would fail.

Eli_Ott · October 11, 2015, 10:05am

It fails even with ???.

Beatrix_Willius · October 11, 2015, 11:42am

I’ve been working with encodings (especially the messed-up sort) for a long time now and I’ve NEVER seen this behavior. Also, the Hebrew text is shown correctly in the debugger. Still confused…

Greg_O_Lone · October 12, 2015, 10:27am

Have you tried converting to UTF-16 instead? Perhaps UTF-8 doesn’t include the Hebrew glyphs?

Beatrix_Willius · October 12, 2015, 12:38pm

Trying to convert to UTF16 also gives empty text. The debugger shows the Hebrew glyphs before I try to convert the encoding.

TomE · October 12, 2015, 12:52pm

I wonder why it should make any difference if the Unicode code point is shown as UTF-8, UTF-16, UTF-32 or whatever? These encodings are just different ways to point to the same Unicode character, or do I miss something?

Greg_O_Lone · October 12, 2015, 1:16pm

It shouldn’t. I mistyped. I was wondering if the font being used doesn’t include the glyphs, but if the debugger shows the characters before the conversion and not after, that’s probably not the case.

@Beatrix Willius you might try using the new Text type instead. You may have stumbled into an edge case that doesn’t work in String.

Beatrix_Willius · October 12, 2015, 1:23pm

@Greg: will do some testing. The code is part of a large parsing algorithm. I get heartburn when I think about converting this to Text. Not going to happen soon. Also, the data comes in without an encoding and is given one only as the very last step.
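
For readers landing here with the same problem, the workaround that surfaced earlier in the thread can be sketched as a small fallback. This is only a sketch, not a confirmed fix: `ToUTF8`, `raw` and `charsetName` are hypothetical names that nobody in the thread used. It assumes that `Encodings.ISOLatinHebrew` converts correctly where the encoding returned by `GetInternetTextEncoding("iso-8859-8")` does not, which was reported above for this one mail only.

```xojo
' Hypothetical helper, not from the thread: converts already-decoded mail
' bytes to UTF-8 and retries with another encoding if the result is empty.
Function ToUTF8(raw As String, charsetName As String) As String
  If LenB(raw) = 0 Then Return ""

  ' Look up the encoding named in the Content-Type header.
  Dim enc As TextEncoding = GetInternetTextEncoding(charsetName)
  Dim result As String
  If enc <> Nil Then
    result = ConvertEncoding(DefineEncoding(raw, enc), Encodings.UTF8)
  End If

  ' Workaround reported in the thread: with "iso-8859-8", ConvertEncoding
  ' returned an empty string, while Encodings.ISOLatinHebrew kept the text.
  ' (Assumption: this holds beyond the single mail discussed here.)
  If LenB(result) = 0 And charsetName = "iso-8859-8" Then
    result = ConvertEncoding(DefineEncoding(raw, Encodings.ISOLatinHebrew), Encodings.UTF8)
  End If

  Return result
End Function
```

It would be called as the last step of the parser, for example `s = ToUTF8(DecodeQuotedPrintable(s), "iso-8859-8")`. Because the empty-result check runs only when the input is non-empty, a real conversion failure is not confused with an empty mail part.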