Problem with fixing Windows-1252 mojibake

I have a problem fixing Windows-1252 mojibake. I know how to fix mojibake: first do a ConvertEncoding and then a DefineEncoding. I also have code to recognise different types of mojibake. But I have one user where I can't fix the mojibake, and I'm not sure what is happening. I can't even reproduce the issue myself; for me, the original data looks fine.

The original result:

1 x Bett Simple Hi 79 (ga.10708.11014) =3.906,00 EUR
Lieferzeit: 6-8 Wochen Amerikanischer Nussbaum: 180 x 200 cm Farbe: Holz natur geölt
SonderlÀnge/-gröÃe: SonderlÀnge 220 cm
Rahmenhöhe: Rahmenhöhe 32 cm

2 x Lattenrost Physiophorm S (ga.10708.11156) =702,00 EUR Lieferzeit: 3-5 Wochen GröÃe: 90 x 220 cm

I'm trying to fix this by doing:

theString = theString.ConvertEncoding(Encodings.WindowsANSI)
theString = theString.DefineEncoding(Encodings.UTF8)

However, the result is (from a different email):

1 x Nachttisch Xanadu (ga.10708.11087) =617,00 EUR
Lieferzeit: 6-8 Wochen OberflËche: auf Farbton gebeizt und seidenglËnzend lackiert (standard)
Holz: Buche auf Kirsche gebeizt
Fu?: Alu-Fu?abschluss ---------------------------------------------------------------------- Zwischensumme:617,00 EUR
abzíglich Rabattgutschein - NEU-KUNDE: -25,00 EUR
abzíglich Online-Rabatt:-18,51 EUR
Lieferung per Spedition:30,00 EUR
inkl. 19% MwSt.:96,36 EUR
Endsumme:603,49 EUR

Does anyone have an idea what I’m doing wrong here?

I’m surprised that converting to WindowsANSI and then just straight out defining as UTF8 works for you in any scenario.

DefineEncoding is for when you have a string whose encoding you already know (because you know where it came from, and the current encoding is Nil or simply incorrect), and you just want to set it so the string can be rendered correctly by the OS. When you do this on a string that was converted to WindowsANSI first, I would expect success to be random at best.

ConvertEncoding, on the other hand, changes the bytes, using the current encoding and the new encoding as a translation table, so that the text renders correctly in the new encoding.

So when I looked at your code above, my immediate reaction was that you should be using DefineEncoding first and then ConvertEncoding.
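To make the distinction concrete, here is a Python illustration (not Xojo; the codec names and APIs are Python's, used as a stand-in): "defining" an encoding keeps the bytes and only changes how they are interpreted, while "converting" actually transcodes the bytes.

```python
# Windows-1252 bytes for "für"
raw = "für".encode("cp1252")                 # b'f\xfcr'

# "DefineEncoding": keep the bytes, just declare what they mean.
defined = raw.decode("cp1252")               # 'für' -- bytes untouched, correct
wrongly_defined = raw.decode("utf-8", errors="replace")  # 'f\ufffdr' -- mojibake

# "ConvertEncoding": actually change the bytes to a new encoding.
converted = defined.encode("utf-8")          # b'f\xc3\xbcr' -- new bytes, same text
```

Declaring the wrong encoding produces garbage; converting produces different bytes that still mean the same text.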


That was my reaction too.

This is about fixing mojibake, which is foobared characters. To unfoobar them I need to do the inverse, which means ConvertEncoding followed by DefineEncoding. DefineEncoding on its own doesn't do anything to fix the mojibake.
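In Python terms (an illustration only; Xojo's method names differ), the classic un-mojibake round trip for UTF-8 text that was misread as Windows-1252 looks like this:

```python
# What 'größe' looks like after its UTF-8 bytes were decoded as Windows-1252:
garbled = "größe".encode("utf-8").decode("cp1252")   # 'grÃ¶ÃŸe'

# The inverse: ConvertEncoding back to the cp1252 bytes,
# then DefineEncoding those bytes as UTF-8.
fixed = garbled.encode("cp1252").decode("utf-8")     # 'größe'
```

The encode step corresponds to ConvertEncoding (it recovers the original bytes), and the decode step corresponds to DefineEncoding (it reinterprets them correctly).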

Looks like it is not UTF-8 read as Windows ANSI, but some other encoding like MacRoman.

Um, IIRC mojibake is the garbled text you get as the result of decoding the original text with the wrong encoding, like if the text is WindowsANSI and you instead tell it that it’s UTF8 (like you are doing above). You get incorrect characters where any of the multi-byte characters are and the replacement char � for things it can’t figure out. To me, that means that your assumption about the original provenance of the text is incorrect.


IMHO, if the ConvertEncoding ~> DefineEncoding thing worked for you in the past, you just got lucky.


I have an incorrect assumption somewhere, but I don't know where. I have checks for multiple mojibake variations. And, really, my code fixes the mojibake. I have even fixed double mojibake this way. See Recognising and fixing Mojibake - #12 by Julia_Truchsess . And no, I didn't get lucky.

@Christian_Schmitz : if this were MacRoman, then the MacRoman mojibake checker would find and fix it. I even have the raw data and still can't reproduce the problem.

' Snippet 1: define the UTF-8 text as WindowsANSI
Var utf8Chars As String = "[ä] [ö] [ü] [€] [ß]"
Var garbled As String = utf8Chars.DefineEncoding(Encodings.WindowsANSI)
Break

' Snippet 2 (run separately): the same text defined as ISOLatin1
Var utf8Chars As String = "[ä] [ö] [ü] [€] [ß]"
Var garbled As String = utf8Chars.DefineEncoding(Encodings.ISOLatin1)
Break

So the ö is clearly ISO Latin1 for a UTF-8 ö.


Windows-1252 is commonly known as Windows Latin 1 or Windows West European, or something like that. As a character encoding it differs from ISO Latin 1 (also known as ISO-8859-1) in that the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (the so-called C1 controls), whereas in Windows-1252 some of those codes are assigned to printable characters (mostly punctuation), and others are left undefined.

From https://stackoverflow.com/questions/19109899/what-is-the-exact-difference-between-windows-1252-and-iso-8859-1 . So it shouldn’t matter if I use ISO Latin 1 or Windows1252.
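The quoted difference is easy to see in a Python sketch (illustration only; Xojo's encoding names differ): the two encodings disagree only in the 0x80 to 0x9F range, and agree on the accented letters, which is why for German umlaut text the choice often doesn't matter.

```python
# Bytes in the contested 0x80-0x9F range:
b = bytes([0x80, 0x93, 0x94])

print(b.decode("cp1252"))    # euro sign and curly quotes: printable punctuation
print(b.decode("latin-1"))   # C1 control characters, nothing visible

# The umlauts live above 0x9F, where the two encodings are identical:
assert "ö".encode("cp1252") == "ö".encode("latin-1")
```

So the claim holds for text like "größe", but any text containing the euro sign or typographic quotes will round-trip differently through the two encodings.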

Mojibake happens if you do one of these things

  • assume a string in one encoding is in another encoding

Now that I come to think about it, the above list is actually comprehensive.

Of course, undoing gets more complicated if, after you made some wrong assumption about a string's encoding, you convert it to a different one.

And after you made such a conversion, you can start again with assuming a different encoding for the new text.

To undo mojibake, you need to do the steps done to the original (correct) text to get to your current text in reverse order.

So, in fact, you need to start processing your mojibake by defining the encoding first. If no conversion steps were made afterwards, you are done at this point.
If they were, you need to add more steps, each consisting of ConvertEncoding followed by DefineEncoding.

And as a last step, you probably want to convert the now correct original text to UTF-8, so it fits in with Xojo's defaults.
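The reverse-order repair described above can be sketched in Python (a hypothetical illustration, not Xojo), including the double-mojibake case mentioned earlier in the thread:

```python
original = "größe"

# Damage: misread the UTF-8 bytes as cp1252, then do it a second time.
once  = original.encode("utf-8").decode("cp1252")   # single mojibake
twice = once.encode("utf-8").decode("cp1252")       # double mojibake

# Repair: undo the steps last-first, one encode/decode pair per wrong step.
step1 = twice.encode("cp1252").decode("utf-8")      # back to single mojibake
step2 = step1.encode("cp1252").decode("utf-8")      # back to the original
assert step2 == original
```

Each encode/decode pair here plays the role of one ConvertEncoding followed by one DefineEncoding.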

In your case you probably assume (or even know) that you have UTF-8 (i.e. your string.Encoding = Encodings.UTF8), because whatever processes your emails will have had to apply the content transfer encoding from the email to the body part already, as you say in the other thread that you do not have access to the raw binary email data. If you do have access to the email's metadata, in particular to the transfer encoding declared there, it might be advantageous to convert your text back to what was in the MIME part in the first place before trying to undo mojibake.

If you indeed assume or know to have UTF-8 anyway, you may have left out that first step, and thereby caused misunderstandings.

And if you have an HTML document, you obviously need to make sure the encoding metadata in the HTML document is interpreted correctly and matches the encoding you finally converted the text to. With HTTP, this information generally comes from the protocol; with MIME, it would probably correspond to the charset declared in the MIME part header. With an isolated HTML document, that information comes from a header field, or defaults to ISO-8859-1 for HTML 4 and before and to UTF-8 for HTML 5.

@Stefan_von_Allmen Is this from AI?

I have the original data but I can’t reproduce the problem. I’ll post code with example data tomorrow.

No, this is HI (human intelligence).

I don't think AI is capable of writing something like that, at least not yet.

If the problem is repeatable on the customer’s system, but you cannot reproduce it on yours, this must mean something different happens on the customer’s system than on yours. Looking at the data will not help in that situation.

Does he run the most recent version of your program? If not, let him update, or test it yourself with his version (I assume you have revision control or some other kind of archive of “versions in the wild”).

Back to basics: I saved a mail from my Yahoo inbox and examined it. Soon after the , I see:


<meta http-equiv="content-type" content="text/html; charset=UTF-8">

I suppose in your case you get Windows-1252.

I suppose you check that and act accordingly?

Emile, Beatrix has been in this market since forever, so she knows the guts of the protocol.
What she is experiencing is something new to her, and with the raw original user content (zipped, not copied and pasted into the forum; she said "tomorrow") we could try to understand and see if we can help.


I’ve made an example:
mojibake.zip (21.5 KB)

The example has 2 parts. Test 1 has a part of the original email and assigns the correct encoding.

dim s as String = DecodeQuotedPrintable(mail)
s = DefineEncoding(s, GetInternetTextEncoding("iso-8859-15"))
s = s.ConvertEncoding(Encodings.UTF8)
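A hypothetical Python equivalent of the Xojo pipeline above (decode the quoted-printable, declare the charset from the MIME header, then transcode to UTF-8; the sample string is made up):

```python
import quopri

mail = b"Gr=F6=DFe: 90 x 220 cm"       # quoted-printable, iso-8859-15 body

raw = quopri.decodestring(mail)        # DecodeQuotedPrintable step
text = raw.decode("iso-8859-15")       # DefineEncoding step
utf8_bytes = text.encode("utf-8")      # ConvertEncoding step
```

If all three steps match what the sender actually did, the result is clean text with no mojibake to fix.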

There isn’t anything problematic in the email that I can see.

Test 2 tries to fix the Mojibake:
dim m as String = EncodingChecker.FixMisencoding(MojiBakeText)

As soon as I start copying and pasting the text around, the data changes. The email is in a file for this reason.

The mojibake fix does not work in the example, and I get the same wrong result as the user does. I already found one problem: when the characters are decomposed, my code to recognise the mojibake doesn't work.
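The decomposition problem can be sketched in Python (an illustration with `unicodedata`; the Xojo-side detection code is not shown in the thread): a literal pattern match for a mojibake marker like "Ã¶" fails if the text arrives in decomposed (NFD) form.

```python
import unicodedata

nfc = "Ã¶"                                    # precomposed mojibake marker
nfd = unicodedata.normalize("NFD", nfc)       # 'A' + combining tilde + '¶'

print(nfc == nfd)                             # False: a literal check misses NFD text
print(unicodedata.normalize("NFC", nfd) == nfc)  # True: normalize before matching
```

Normalizing the input to NFC before running the mojibake checks would make the detection robust against this.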

Not included in the example is the creation of the HTML.

I haven’t followed the entire thread, but something immediately caught my eye in your last post (which I unfortunately can’t verify myself at the moment), so here’s a quick question:

Is it really correct to decode the QuotedPrintable before the string encoding has been defined? Wouldn’t it be safer to do it the other way around?

Quoted-printable is ASCII; you only get the real data after decoding.
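That point is easy to verify in a Python sketch (illustration only): the quoted-printable form is pure ASCII on the wire, and the charset question only arises for the bytes you get after decoding it.

```python
import quopri

encoded = b"Gr=C3=B6=C3=9Fe"         # pure ASCII on the wire
raw = quopri.decodestring(encoded)   # b'Gr\xc3\xb6\xc3\x9fe' -- now non-ASCII bytes

print(encoded.isascii())             # True
print(raw.decode("utf-8"))           # the charset interpretation happens here
```

So decoding the quoted-printable before defining the string encoding is the right order: until it is decoded, there is nothing but ASCII to define.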

1 Like