Recognizing badly encoded data

Some enterprising soul sends emails with a Greek encoding but labels it as UTF-8. This ends up as:

[quote]Σε ??νέ?εια ?η? ενημέρ??η? για ?ον νέο λογ???ο, ε?ι??νά??ε?αι μία ?ελίδα
με κά?οια δείγμα?α α?? ?α αρ?εία ?ο? θα βρεί?ε ??ον ??ε?ικ? ?ύνδε?μο ?ο[/quote]

I can fix the wrongly encoded data with the Python library ftfy (I still need to wrap that into an app, but that's another problem).

How do I recognize when the encoding is bad? "Encodings.UTF8.IsValidData(myData)" returns true, not false.

Well, this may be valid UTF-8: someone may have read data in one encoding as some other encoding and then encoded the result as UTF-8.

It’s not mojibake (I love this word). The encoding simply is wrong and I know how to fix it. I just need to know when to run the fixer code.

It is unlikely that bad encoding would be valid UTF-8.

It's unlikely, but IsValidData is true and not false for UTF-8. You can try it with the above text.

Well, you can always use the hard manual way: examine the incoming data and do a little parsing.

You can build a set of sample fragments from known bad messages and then check whether those samples appear in the incoming data.

If they do, you can use your app to fix the complete incoming string.

That check can be an internal part of your app or an external one.
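For example, a rough sketch of that idea (the sample fragments and the variable incomingData are only placeholders; you would collect real garbled fragments from your own mails):

[code]' Compare incoming text against a set of known bad fragments
dim knownBadSamples() as string
knownBadSamples.Append "Σε ??νέ?εια"   ' placeholder garbled fragment
knownBadSamples.Append "ενημέρ??η?"    ' placeholder garbled fragment

dim needsFixing as Boolean = false
for each sample as string in knownBadSamples
  if InStr(incomingData, sample) > 0 then
    needsFixing = true
    exit
  end if
next

if needsFixing then
  ' run the fixer (e.g. the ftfy call) on the complete incoming string
end if[/code]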

@Bogdan Pavlovic : I’m parsing emails. This particular email says:

[quote]------=_20180115103056_91465
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit[/quote]

Usually I fish out the encoding value either from the Content-Type or the Content-Transfer-Encoding header. Additionally, I can check the HTML. Then I set the encoding:

[code]if theCharset = "iso-8859-1" then
  currentBody = DefineEncoding(currentBody, Encodings.ISOLatin1)
elseif theCharset = "macintosh" then
  currentBody = DefineEncoding(currentBody, Encodings.MacRoman)
elseif theCharset = "iso-8859-2" then
  currentBody = DefineEncoding(currentBody, Encodings.ISOLatin2)
elseif theCharset = "Windows-1252" then
  currentBody = DefineEncoding(currentBody, Encodings.WindowsANSI)
' and so on
end if[/code]
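For reference, this is roughly how the charset value could be fished out of a Content-Type header line (a simplified sketch; theHeaderLine is a placeholder variable and the parsing ignores folded headers and comments):

[code]' theHeaderLine is assumed to hold e.g.: Content-Type: text/plain; charset="utf-8"
dim theCharset as string = ""
dim p as integer = InStr(theHeaderLine, "charset=")
if p > 0 then
  theCharset = Mid(theHeaderLine, p + Len("charset="))
  theCharset = ReplaceAll(theCharset, """", "")    ' strip the quotes
  theCharset = Trim(NthField(theCharset, ";", 1))  ' drop any trailing parameters
end if[/code]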

The problem with my customer's mails is that they say UTF-8, so my code treats them as UTF-8. But the text is actually encoded as Greek. I can fix the encoding problem with ftfy:

[code]' Write the mail body to a temporary file so ftfy can read it
dim theFolderitem as FolderItem = folderItemUtils.getTempFolderitemUU
dim theBinStream as BinaryStream
theBinStream = BinaryStream.Create(theFolderitem)
theBinStream.Write(currentBody)
theBinStream.Close

' Run ftfy on the file and take its output as the repaired body
dim theShellScript as string
dim theShell as new Shell

theShellScript = "/usr/local/bin/ftfy " + theFolderitem.ShellPath
theShell.Execute theShellScript
if theShell.ErrorCode = 0 then
  currentBody = theShell.Result
end if[/code]

All of this is not the problem. I only need to know WHEN the text is screwed up.

Well, start from the beginning, when the e-mail message is composed and sent by the user.

Check which mail client they are using to send e-mail; maybe you can then see where the problem starts.

Also try to determine the system settings they are using on their workstations.

Consider the server-side mail server settings and the so-called transport hub on the server end, and how it processes incoming mails before storing them in the DBMS.

Be aware that the server end can also introduce a glitch when the data is stored in the mail server's database, and cause encoding issues too.

Regarding encoding and Xojo:
Personally, I found a big issue where Base64 encoding doesn't work well if you want to transfer data from/to Mac OS and iOS using Xojo.

To this day that issue doesn't have a 100% working solution, and the only way was to use hex encoding, which is larger in data size.

This means you could expect issues and problems in data encoding conversion too!

A friendly suggestion: if you are good with Python, try this in Python first and see how far you get, then go back to Xojo app development and build it all there.

Uh, that isn’t really helping. I’m getting the data from Mail as is. I know that Mail tends to change the data - most likely this is why the data is screwed up. But this doesn’t matter because the customer uses Mail and the data is what it is.

I have an algorithm for parsing the mail parts. I just need to figure out how to recognise bad data here.

Since you have the mails and full access to their contents, you can inspect the data to see which messages come with Greek letters. Then, even if it's a false positive, you can check the parsing action on your end, since you know how to fix it when it arrives with bad encoding/contents.

  • Also try to use a byte-safe conversion.

  • Make a table of all Greek letters and compare it against the mail contents as an extra check inside your code :slight_smile: (a sketch of this idea follows below)
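A rough sketch of that check (ContainsGreek is just a suggested name):

[code]' Returns True if the decoded text contains characters from the Greek Unicode block
Function ContainsGreek(theText as string) as Boolean
  dim codePoint as integer
  for i as integer = 1 to Len(theText)
    codePoint = Asc(Mid(theText, i, 1))
    ' Greek and Coptic block: U+0370 to U+03FF
    if codePoint >= &h370 and codePoint <= &h3FF then
      return true
    end if
  next
  return false
End Function[/code]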

You didn't explain how you are getting the mail contents before doing the encoding conversion with the code above, since even in a DBMS the data is stored with some encoding too.

Can you post some sample data as plain text files (provide links for download), not as strings pasted here:

  • A sample of the as-is contents of the mail body before data conversion (a dump from the mail server DBMS), and

  • A sample of the same data after the conversion has been applied to the first one, and

  • Optionally, a sample corrected on your end, as it should look when the data is a false positive.

The problem is that these particular mails aren't coming with Greek characters. And of course, I want my code to be more generic. At this point there is no DBMS involved because the data is parsed before it is written to the database.

I'm now experimenting with NSLinguisticTaggerMBS. If I examine the words in the first 100 characters, the tagger recognises no real words because the words are only one character long. Apart from "ro" there is no dominant language found.

Have you tried Joe Strout’s GuessEncoding method? I got it from XDevSpot a long time ago, no idea where it might be now.

@John McKernon: I have this code, but it's for very obvious issues and mostly checks the BOM.

If you only need to detect that the file is incorrectly encoded, then the way I see it, aside from manually examining the file, there are two options.

First, look for byte sequences that are not legal in a utf-8 file. When the file is imported into a text or string variable, I assume that Xojo will convert these into the Unicode replacement character &uFFFD. You can search for this, and if found, you know the file is bad, and where it’s bad.
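A minimal sketch of that first check (using the currentBody variable from the earlier posts as the imported text):

[code]' If Xojo replaced illegal byte sequences, the replacement character will be present
dim badPos as integer = InStr(currentBody, &uFFFD)
if badPos > 0 then
  ' the data contained byte sequences that are not legal UTF-8;
  ' badPos is the position of the first replacement character
end if[/code]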

Second, it’s possible that the file was somehow turned into utf-8 in such a way that there are no illegal byte sequences. In this case it is a valid utf-8 file, even though the characters are wrong. For this you could do a statistical analysis of the characters in the file. For example, for any particular language, the majority of characters should fall into a narrow group of codepoints. For Latin coding they should be in the basic Latin group; for Greek they should fall into the Greek group; for Cyrillic they should fall into the Cyrillic group; and so on. If they seem to be randomly distributed across many different language groups, or if there is a strange ratio of vowels to consonants, then there is a high probability of incorrect encoding.

Years ago, I used this statistical technique to break a simple substitution encryption scheme (nothing malicious).

Not sure if this is what you’re looking for, but you could give it a try. Following on what I wrote above, it does a simple statistical analysis of the text, and tries to guess whether it’s a valid utf-8 encoding. If so, it tries to guess the language.

[code]Function GuessLanguage(myData As String) as String
  dim myDataLen As Integer = len(myData)
  dim p As Integer = InStr(myData,&uFFFD)
  if p>0 then return "Illegal character at position "+str(p)
  dim e,e2,stDev,skippedCharRatio As Double = 0
  dim codePoint,nChars,mean,nonNumeric As Integer = 0
  dim language As String = ""
  'calculate sum and sum of squares of codePoints
  for i As Integer = 1 to myDataLen
    codePoint=asc(mid(myData,i,1))
    'skip numerals, punctuation and chars above &h700
    if codePoint > 64 then
      nonNumeric=nonNumeric+1
      if codePoint< &h700 and not(codePoint>&h7F and codePoint<&hC0) then
        e=e+codePoint
        e2=e2+codePoint*codePoint
        nChars=nChars+1
      end if
    end if
  next
  'Now calculate mean and standard deviation
  e=e/nChars
  e2=e2/nChars
  stDev = sqrt(e2-e^2)
  mean = e+0.5 'round to integer
  skippedCharRatio=nonNumeric/nChars
  if skippedCharRatio>2 then
    'This happens if there is a large percentage of skipped characters
    language = "Unknown"
  ElseIf mean >= &h0041 and mean <= &h007A then
    language = "Latin"
  ElseIf mean >= &h007B and mean <= &h00AF then
    'This occurs if a large percentage of characters are in the &h0080..&h00FF range
    'If the mean value is close to the low end of the 7A..FF range then it may just
    'be highly accented Latin
    language = "Accented Latin"
  ElseIf mean >= &h00B0 and mean <= &h00FF and stDev <40 then
    'This occurs if a large percentage of characters are in the &h0080..&h00FF range
    'If the standard deviation is small and the mean is in the middle or
    'upper end of this range, then mis-encoding is more likely.
    language = "Mis-encoded"
  ElseIf mean >= &h007B and mean <= &h00FF then
    'If the mean is in the range &h0080..&h00FF, and neither of the two preceding
    'conditions apply, then there's not enough info to make a guess.
    language = "Unknown"
  ElseIf mean >= &h0391 and mean <= &h03c9 then
    language = "Greek"
  ElseIf mean >= &h0410 and mean <= &h052F then
    language = "Cyrillic"
  ElseIf mean >= &h0530 and mean <= &h058F then
    language = "Armenian"
  ElseIf mean >= &h0590 and mean <= &h05FF then
    language = "Hebrew"
  ElseIf mean >= &h0600 and mean <= &h06FF then
    language = "Arabic"
  'More ElseIf cases can be included here for other languages
  else
    language = "Unknown"
  end if
  language=language+", Mean = &h"+hex(mean)+", StDev = "+str(stDev) _
    +", SkpRatio = "+str(skippedCharRatio)+EndOfLine
  return language
End Function[/code]

In the analysis, it skips codepoints up to &h40 and codepoints &h80…BF, which include common punctuation, numerals, and less common symbols, none of which would provide any useful information. It also skips very high codepoints for languages which it doesn't try to identify. In the result, the standard deviation should be of the same order of magnitude as the number of characters in the applicable alphabet; the smaller the StDev value, the better. Latin, Cyrillic and Greek alphabets have about 25 letters on average; multiplied by 2 to include upper and lower case, that gives a count of 50. The standard deviation should be smaller than this, unless there is a very large number of accented characters. SkpRatio is the number of characters that were skipped because they were outside the range of interest, divided by the number of characters that were in the range of interest. Again, the smaller the better. So the values of StDev and SkpRatio should be used to determine how good the guess is. The most difficult case is distinguishing heavily accented Latin text from mis-encoded text, so this area may need some fine tuning.
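For example, a quick way to wire this into the mail parsing (currentBody stands for the parsed mail body from the earlier posts):

[code]dim verdict as string = GuessLanguage(currentBody)
' verdict starts with the guess ("Latin", "Greek", "Mis-encoded", ...),
' followed by the Mean, StDev and SkpRatio diagnostics
if Left(verdict, 11) = "Mis-encoded" then
  ' this would be the moment to run the ftfy fixer
end if[/code]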

The Custom Edit Field class that is floating around (I think Thomas Tempelmann took it over) has a GuessEncoding method in it.

@Robert Weaver: thanks, that looks like an interesting idea. I’ll have a look.

I should clarify that using the snippet of text that you included in your original post, this function will return “mis-encoded,” not “Greek.”

Most of the Windows and ISO encodings are 256 character sets with the lower 128 codepoints being the standard ASCII Latin alphabet, and the codepoints from &h80 to &hFF representing the alphabet of the language in question. So, if we see a large number of codepoints in the &h80 to &hFF range, we only know that it’s probably a mis-encoded language, but it’s not easy to tell which one. On the other hand, if it’s properly encoded utf-8, then the codepoints will be distinctly grouped in the correct Unicode codepoint ranges which are easy to detect.

I mentioned that the most difficult problem is in differentiating between heavily accented Latin and a mis-encoding. Heavily accented Latin will include many characters in the &h80 to &hFF range, but should still include many characters in the &h40 to &h7F range which will draw the mean value downwards. On the other hand, if this is, for example, ISO 8859-7 (Greek) encoding, masquerading as utf-8, then we would see most of the characters in the &hC0 to &hFF range which would have mean value closer to &hE0, and a smaller standard deviation, which we should be able to detect.

As I mentioned, the text snippet from your first post correctly resulted in “mis-encoded.” I also tried it with other samples of text, and it correctly identified them. For heavily accented Latin I used a sample written in Czech from cs.wikipedia.org, as I couldn’t think of any other Latin language that would be more heavily accented. It correctly returned “Accented Latin.”

Well, if you have code like this:

[code]dim t as string = "??? ?? ???, ??? ??? ???"

t = ConvertEncoding(t, encodings.UTF8)
t = DefineEncoding(t, Encodings.ISOLatin1)

Break

// returns
// ??? ?η ?ικι?αίδεια, ?ην ελεύθερη εγκ?κλο?αίδεια
[/code]

So the text from Beatrix is probably just UTF-8 read as ISO Latin.

Reversed:

[code]dim t as string = "??? ?η ?ικι?αίδεια, ?ην ελεύθερη εγκ?κλο?αίδεια"

t = ConvertEncoding(t, encodings.ISOLatin1)
t = DefineEncoding(t, encodings.UTF8)

Break // shows “??? ?? ???, ??? ??? ???”[/code]

So the text Beatrix receives is valid UTF-8 with unusual code points. But it can be reversed and checked for validity.

Here's the test code:

[code]dim t as string = "??? ?η ?ικι?αίδεια, ?ην ελεύθερη εγκ?κλο?αίδεια"

t = ConvertEncoding(t, Encodings.ISOLatin1)

if Encodings.UTF8.IsValidData(t) then
  t = DefineEncoding(t, Encodings.UTF8)

  break // valid UTF-8 sent as ISO Latin encoded in UTF-8
else
  break // the bytes are not valid UTF-8, so this is not the double-encoding case
end if[/code]
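Wrapped into a helper, the same check could look something like this (the method name LooksDoubleEncoded is just a suggestion):

[code]' Returns True if the text converted to ISO Latin 1 bytes still forms valid UTF-8,
' the tell-tale pattern of UTF-8 that was read as ISO Latin and passed on as UTF-8
Function LooksDoubleEncoded(theText as string) as Boolean
  dim t as string = ConvertEncoding(theText, Encodings.ISOLatin1)
  return Encodings.UTF8.IsValidData(t)
End Function[/code]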