Some enterprising soul sends emails with a Greek encoding but labels them as UTF-8. This ends up as
[quote]Σε ??νέ?εια ?η? ενημέρ??η? για ?ον νέο λογ???ο, ε?ι??νά??ε?αι μία ?ελίδα
με κά?οια δείγμα?α α?? ?α αρ?εία ?ο? θα βρεί?ε ??ον ??ε?ικ? ?ύνδε?μο ?ο[/quote]
I can fix the wrongly encoded data with the Python library ftfy (I still need to turn that into an app, but that's another problem).
How do I recognize that the encoding is bad? “Encodings.UTF8.IsValidData(myData)” returns True, not False, for this data.
Usually I fish out the encoding value either from the Content-Type or the Content-Transfer-Encoding. Additionally, I can check the HTML. Then I set the encoding:
[code]if theCharset = "iso-8859-1" then
  currentBody = DefineEncoding(currentBody, Encodings.ISOLatin1)
elseif theCharset = "macintosh" then
  currentBody = DefineEncoding(currentBody, Encodings.MacRoman)
elseif theCharset = "iso-8859-2" then
  currentBody = DefineEncoding(currentBody, Encodings.ISOLatin2)
elseif theCharset = "Windows-1252" then
  currentBody = DefineEncoding(currentBody, Encodings.WindowsANSI)
' and so on.[/code]
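The “fishing out” step can be sketched with Xojo's RegEx class; this is only a sketch, and contentTypeHeader is a hypothetical variable holding the raw Content-Type header value:
[code]' e.g. contentTypeHeader = "text/html; charset=""iso-8859-7"""
dim re as new RegEx
re.SearchPattern = "charset=""?([A-Za-z0-9_\-]+)"
dim match as RegExMatch = re.Search(contentTypeHeader)
dim theCharset as string = ""
if match <> nil then theCharset = match.SubExpressionString(1)
' theCharset then drives the if/elseif chain above; Xojo's = on strings
' is case-insensitive, so "UTF-8" and "utf-8" compare equal.[/code]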
The problem with my customer's mail is that the headers say UTF-8, so my code treats the text as UTF-8, but the text is actually in a Greek encoding. I can fix the encoding problem with ftfy:
[code]' write the body to a temp file so ftfy can read it
dim theFolderitem as FolderItem = folderItemUtils.getTempFolderitemUU
dim theBinStream as BinaryStream
theBinStream = BinaryStream.Create(theFolderitem)
theBinStream.Write(currentBody)
theBinStream.Close

' run ftfy on the temp file and take its output as the repaired body
dim theShellScript as string
dim theShell as new Shell
theShellScript = "/usr/local/bin/ftfy " + theFolderitem.ShellPath
theShell.Execute theShellScript
if theShell.ErrorCode = 0 then
  currentBody = theShell.Result
end if[/code]
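One caveat: depending on the platform, the string in theShell.Result may come back without a text encoding defined, so it can help to tag it explicitly before further processing (assuming ftfy emits UTF-8):
[code]currentBody = DefineEncoding(theShell.Result, Encodings.UTF8)[/code]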
All of this is not the problem. I only need to know WHEN the text is screwed up.
Well, start at the beginning, when the e-mail message is composed and sent by the user.
Check which mail client they use to send e-mail; maybe you can see where the problem starts.
Also try to determine the system settings they are using on their workstation.
Consider the mail server settings on the server side as well, and how the so-called transport hub processes incoming mail before it is stored in the DBMS.
Be aware that the server end can also introduce a glitch when the data is stored in the mail server's database, which can cause encoding issues too.
Regarding encoding and Xojo:
Personally, I ran into a big issue where Base64 encoding doesn't work well when transferring data between macOS and iOS with Xojo.
To this day that issue has no 100% working solution; the only way around it was to use hex encoding, which is larger in data size.
So you can expect issues and problems with data encoding conversion, too!
A friendly suggestion: if you are good with Python, try this in Python first and see how far you get, then go back to the Xojo app and build it all there.
Uh, that isn’t really helping. I’m getting the data from Mail as is. I know that Mail tends to change the data - most likely this is why the data is screwed up. But this doesn’t matter because the customer uses Mail and the data is what it is.
I have an algorithm for parsing the mail parts. I just need to figure out here how to recognise bad data.
Since you have the mails and access to all their contents, you can inspect which ones arrive with Greek letters and then, even if it's a false positive, check the parsing on your end so you know how to fix it when it comes through with bad encoding/contents.
Also try to use byte-safe conversion.
Make a lookup table of all Greek letters and compare it against the mail contents as an extra check inside your code (see the sketch below).
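A minimal sketch of such a check, assuming the body string already has an encoding defined; the function name and the use of the Greek and Coptic block (&h0370..&h03FF) are just illustrative:
[code]Function GreekCharRatio(body as String) as Double
  ' Count characters whose codepoints fall in the Greek and Coptic block
  ' and return them as a fraction of all letters examined. A body that
  ' claims to be Greek but scores near zero here is suspect.
  dim total, greek as Integer
  for i as Integer = 1 to Len(body)
    dim cp as Integer = Asc(Mid(body, i, 1))
    if cp > 64 then ' ignore numerals and basic punctuation
      total = total + 1
      if cp >= &h370 and cp <= &h3FF then greek = greek + 1
    end if
  next
  if total = 0 then return 0
  return greek / total
End Function[/code]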
You didn't explain how you get the mail contents before doing the encoding conversion with the code above, since even in a DBMS the data is stored with some encoding.
Can you post some example data as plain text files (provide download links), not as strings in the post:
the mail body contents as-is, before conversion (a dump from the mail server's DBMS),
the same data after that conversion, and
optionally, a corrected example from your end showing how it should look when the data is a false positive.
The problem is that these particular mails aren't coming with Greek characters. And of course, I want my code to be more generic. At this point there is no DBMS involved, because the data is parsed before it is written to the database.
I'm now experimenting with NSLinguisticTaggerMBS. If I examine the words in the first 100 characters, the tagger recognises no real words because the words are only one character long. Apart from "ro" there is no dominant language found.
If you only need to detect that the file is incorrectly encoded, then the way I see it, aside from manually examining the file, there are two options.
First, look for byte sequences that are not legal in a utf-8 file. When the file is imported into a text or string variable, I assume that Xojo will convert these into the Unicode replacement character &uFFFD. You can search for this, and if found, you know the file is bad, and where it’s bad.
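A minimal sketch of that first check (the function name is made up, and the &uFFFD search relies on the assumption above about how an earlier failed conversion leaves the replacement character in the string):
[code]Function LooksLikeBrokenUTF8(body as String) as Boolean
  ' Raw bytes that are not legal UTF-8 at all.
  if not Encodings.UTF8.IsValidData(body) then return True
  ' A replacement character left behind by an earlier conversion.
  if InStr(body, &uFFFD) > 0 then return True
  return False
End Function[/code]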
Second, it’s possible that the file was somehow turned into utf-8 in such a way that there are no illegal byte sequences. In this case it is a valid utf-8 file, even though the characters are wrong. For this you could do a statistical analysis of the characters in the file. For example, for any particular language, the majority of characters should fall into a narrow group of codepoints. For Latin coding they should be in the basic Latin group; for Greek they should fall into the Greek group; for Cyrillic they should fall into the Cyrillic group; and so on. If they seem to be randomly distributed across many different language groups, or if there is a strange ratio of vowels to consonants, then there is a high probability of incorrect encoding.
Years ago, I used this statistical technique to break a simple substitution encryption scheme (nothing malicious).
Not sure if this is what you’re looking for, but you could give it a try. Following on what I wrote above, it does a simple statistical analysis of the text, and tries to guess whether it’s a valid utf-8 encoding. If so, it tries to guess the language.
[code]Function GuessLanguage(myData As String) as String
dim myDataLen As Integer = len(myData)
dim p As Integer = InStr(myData,&uFFFD)
if p>0 then return "Illegal character at position "+str(p)
dim e,e2,stDev,skippedCharRatio As Double = 0
dim codePoint,nChars,mean,nonNumeric As Integer = 0
dim language As String = ""
'calculate sum and sum of squares of codePoints
for i As Integer = 1 to myDataLen
codePoint=asc(mid(myData,i,1))
'skip numerals, punctuation and chars above &h700
if codePoint > 64 then
nonNumeric=nonNumeric+1
if codePoint< &h700 and not(codePoint>&h7F and codePoint<&hC0) then
e=e+codePoint
e2=e2+codePoint*codePoint
nChars=nChars+1
end if
end if
next
'Now calculate mean and standard deviation
e=e/nChars
e2=e2/nChars
stDev = sqrt(e2-e^2)
mean = e+0.5 'round to integer
skippedCharRatio=nonNumeric/nChars
if skippedCharRatio>2 then
'This happens if there is a large percentage of skipped characters
language = "Unknown"
ElseIf mean >= &h0041 and mean <= &h007A then
language = "Latin"
ElseIf mean >= &h007B and mean <= &h00AF then
'This occurs if a large percentage of characters are in the &h0080..&h00FF range
'If the mean value is close to the low end of the 7A..FF range then it may just
'be highly accented Latin
language = "Accented Latin"
ElseIf mean >= &h00B0 and mean <= &h00FF and stDev <40 then
'This occurs if a large percentage of characters are in the &h0080..&h00FF range
'If the standard deviation is small and the mean is in the middle or
'upper end of this range, then mis-encoding is more likely.
language = "Mis-encoded"
ElseIf mean >= &h007B and mean <= &h00FF then
'If the mean is in the range &h0080..&h00FF, and neither of the two preceding
'conditions apply, then there's not enough info to make a guess.
language = "Unknown"
ElseIf mean >= &h0391 and mean <= &h03c9 then
language = "Greek"
ElseIf mean >= &h0410 and mean <= &h052F then
language = "Cyrillic"
ElseIf mean >= &h0530 and mean <= &h058F then
language = "Armenian"
ElseIf mean >= &h0590 and mean <= &h05FF then
language = "Hebrew"
ElseIf mean >= &h0600 and mean <= &h06FF then
language = "Arabic"
'More ElseIf cases can be included here for other languages
else
language = "Unknown"
end if
language = language + ", Mean = &h" + hex(mean) + ", StDev = " + str(stDev) _
  + ", SkpRatio = " + str(skippedCharRatio) + EndOfLine
return language
End Function[/code]
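A hypothetical call site, assuming currentBody holds the message text from the parsing step:
[code]dim verdict as String = GuessLanguage(currentBody)
if InStr(verdict, "Mis-encoded") > 0 then
  ' hand the body to ftfy, or re-define its encoding, as discussed above
end if[/code]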
In the analysis, it skips codepoints below &h40 and codepoints &h80..&hBF, which include common punctuation, numerals, and less common symbols, none of which would provide any useful information. It also skips very high codepoints belonging to languages it doesn't try to identify.
In the result, the standard deviation should be of the same order of magnitude as the number of characters in the applicable alphabet; the smaller the StDev value, the better. Latin, Cyrillic and Greek average around 25 letters each; multiplied by 2 to include upper and lower case, that gives a count of about 50. The standard deviation should be smaller than this unless there is a very large number of accented characters. SkpRatio compares the characters that were skipped because they were outside the range of interest with the number of characters that were in the range of interest; again, the smaller the better. So the values of StDev and SkpRatio should be used to judge how good the guess is. The most difficult case is distinguishing heavily accented Latin text from mis-encoded text, so this area may need some fine tuning.
I should clarify that using the snippet of text that you included in your original post, this function will return “mis-encoded,” not “Greek.”
Most of the Windows and ISO encodings are 256-character sets, with the lower 128 codepoints being the standard ASCII Latin alphabet and the codepoints from &h80 to &hFF representing the alphabet of the language in question. So, if we see a large number of codepoints in the &h80 to &hFF range, we only know that it's probably a mis-encoded language, but it's not easy to tell which one. On the other hand, if it's properly encoded utf-8, then the codepoints will be distinctly grouped in the correct Unicode codepoint ranges, which are easy to detect.
I mentioned that the most difficult problem is in differentiating between heavily accented Latin and a mis-encoding. Heavily accented Latin will include many characters in the &h80 to &hFF range, but should still include many characters in the &h40 to &h7F range which will draw the mean value downwards. On the other hand, if this is, for example, ISO 8859-7 (Greek) encoding, masquerading as utf-8, then we would see most of the characters in the &hC0 to &hFF range which would have mean value closer to &hE0, and a smaller standard deviation, which we should be able to detect.
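If that is the case, a native repair (instead of shelling out to ftfy) could be sketched like this; it assumes the raw ISO 8859-7 bytes are still intact and merely mislabelled, and that your Xojo version exposes Encodings.ISOLatinGreek:
[code]' Re-tag the bytes as ISO 8859-7 without converting them...
dim greekBody as String = DefineEncoding(currentBody, Encodings.ISOLatinGreek)
' ...then convert to real UTF-8.
currentBody = ConvertEncoding(greekBody, Encodings.UTF8)[/code]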
As I mentioned, the text snippet from your first post correctly resulted in “mis-encoded.” I also tried it with other samples of text, and it correctly identified them. For heavily accented Latin I used a sample written in Czech from cs.wikipedia.org, as I couldn’t think of any other Latin language that would be more heavily accented. It correctly returned “Accented Latin.”