Encoding problem with AppleScript

Normally my app reads data directly from Mail’s files on the hard disk, and the encodings are fine there. The fallback, needed for Gmail for instance, is to get the data via AppleScript. There I noticed a nasty encoding problem: the bytes are different, and when I assign the correct encoding I get garbage.

The script itself is very simple:

tell application "Mail"
	set theMailbox to mailbox ("xxx") of mailbox ("yyy")
	set theSource to source of message 1 of theMailbox
end tell

Let’s say that the mail contains the umlaut ü. This is the byte sequence C3 BC in UTF-8. When I run the AppleScript I get C3 82 C2 BC instead, which shows up as two garbage characters rather than a single ü.

Does anyone have an idea how to get a nice ü out of this mess?

Yosemite, Xojo 2014r2/r3.

Beatrix,

you could use “content” instead of “source” in AppleScript.

  • source (text, r/o) : Raw source of the message – I don’t know the encoding for that

For example:

tell application "Mail"
	set theMailbox to mailbox ("xxx") of account ("yyy")
	set theSender to sender of message 1 of theMailbox
	set theSubject to subject of message 1 of theMailbox
	set theSource to content of message 1 of theMailbox
	set theMail to theSender & return & return & theSubject & return & return & theSource	
end tell

And if you need the header too, you can use: set theHeader to all headers of message 1 of theMailbox

Thanks for the idea, Wolfgang. That would be the last possibility for me, because I have a full mail parser AFTER I get the source of a mail.

I was recently struggling with a similar issue.

Encodings out of Mail.app are, apparently, a mess. Mail.app handles the encoding correctly internally and, using drag and drop, you can get text in the proper encoding (content is always in UTF-8, it seems). But if you save the text or transfer it via AppleScript, Mail goes on a wild randomization spree and you can’t be sure of exactly what you will get.

Although I figured out a solution to my particular issue (with Wolfgang’s help), since it uses the content of the message and not the source, I don’t know if it will work for you.

That said, while I was struggling with the Mail.app issue, I also looked at various other e-mail clients. The state of Mac e-mail clients is pretty all over the place. But, one constant seems to be their lack of good AppleScript support.

The one exception I found was Outlook (Outlook 2011, not the subscription-only Outlook that was just recently released, which I haven’t tried). Say what you will about Outlook (it is overkill for what most people actually need, has several interface atrocities and can be buggy in many different ways), but it has an incredibly rich AppleScript dictionary and its handling of encodings seems to be rock-solid. I was able to use it during my battles with Mail.app without having to change my Xojo application at all, and it worked very well.

It’s entirely possible that Outlook 2011 is not an option for you, but I wanted to throw this out as a potential solution, just in case.

In recent OS versions AppleScript standardized itself, separately from the rest of the OS, on UTF-16 throughout. If you’re getting data from AppleScript it may be UTF-16 encoded, which turns into garbage for high-order glyphs if you just define the encoding as UTF-8. Try doing a DefineEncoding on the data you’re getting back with Encodings.UTF16 and see if that makes it all work. Once the encoding is set to UTF-16 you can convert the encoding to UTF-8 without getting garbage, but it may be enough to just define the proper encoding for the string from AppleScript. Or that might have nothing to do with it :wink:
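To make the difference between defining and converting an encoding concrete, here is a minimal sketch in Python rather than Xojo. It assumes the data arrives as big-endian UTF-16 without a byte order mark, which is an assumption for illustration and not something confirmed for Mail or the MBS plugin. Decoding with a given codec plays the role of "define encoding", and re-encoding plays the role of "convert encoding".

```python
# Sketch only: Python stand-ins for "define encoding" vs "convert encoding".
text = "\u00fc"  # the umlaut ü

# Bytes as they might arrive if the data really is UTF-16
# (big-endian and no BOM are assumptions made for this example).
raw_utf16 = text.encode("utf-16-be")  # 00 FC

# Wrong: treating the UTF-16 bytes as UTF-8 only relabels them.
# FC is not a valid UTF-8 byte, so it becomes the replacement character.
mislabelled = raw_utf16.decode("utf-8", errors="replace")

# Right: define the bytes as UTF-16 first, then convert to UTF-8.
decoded = raw_utf16.decode("utf-16-be")
utf8_bytes = decoded.encode("utf-8")  # C3 BC

print(raw_utf16.hex(" "), "->", utf8_bytes.hex(" "))
```

If the bytes were already wrong before this step, defining the right encoding cannot repair them; it only avoids adding a second layer of damage.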

@Scott: I have this working for Outlook. My application supports all major mail clients. And yes, the more modern ones don’t have any AppleScript support, which means that my app can’t support them.

@James: The data comes in as UTF16 via the MBS plugin. But when I peek at the raw data and convert this to UTF8 I get the same mess as before.

Could this be a problem of precomposed vs. decomposed characters? The fact that I got 4 bytes instead of 1 or 2 is strange.
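The precomposed vs. decomposed idea can be checked quickly. The sketch below uses Python's standard `unicodedata` module to compare the UTF-8 bytes of both forms of ü with the four bytes reported earlier in the thread. It is only a byte-level check and says nothing about what Mail does internally.

```python
import unicodedata

# Precomposed form: a single code point, U+00FC.
precomposed = unicodedata.normalize("NFC", "\u00fc")
# Decomposed form: the letter u followed by U+0308 (combining diaeresis).
decomposed = unicodedata.normalize("NFD", "\u00fc")

nfc_bytes = precomposed.encode("utf-8")  # C3 BC
nfd_bytes = decomposed.encode("utf-8")   # 75 CC 88

# The byte sequence reported for the AppleScript result.
reported = bytes.fromhex("C3 82 C2 BC")

print("NFC:     ", nfc_bytes.hex(" "))
print("NFD:     ", nfd_bytes.hex(" "))
print("reported:", reported.hex(" "))
print("matches either form:", reported in (nfc_bytes, nfd_bytes))
```

The decomposed ü is three bytes in UTF-8, not four, and neither form matches the reported sequence, so decomposition alone does not seem to explain the result.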

Ah… got it. The application is for more general use. For some reason, I got it in my head that you were working on a specific use application.

Unfortunately, in my (admittedly limited) tests, what I was getting out of AppleScript from Mail.app was not UTF-16. It was UTF-8. AppleScript may use UTF-16 internally, but it appears it tries to maintain the encoding of the source material when passing strings from one application to another.

From memory, I believe you can change the text encoding in AppleScript.

For example

set myText to "This text as UTF8" as text 

set myOtherText to "This text as UTF16" as unicode text

I could be wrong with this, but worth trying out.

Regards Mark

To convert to UTF-16, just do:

set MyText to something as unicode text

To convert back to UTF-8 you have to go a bit further and do something like:

set MyText to something as «class utf8»

AppleScript internally handles data as UTF16 and converts the result to UTF8. But using the MBS plugin I can peek at the original UTF16 and this already has the incorrect encoding.

I need a break before I have a look at this problem again.

[quote=163295:@Beatrix Willius]Let’s say that the mail contains the umlaut ü. This is the byte sequence C3 BC in UTF-8. When I run the AppleScript I get C3 82 C2 BC instead, which shows up as two garbage characters rather than a single ü.

Does anyone have an idea how to get a nice ü out of this mess?[/quote]

I knew I had seen this before. Look at
http://stackoverflow.com/questions/14980200/converting-special-charactes-such-as-ü-and-Ã-back-to-their-original-latin-alp
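The pattern behind that kind of garbage is usually double encoding: UTF-8 bytes get misread as a single-byte encoding and are then encoded as UTF-8 a second time. Here is a minimal Python sketch of the round trip, assuming the single-byte encoding is Latin-1 (it could just as well be Windows-1252 or something else). The helper names `double_encode` and `repair_mojibake` are made up for this example.

```python
def double_encode(text: str) -> bytes:
    """Simulate the damage: UTF-8 bytes misread as Latin-1, then re-encoded."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def repair_mojibake(data: bytes) -> bytes:
    """Undo one layer of Latin-1 double encoding and return clean UTF-8 bytes.

    Raises UnicodeError if the data does not fit the assumed pattern.
    """
    return data.decode("utf-8").encode("latin-1")


damaged = double_encode("\u00fc")   # C3 83 C2 BC
fixed = repair_mojibake(damaged)    # C3 BC

print(damaged.hex(" "), "->", fixed.hex(" "))
```

Under this assumption the damaged form of ü is C3 83 C2 BC, one byte away from the C3 82 C2 BC quoted above, so this shows the general technique rather than a confirmed match for what Mail returns.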

Thanks, Michel! I will have a look.