DecodeQuotedPrintable failure

Hmm. Just as I thought I’d sorted out problems related to decoding and conversions, along comes another. In this case, DecodeQuotedPrintable does not behave well in the case where what it is given to decode is not, in fact, quoted printable.

What happens is that I receive an email to be analysed. This particular one came from a spammer who insists that it is encoded as quoted printable when it is, in fact, ordinary text. That is, he lies. The ordinary text includes equals-signs (=), and it looks like DecodeQuotedPrintable just goes ahead and assumes that the two characters following the equals-sign are hex chars when in fact they are not.

The following code demonstrates what happens: a random character is generated. It seems to me I should report this as a bug, since the obvious thing for DecodeQuotedPrintable to do when either of the two following chars is not a hex character is to pass them on through, along with the equals sign.

  // The string showing in the msgbox ought to be the same as the
  // original string, rather than having a random character inserted.
  
  dim  intext, dmp As Text, i, lim As Integer, outtext As MemoryBlock
  
  intext  = "<html><head><style type=""text/css"">"
  outtext = DecodeQuotedPrintable(intext)
  
  dmp = ""
  lim = outtext.Size - 1
  
  for  i=0 to lim
    dmp = dmp + chr(outtext.Byte(i)).ToText
  next
  
  MsgBox ("Text: " + dmp + " hex: " + EncodeHex(dmp, true))

What do you think of this workaround: check the first couple of = signs to see whether the two characters after each are valid hex digits. If they are not, ignore the quoted-printable claim in the Content-Transfer-Encoding header.

PS: Issues with encoding and mail NEVER stop.

The input string needs to be encoded to decode it with DecodeQuotedPrintable. The = character is the escape character in quoted printable strings (see Quoted Printable).

To encode the = character you need to use =3D or apply the EncodeQuotedPrintable to the input string. This will work:

intext = "<html><head><style type=3D""text/css"">"
// or
intext = EncodeQuotedPrintable("<html><head><style type=""text/css"">")

[quote=282017:@Eli Ott]The input string needs to be encoded to decode it with DecodeQuotedPrintable. The = character is the escape character in quoted printable strings (see Quoted Printable).[/quote]

Of course. But the encoding of the data is not under my control, as I explained.

Quoted-printable is defined in RFC2045, which has this to say about my case:

(2)   An "=" followed by a character that is neither a
      hexadecimal digit (including "abcdef") nor the CR
      character of a CRLF pair is illegal.  This case can be
      the result of US-ASCII text having been included in a
      quoted-printable part of a message without itself
      having been subjected to quoted-printable encoding.  A
      reasonable approach by a robust implementation might be
      to include the "=" character and the following
      character in the decoded data without any
      transformation and, if possible, indicate to the user
      that proper decoding was not possible at this point in
      the data.

A robust implementation of DecodeQuotedPrintable sounds good to me.

But then it is not a quoted printable string, so don’t use DecodeQuotedPrintable.

So if you are not sure that it truly is quoted printable, then force it to be so.

I hate dealing with TEXT so I converted to strings for demo purpose…

This shows the “wrong” text output value

  dim  intext, dmp As string, i, lim As Integer, outtext As string
  intext  = "<html><head><style type=""text/css"">"
  outtext = DecodeQuotedPrintable(intext)
  msgbox "IN="+intext+"  Out="+outtext

but this one shows the correct output value

  dim  intext, dmp As string, i, lim As Integer, outtext As string
  intext  = "<html><head><style type=""text/css"">"
  outtext = DecodeQuotedPrintable(EncodeQuotedPrintable(intext))
  msgbox "IN="+intext+"  Out="+outtext

If intext is already a properly quoted printable string, this will not work.

What Tim would like is for the DecodeQuotedPrintable function to throw an exception when an equals sign is not followed by [0123456789ABCDEF\r]. I think he should file a feature request. At the very least, DecodeQuotedPrintable should behave that way when it is implemented for the new framework.

[quote=282029:@Eli Ott]If intext is already a properly quoted printable string, this will not work.

What Tim would like is that the DecodeQuotedPrintable function throws an exception when an equal sign is not followed by [0123456789ABCDEF\r].[/quote]

Nearly right. I don’t want it to throw an exception; what’s the point of that? I want it to behave as recommended by RFC2045, as I posted above.

The point is that my code is not in general going to know whether a supplied string is actually properly quoted printable; it might be, it might not be. All I know is that the remote end claims that it is. But I need to be able to call a function which will behave sensibly whether it is or not.

RFC2045 says it should (if possible) "indicate to the user that proper decoding was not possible at this point in the data."

And how would it signal that? Through an exception of course.

You will have to write your own decode function or you need to test the string for valid characters after each = character before applying DecodeQuotedPrintable.

And you should file this enhancement in Feedback as a feature request.
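The "test the string first" option could look roughly like the sketch below. The names LooksLikeQuotedPrintable and IsHexDigitByte are hypothetical helpers, not part of the Xojo framework, and accepting a bare LF after "=" as a soft line break is an assumption made for leniency. It is only a heuristic: plain text whose equals signs happen to be followed by two hex digits would still pass.

```xojo
// Hypothetical helper: True if b is the byte value of 0-9, A-F or a-f.
Function IsHexDigitByte(b As Integer) As Boolean
  Return (b >= 48 And b <= 57) Or (b >= 65 And b <= 70) Or (b >= 97 And b <= 102)
End Function

// Hypothetical helper: True if every "=" in src is followed by two hex
// digits or by a line ending (soft line break), False otherwise.
Function LooksLikeQuotedPrintable(src As String) As Boolean
  If src = "" Then Return True
  
  Dim mb As MemoryBlock = src
  Dim last As Integer = mb.Size - 1
  Dim i As Integer = 0
  
  While i <= last
    If mb.Byte(i) <> 61 Then // 61 is "="
      i = i + 1
      Continue
    End If
    
    // An "=" as the very last byte cannot start a valid sequence.
    If i + 1 > last Then Return False
    
    Dim nextByte As Integer = mb.Byte(i + 1)
    If nextByte = 10 Then // "=" + bare LF, accepted as an assumption
      i = i + 2
      Continue
    End If
    
    If i + 2 > last Then Return False
    
    Dim afterNext As Integer = mb.Byte(i + 2)
    If nextByte = 13 And afterNext = 10 Then
      i = i + 3 // "=" + CRLF is a soft line break
    ElseIf IsHexDigitByte(nextByte) And IsHexDigitByte(afterNext) Then
      i = i + 3 // "=XX" escape
    Else
      Return False
    End If
  Wend
  
  Return True
End Function
```

A caller would then decode only when the check passes, for example `If LooksLikeQuotedPrintable(body) Then body = DecodeQuotedPrintable(body)`, and otherwise treat the body as unencoded text.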

A sensible way to do it is as with ConvertDataToText and ConvertTextToData, which allow for an exception unless you set the flag to allow suppression of the error. I already have to do this because sometimes the remote end lies about which encoding is being used.

So something like:

DecodeQuotedPrintable (mytest, flag)

is what I shall ask for as a feature request. Meanwhile I have indeed written my own following the suggestions in RFC2045, and it works very satisfactorily.
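For anyone in the same position, here is a minimal sketch of what a decoder along the lines of the RFC2045 suggestion might look like. It is not the code referred to above. DecodeQuotedPrintableLenient, ByteAt and HexValue are hypothetical names, and treating "=" followed by a bare LF as a soft line break is an assumption. An "=" that does not start a valid sequence is copied through unchanged, and the characters after it are then handled as ordinary data.

```xojo
// Hypothetical helper: the byte at index, or -1 past the end of the data.
Function ByteAt(mb As MemoryBlock, index As Integer) As Integer
  If index < mb.Size Then Return mb.Byte(index)
  Return -1
End Function

// Hypothetical helper: value 0-15 of a hex digit byte, or -1 if not hex.
Function HexValue(b As Integer) As Integer
  Select Case b
  Case 48 To 57 // "0" to "9"
    Return b - 48
  Case 65 To 70 // "A" to "F"
    Return b - 55
  Case 97 To 102 // "a" to "f"
    Return b - 87
  End Select
  Return -1
End Function

// Hypothetical lenient decoder. The result has no text encoding
// defined; the caller should apply DefineEncoding as appropriate.
Function DecodeQuotedPrintableLenient(src As String) As String
  If src = "" Then Return ""
  
  Dim inBytes As MemoryBlock = src
  Dim n As Integer = inBytes.Size
  Dim outBytes As New MemoryBlock(n) // output is never longer than input
  Dim i As Integer = 0
  Dim j As Integer = 0
  
  While i < n
    Dim b As Integer = inBytes.Byte(i)
    Dim b1 As Integer = ByteAt(inBytes, i + 1)
    Dim b2 As Integer = ByteAt(inBytes, i + 2)
    
    If b <> 61 Then // not "=": copy the byte as it is
      outBytes.Byte(j) = b
      j = j + 1
      i = i + 1
    ElseIf HexValue(b1) >= 0 And HexValue(b2) >= 0 Then
      outBytes.Byte(j) = HexValue(b1) * 16 + HexValue(b2) // "=XX" escape
      j = j + 1
      i = i + 3
    ElseIf b1 = 13 And b2 = 10 Then
      i = i + 3 // soft line break: "=" + CRLF produces no output
    ElseIf b1 = 10 Then
      i = i + 2 // "=" + bare LF, accepted as an assumption
    Else
      outBytes.Byte(j) = 61 // illegal sequence: pass the "=" through
      j = j + 1
      i = i + 1
    End If
  Wend
  
  Return outBytes.StringValue(0, j)
End Function
```

With the string from the first post, the "=" in type="text/css" is followed by a double quote, which is not a hex digit, so the sketch is meant to hand the text back unchanged. Signalling the problem to the caller, as the RFC also suggests, could be added with a ByRef Boolean parameter or an optional flag in the spirit of the request above.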