Patch: Improve recognition of attachment file name

Patch: Improve recognition of attachment file name

by Nando-9

Greetings, Mr. Barry Warsaw and all other Pythonistas,

How do you like this little patch?

$ svn diff
Index: message.py
===================================================================
--- message.py  (revision 60758)
+++ message.py  (working copy)
@@ -671,7 +671,10 @@
         filename = self.get_param('filename', missing, 'content-disposition')
         if filename is missing:
             filename = self.get_param('name', missing, 'content-disposition')
+        # nando: Some messages specify the file name of attachment this way:
         if filename is missing:
+            filename = self.get_param('name', missing, 'content-type')
+        if filename is missing:
             return failobj
         return utils.collapse_rfc2231_value(filename).strip()


This is the first time I have collaborated this way, so if there is anything
else I can do to help, let me know, because I am sort of ignorant.
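
As a rough illustration of the lookup order the patch above proposes, here is a minimal standalone sketch. It assumes Python 3's email package (which postdates this thread), and find_attachment_name is a hypothetical helper name, not something the library provides.

```python
from email import message_from_string, utils


def find_attachment_name(msg, failobj=None):
    """Hypothetical helper: try each (parameter, header) pair in turn."""
    missing = object()
    lookups = (
        ("filename", "content-disposition"),
        ("name", "content-disposition"),
        ("name", "content-type"),  # the fallback proposed in the patch
    )
    for param, header in lookups:
        value = msg.get_param(param, missing, header)
        if value is not missing:
            return utils.collapse_rfc2231_value(value).strip()
    return failobj


# A part that names its attachment only in Content-Type.
raw = (
    'Content-Type: application/octet-stream; name="report.pdf"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "AAAA\n"
)
msg = message_from_string(raw)
print(find_attachment_name(msg))
```

The loop keeps the three lookups in one place, so adding or reordering a fallback is a one-line change.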

--
Nando Florestan
===============
[skype]    nandoflorestan
[phone]  + 55 (11) 3675-3038
[mobile] + 55 (11) 9820-5451
[internet] http://oui.com.br/
[À Capela] http://acapela.com.br/
[location] São Paulo - SP - Brasil

_______________________________________________
Email-SIG mailing list
Email-SIG@...
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com

Patch: Improve recognition of attachment file name, with encodings

by Nando-9

I have a second suggestion for that same Message.get_filename() method.

It needs to understand filenames that come with text encodings.

The proposed patch is in the attached text file.

Thank you for your time...

Nando Florestan

Index: message.py
===================================================================
--- message.py (revision 60758)
+++ message.py (working copy)
@@ -16,6 +16,7 @@
 import email.charset
 from email import utils
 from email import errors
+from email.header import decode_header
 
 SEMISPACE = '; '
 
@@ -671,8 +672,16 @@
         filename = self.get_param('filename', missing, 'content-disposition')
         if filename is missing:
             filename = self.get_param('name', missing, 'content-disposition')
+        # nando: Some messages specify the file name of attachment this way:
         if filename is missing:
+            filename = self.get_param('name', missing, 'content-type')
+        if filename is missing:
             return failobj
+        """The following line takes care of cases such as this:
+Content-Disposition: attachment;
+  filename="=?ISO-8859-1?Q?z=C7D-_Zoltan=5Fchunk=5F5.wmv?="
+        """
+        filename = decode_header(filename)[0][0]
         return utils.collapse_rfc2231_value(filename).strip()
 
     def get_boundary(self, failobj=None):


Patch: Improve recognition of attachment file name, with encodings

by Stephen J. Turnbull

Nando writes:

 > I have a second suggestion to that same Message.get_filename() method.
 >
 > It needs to understand filenames that come with text encodings.

It does, already, by use of .collapse_rfc2231_value.  That uses RFC
2231 however, not RFC 2047, as you propose.  Use of RFC 2047 encodings
in parameters is specifically forbidden by that standard.

 > +        # nando: Some messages specify the file name of attachment this way:
 >          if filename is missing:
 > +            filename = self.get_param('name', missing, 'content-type')
 > +        if filename is missing:
 >              return failobj
 > +        """The following line takes care of cases such as this:
 > +Content-Disposition: attachment;
 > +  filename="=?ISO-8859-1?Q?z=C7D-_Zoltan=5Fchunk=5F5.wmv?="
 > +        """
 > +        filename = decode_header(filename)[0][0]
 >          return utils.collapse_rfc2231_value(filename).strip()

I feel your pain; Japanese MUAs do this kind of thing all the time,
too.  However, decoding such garbage should not be done without
specific permission from a human user, because it's forbidden by the
standard.
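
To make the distinction above concrete, the following sketch parses the same file name encoded both ways. It assumes Python 3 and the default (compat32) parsing policy; the sample headers are invented for illustration.

```python
from email import message_from_string

# Parameter encoded the RFC 2231 way: charset'language'percent-encoded value.
rfc2231_msg = message_from_string(
    "Content-Disposition: attachment; filename*=iso-8859-1''P%F4nei.txt\n"
    "\n"
    "data\n"
)

# The same name written as an RFC 2047 encoded-word inside the parameter,
# which is the non-standard form discussed in this thread.
rfc2047_msg = message_from_string(
    'Content-Disposition: attachment; filename="=?iso-8859-1?Q?P=F4nei.txt?="\n'
    "\n"
    "data\n"
)

name_2231 = rfc2231_msg.get_filename()
name_2047 = rfc2047_msg.get_filename()
print(repr(name_2231))  # decoded via collapse_rfc2231_value
print(repr(name_2047))  # left as the raw encoded-word
```

Under these assumptions only the RFC 2231 form comes back decoded; the encoded-word is returned untouched, so an application has to decide for itself whether to decode it.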

Re: Patch: Improve recognition of attachment file name, with encodings

by jalopyuser

>  > +        # nando: Some messages specify the file name of attachment this way:
>  >          if filename is missing:
>  > +            filename = self.get_param('name', missing, 'content-type')
>  > +        if filename is missing:
>  >              return failobj
>  > +        """The following line takes care of cases such as this:
>  > +Content-Disposition: attachment;
>  > +  filename="=?ISO-8859-1?Q?z=C7D-_Zoltan=5Fchunk=5F5.wmv?="
>  > +        """
>  > +        filename = decode_header(filename)[0][0]
>  >          return utils.collapse_rfc2231_value(filename).strip()
>
> I feel your pain; Japanese MUAs do this kind of thing all the time,
> too.  However, decoding such garbage should not be done without
> specific permission from a human user, because it's forbidden by the
> standard.

Would it be possible to make this a configurable option, so that if
the user enables it, it's done?

Bill

Re: Patch: Improve recognition of attachment file name, with encodings

by Stephen J. Turnbull

Bill Janssen writes:

 > Would it be possible to make this a configurable option, so that if
 > the user enables it, it's done?

I don't like it at all, but it has to be on the table, because I get
such malformed messages daily.  I don't think it's going to stop.
Users of the email module are going to want to read their mail,
they're going to want to read the file names, so they know where
they're saving attachments and what the content probably is.


Re: Patch: Improve recognition of attachment file name, with encodings

by Nando-9

Decoding of RFC 2047 encoded filenames... I attach an updated patch. Now
it is off by default, but can be enabled by flipping a flag. I have
updated the docstring for the get_filename() method. Let me know if I am
forgetting something.

Two questions:

1) I have done this for the get_filename() method only. The flag that
needs to be set is called *garbage_filename_decoding*. Look, it says
"filename" in there. But are there any other parameters where the
improper usage of RFC 2047 also commonly occurs? If so, maybe a single
flag for all of them would be more appropriate...

2) Is there some flaw in decode_header()? Something that Thunderbird
displays as "Eduardo & Mônica" is being decoded with the wrong character
in place of the ô:
repr(decode_header(m["subject"])[0][0])
'Eduardo & M\xf4nica'
The header being tested is:
Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
In case we are again doing the Right Thing, then why does Thunderbird
display it the way it was intended?

I am not familiar with the RFCs. When I read Stephen Turnbull's message
explaining that these are in fact malformed messages, I was very
worried. (I want the email library to just work...) Fortunately we can
do the right thing by default, while still supporting decoding of the
malformed messages.

I hope you can approve this small patch...

Nando Florestan

Re: Patch: Improve recognition of attachment file name, with encodings

by Nando-9

Looks like I forgot to attach the patch. Sorry. Here it is.

Nando Florestan

Index: message.py
===================================================================
--- message.py (revision 60758)
+++ message.py (working copy)
@@ -16,6 +16,7 @@
 import email.charset
 from email import utils
 from email import errors
+from email.header import decode_header
 
 SEMISPACE = '; '
 
@@ -103,6 +104,7 @@
         self._unixfrom = None
         self._payload = None
         self._charset = None
+        self.garbage_filename_decoding = False
         # Defaults for multipart messages
         self.preamble = self.epilogue = None
         self.defects = []
@@ -665,14 +667,27 @@
         The filename is extracted from the Content-Disposition header's
         `filename' parameter, and it is unquoted.  If that header is missing
         the `filename' parameter, this method falls back to looking for the
-        `name' parameter.
+        `name' parameter. Failing this, the Content-Type header is checked
+        for its `name' parameter.
+        
+        RFC 2231 determines the way in which parameters should be encoded.
+        It specifically forbids the use of RFC 2047 encodings in parameters.
+        (RFC 2047 deals with encoding of headers.) However, a few
+        mail agents do exactly the wrong thing. You can enable RFC 2047
+        decoding of filenames by flipping an instance flag:
+        
+            my_message.garbage_filename_decoding = True
         """
         missing = object()
         filename = self.get_param('filename', missing, 'content-disposition')
         if filename is missing:
             filename = self.get_param('name', missing, 'content-disposition')
         if filename is missing:
+            filename = self.get_param('name', missing, 'content-type')
+        if filename is missing:
             return failobj
+        if self.garbage_filename_decoding:
+            filename = decode_header(filename)[0][0]
         return utils.collapse_rfc2231_value(filename).strip()
 
     def get_boundary(self, failobj=None):
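
The opt-in behaviour in the patch above can also be tried without modifying the library. The sketch below assumes Python 3; LenientMessage is a hypothetical subclass name, and only the flag name garbage_filename_decoding is taken from the patch.

```python
from email import message_from_string
from email.header import decode_header, make_header
from email.message import Message


class LenientMessage(Message):
    """Hypothetical subclass; not part of the email package."""

    # Off by default, mirroring the flag proposed in the patch.
    garbage_filename_decoding = False

    def get_filename(self, failobj=None):
        filename = super().get_filename(failobj)
        if self.garbage_filename_decoding and isinstance(filename, str):
            # Non-standard: treat the parameter value as RFC 2047 encoded-words.
            filename = str(make_header(decode_header(filename)))
        return filename


raw = (
    'Content-Disposition: attachment; filename="=?ISO-8859-1?Q?P=F4nei.wmv?="\n'
    "\n"
    "data\n"
)
msg = message_from_string(raw, _class=LenientMessage)

strict_name = msg.get_filename()       # flag off: raw encoded-word
msg.garbage_filename_decoding = True
lenient_name = msg.get_filename()      # flag on: decoded text
print(repr(strict_name), repr(lenient_name))
```

Keeping the lenient path in a subclass leaves the standards-conforming default untouched, which is the compromise the thread converges on.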


Re: Patch: Improve recognition of attachment file name, with encodings

by Nando-9

OK, I get question number 2 now. My question was:

2) Is there some flaw in decode_header()? Something that Thunderbird
displays as "Eduardo & Mônica" is being decoded with the wrong character
in place of the ô:
repr(decode_header(m["subject"])[0][0])
'Eduardo & M\xf4nica'
The header being tested is:
Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
In case we are again doing the Right Thing, then why does Thunderbird
display it the way it was intended?


The answer is I have to use codecs.decode():

import codecs

In [20]: [(s, encoding)] = decode_header("=?iso-8859-1?Q?P=F4nei?=")

In [21]: s
Out[21]: 'P\xf4nei'

In [22]: encoding
Out[22]: 'iso-8859-1'

In [23]: print codecs.decode(s, encoding)
Pônei

Well, that just makes it even harder to use the return value of the
decode_header() function. And instead of encapsulating all that
complexity in the email library, you are forcing every user of the
library to find all this out by himself, just as I had to.
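
For readers on Python 3, the same two-step decode looks roughly like this. It is a minimal sketch that assumes a header consisting of a single encoded-word, as in the session above.

```python
from email.header import decode_header

# decode_header() undoes the Q/base64 transfer encoding only; for an
# encoded-word it hands back raw bytes plus the declared charset name.
[(raw, charset)] = decode_header("=?iso-8859-1?Q?P=F4nei?=")

# Turning those bytes into text is a separate step left to the caller.
text = raw.decode(charset)
print(text)
```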

This is very un-MartinFowler-like, if you pardon that expression :p

I understand Stephen Turnbull's point that it is useful to map the
Message class to RFC 2822, because some users need that. However, that
is not what *I* need - I want a high-level email library, and I am sure
many others do too.

Other mail libraries have faced the challenges of encodings before. I
don't really see why we in Python should hide from that can o'worms (as
Hans-Peter Jansen put it). It is a dirty job, but someone gotta do it!

"How would you handle a mixture of say: big5, euc_jp, koi8_r _and_ utf-8
encodings?"

Well I don't know what the flabbergast you are talking about, but:

Are you scared?

Why should the application developer have to deal with something that
you e-mail experts are much more qualified to implement?

What is it, are you afraid of having a module accused of being "buggy"?
(If so, you know very well that this is not the free software way.)

What about code reuse? Did you see how much I had to do just in order to
print a Subject header?

I do think that a Message subclass (HighLevelMessage?) could play this
role nicely - a high-level interface. Has anyone done this before? (It
is a very obvious idea.) Is anybody else interested at all? Most of the
vibes I get here are like "don't do this, don't do that"...

Thanks to Mark Shapiro for showing me a way to do what I want.

Nando Florestan

Re: Patch: Improve recognition of attachment file name, with encodings

by Nando-9

You have succeeded in confusing me to the point I don't know whether my
own proposed patch is useful anymore.

It would seem to mean a change in philosophy, from "just implement the
standards closely" to "give the application developer what he needs".
But this goal would only be accomplished with many alterations to the
codebase.

So if this is true:

Stephen J. Turnbull wrote:
>  > 1) Why does not it currently return the *decoded* header?
>
> Because Message is an implementation of RFC 2822, which says nothing
> about decoding headers.  It is very helpful to model your programs
> directly on the standards they claim to conform to.
>  
I cannot see any documentation stating that the way of this project is
to have each class implement one RFC. If there *were* a writeup on this
somewhere, maybe I wouldn't have annoyed you with so many questions and
absurd propositions.

But why would you lie to me? So it must be true...

Then I don't agree with my own patch anymore.

Anyway, the necessity of a high-level interface remains and nobody has
answered my question: whither?

Nando Florestan

Re: Patch: Improve recognition of attachment file name, with encodings

by Stephen J. Turnbull

Nando writes:

 > > Because Message is an implementation of RFC 2822, which says nothing
 > > about decoding headers.  It is very helpful to model your programs
 > > directly on the standards they claim to conform to.

 > I cannot see any documentation stating that the way of this project is
 > to have each class implement one RFC. If there *were* a writeup on this
 > somewhere, maybe I wouldn't have annoyed you with so many questions and
 > absurd propositions.

But I didn't say that it was a general principle.  I said that Message
is an implementation of RFC 2822.  For historical reasons it lives in
the rfc822 module.  Its main docstring says:

  RFC 2822 message manipulation.

  Note: This is only a very rough sketch of a full RFC-822 parser; in
  particular the tokenizing of addresses does not adhere to all the
  quoting rules.

  Note: RFC 2822 is a long awaited update to RFC 822.  This module
  should conform to RFC 2822, and is thus mis-named (it's not worth
  renaming it).  Some effort at RFC 2822 updates have been made, but a
  thorough audit has not been performed.  Consider any RFC 2822
  non-conformance to be a bug.

 > Anyway, the necessity of a high-level interface remains and nobody has
 > answered my question: whither?

As I wrote earlier, I think writing a *general* high-level interface
is going to be very hard.  I don't have time to devote to it.  If you
have a set of use cases that have a lot of common needs, then you can
and should write a module to serve those needs.  You'll probably find
other people with similar needs, and some of them will help.


Re: Patch: Improve recognition of attachment file name, with encodings

by Stuart Bishop

Nando wrote:

> OK, I get question number 2 now. My question was:
>
> 2) Is there some flaw in decode_header()? Something that Thunderbird
> displays as "Eduardo & Mônica" is being decoded with the wrong character
> in place of the ô:
> repr(decode_header(m["subject"])[0][0])
> 'Eduardo & M\xf4nica'
> The header being tested is:
> Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
> In case we are again doing the Right Thing, then why does Thunderbird
> display it the way it was intended?
>
>
> The answer is I have to use codecs.decode():
>
> import codecs
>
> In [20]: [(s, encoding)] = decode_header("=?iso-8859-1?Q?P=F4nei?=")
>
> In [21]: s
> Out[21]: 'P\xf4nei'
>
> In [22]: encoding
> Out[22]: 'iso-8859-1'
>
> In [23]: print codecs.decode(s, encoding)
> Pônei
>
> Well, that just makes it even harder to use the return value of the
> decode_header() function. And instead of encapsulating all that
> complexity in the email library, you are forcing every user of the
> library to find all this out by himself, just as I had to.

It gets harder, as you are not handling Unicode domain names. Code to
convert email addresses between their ASCII and Unicode representations can
be found at http://stuartbishop.net/Software/EmailAddress/

(Barry - we should discuss getting code to do this into the standard library
again. I think I opened a bug on this soon after I wrote it - in 2004!)

It is a bit of a learning curve, and I suspect that most users of the
library have written the same or similar helpers, possibly several times.
e.g. the nearly mandatory header decoder:

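import email.header

def decode_header(s):
    '''Decode an RFC2047 email header into a Unicode string.'''
    chunks = email.header.decode_header(s)
    # Python 3 yields str for unencoded chunks; only bytes need decoding.
    chunks = [b[0].decode(b[1] or 'ascii') if isinstance(b[0], bytes) else b[0]
              for b in chunks]
    return u''.join(chunks)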
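
On Python 3 the standard library can do the joining itself: make_header() accepts the list that decode_header() returns. The sketch below assumes Python 3, and header_to_text is a hypothetical name chosen for illustration.

```python
from email.header import decode_header, make_header


def header_to_text(raw_value):
    """Hypothetical helper: decode RFC 2047 encoded-words into one str."""
    return str(make_header(decode_header(raw_value)))


subject = header_to_text("=?iso-8859-1?Q?Eduardo_&_M=F4nica?=")
print(subject)
```

This covers headers that mix encoded and plain chunks without a hand-written loop, though it still does nothing for internationalized domain names.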


--
Stuart Bishop <stuart@...>
http://www.stuartbishop.net/


