Patch: Improve recognition of attachment file name

Greetings, Mr. Barry Warsaw and all other Pythonistas,

How do you like this little patch?

$ svn diff
Index: message.py
===================================================================
--- message.py (revision 60758)
+++ message.py (working copy)
@@ -671,7 +671,10 @@
         filename = self.get_param('filename', missing, 'content-disposition')
         if filename is missing:
             filename = self.get_param('name', missing, 'content-disposition')
+        # nando: Some messages specify the file name of attachment this way:
         if filename is missing:
+            filename = self.get_param('name', missing, 'content-type')
+        if filename is missing:
             return failobj
         return utils.collapse_rfc2231_value(filename).strip()

This is the first time I collaborate this way, so if there is anything else I can do to help, let me know, cause I am sort of ignorant.

--
Nando Florestan
===============
[skype] nandoflorestan
[phone] + 55 (11) 3675-3038
[mobile] + 55 (11) 9820-5451
[internet] http://oui.com.br/
[À Capela] http://acapela.com.br/
[location] São Paulo - SP - Brasil

_______________________________________________
Email-SIG mailing list
Email-SIG@...
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com
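
The lookup order proposed in the patch can be sketched as a standalone helper. This is only a minimal sketch for Python 3, not the patch itself: `lookup_filename` and the sample message are hypothetical names made up for illustration, while `get_param()` and `utils.collapse_rfc2231_value()` are the calls the patch already relies on.

```python
from email import message_from_string, utils


def lookup_filename(msg, failobj=None):
    """Try Content-Disposition 'filename', then 'name', then Content-Type 'name'."""
    missing = object()
    candidates = (
        ("filename", "content-disposition"),
        ("name", "content-disposition"),
        ("name", "content-type"),  # the fallback the patch adds
    )
    for param, header in candidates:
        value = msg.get_param(param, missing, header)
        if value is not missing:
            return utils.collapse_rfc2231_value(value).strip()
    return failobj


# Hypothetical sample: the name appears only on the Content-Type header.
RAW = (
    'Content-Type: application/pdf; name="report.pdf"\n'
    "Content-Disposition: attachment\n"
    "\n"
    "body\n"
)
msg = message_from_string(RAW)
print(lookup_filename(msg))
```

Writing the candidates as a table keeps the order of precedence visible in one place, which is the only thing the patch changes.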

Patch: Improve recognition of attachment file name, with encodings

I have a second suggestion to that same Message.get_filename() method.

It needs to understand filenames that come with text encodings. The proposed patch is in the attached text file.

Thank you for your time...

Nando Florestan

Index: message.py
===================================================================
--- message.py (revision 60758)
+++ message.py (working copy)
@@ -16,6 +16,7 @@
 import email.charset
 from email import utils
 from email import errors
+from email.header import decode_header

 SEMISPACE = '; '

@@ -671,8 +672,16 @@
         filename = self.get_param('filename', missing, 'content-disposition')
         if filename is missing:
             filename = self.get_param('name', missing, 'content-disposition')
+        # nando: Some messages specify the file name of attachment this way:
         if filename is missing:
+            filename = self.get_param('name', missing, 'content-type')
+        if filename is missing:
             return failobj
+        """The following line takes care of cases such as this:
+Content-Disposition: attachment;
+ filename="=?ISO-8859-1?Q?z=C7D-_Zoltan=5Fchunk=5F5.wmv?="
+        """
+        filename = decode_header(filename)[0][0]
         return utils.collapse_rfc2231_value(filename).strip()

     def get_boundary(self, failobj=None):

Patch: Improve recognition of attachment file name, with encodings

Nando writes:

> I have a second suggestion to that same Message.get_filename() method.
>
> It needs to understand filenames that come with text encodings.

It does, already, by use of .collapse_rfc2231_value. That uses RFC 2231 however, not RFC 2047, as you propose. Use of RFC 2047 encodings in parameters is specifically forbidden by that standard.

> +        # nando: Some messages specify the file name of attachment this way:
>          if filename is missing:
> +            filename = self.get_param('name', missing, 'content-type')
> +        if filename is missing:
>              return failobj
> +        """The following line takes care of cases such as this:
> +Content-Disposition: attachment;
> + filename="=?ISO-8859-1?Q?z=C7D-_Zoltan=5Fchunk=5F5.wmv?="
> +        """
> +        filename = decode_header(filename)[0][0]
>          return utils.collapse_rfc2231_value(filename).strip()

I feel your pain; Japanese MUAs do this kind of thing all the time, too. However, decoding such garbage should not be done without specific permission from a human user, because it's forbidden by the standard.
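
The difference between the two encodings can be seen side by side. The following is a minimal sketch, assuming Python 3 and the default (compat32) parsing of `message_from_string`; both sample messages are invented for illustration. The first uses the RFC 2231 extended-parameter syntax that `get_filename()` understands, the second puts an RFC 2047 encoded-word inside the parameter, which is the non-standard form under discussion.

```python
from email import message_from_string

# Hypothetical message using the RFC 2231 form: charset'language'percent-encoded.
RFC2231_STYLE = (
    "Content-Type: text/plain\n"
    "Content-Disposition: attachment; filename*=iso-8859-1''P%F4nei.txt\n"
    "\n"
    "body\n"
)

# Hypothetical message using an RFC 2047 encoded-word inside the parameter.
RFC2047_STYLE = (
    "Content-Type: text/plain\n"
    'Content-Disposition: attachment; filename="=?iso-8859-1?Q?P=F4nei.txt?="\n'
    "\n"
    "body\n"
)

standard_name = message_from_string(RFC2231_STYLE).get_filename()
nonstandard_name = message_from_string(RFC2047_STYLE).get_filename()

print(repr(standard_name))     # decoded by collapse_rfc2231_value()
print(repr(nonstandard_name))  # left exactly as it appeared in the header
```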

Re: Patch: Improve recognition of attachment file name, with encodings

> > +        # nando: Some messages specify the file name of attachment this way:
> >          if filename is missing:
> > +            filename = self.get_param('name', missing, 'content-type')
> > +        if filename is missing:
> >              return failobj
> > +        """The following line takes care of cases such as this:
> > +Content-Disposition: attachment;
> > + filename="=?ISO-8859-1?Q?z=C7D-_Zoltan=5Fchunk=5F5.wmv?="
> > +        """
> > +        filename = decode_header(filename)[0][0]
> >          return utils.collapse_rfc2231_value(filename).strip()
>
> I feel your pain; Japanese MUAs do this kind of thing all the time,
> too. However, decoding such garbage should not be done without
> specific permission from a human user, because it's forbidden by the
> standard.

Would it be possible to make this a configurable option, so that if the user enables it, it's done?

Bill

Re: Patch: Improve recognition of attachment file name, with encodings

Bill Janssen writes:

> Would it be possible to make this a configurable option, so that if
> the user enables it, it's done?

I don't like it at all, but it has to be on the table, because I get such malformed messages daily. I don't think it's going to stop. Users of the email module are going to want to read their mail, they're going to want to read the file names, so they know where they're saving attachments and what the content probably is.

Re: Patch: Improve recognition of attachment file name, with encodings

Decoding of RFC 2047 encoded filenames... I attach an updated patch. Now it is off by default, but can be enabled by flipping a flag. I have updated the docstring for the get_filename() method. Let me know if I am forgetting something.

Two questions:

1) I have done this for the get_filename() method only. The flag that needs to be set is called *garbage_filename_decoding*. Look, it says "filename" in there. But are there any other parameters where the improper usage of RFC 2047 also commonly occurs? If so, maybe a single flag for all of them would be more appropriate...

2) Is there some flaw in decode_header()? Something that Thunderbird displays as "Eduardo & Mônica" is being decoded with the wrong character in place of the ô:

    repr(decode_header(m["subject"])[0][0])
    'Eduardo & M\xf4nica'

The header being tested is:

    Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=

In case we are again doing the Right Thing, then why does Thunderbird display it the way it was intended?

I am not familiar with the RFCs. When I read Stephen Turnbull's message explaining that these are in fact malformed messages, I was very worried. (I want the email library to just work...) Fortunately we can do the right thing by default, while still supporting decoding of the malformed messages.

I hope you can approve this small patch...

Nando Florestan

Stephen J. Turnbull wrote:
> Bill Janssen writes:
>
> > Would it be possible to make this a configurable option, so that if
> > the user enables it, it's done?
>
> I don't like it at all, but it has to be on the table, because I get
> such malformed messages daily. I don't think it's going to stop.
> Users of the email module are going to want to read their mail,
> they're going to want to read the file names, so they know where
> they're saving attachments and what the content probably is.

Re: Patch: Improve recognition of attachment file name, with encodings

Looks like I forgot to attach the patch. Sorry. Here it is.

Nando Florestan

Index: message.py
===================================================================
--- message.py (revision 60758)
+++ message.py (working copy)
@@ -16,6 +16,7 @@
 import email.charset
 from email import utils
 from email import errors
+from email.header import decode_header

 SEMISPACE = '; '

@@ -103,6 +104,7 @@
         self._unixfrom = None
         self._payload = None
         self._charset = None
+        self.garbage_filename_decoding = False
         # Defaults for multipart messages
         self.preamble = self.epilogue = None
         self.defects = []
@@ -665,14 +667,27 @@
         The filename is extracted from the Content-Disposition header's
         `filename' parameter, and it is unquoted. If that header is missing
         the `filename' parameter, this method falls back to looking for the
-        `name' parameter.
+        `name' parameter. Failing this, the Content-Type header is checked
+        for its `name' parameter.
+
+        RFC 2231 determines the way in which parameters should be encoded.
+        It specifically forbids the use of RFC 2047 encodings in parameters.
+        (RFC 2047 deals with encoding of headers.) However, a few
+        mail agents do exactly the wrong thing. You can enable RFC 2047
+        decoding of filenames by flipping an instance flag:
+
+            my_message.garbage_filename_decoding = True
         """
         missing = object()
         filename = self.get_param('filename', missing, 'content-disposition')
         if filename is missing:
             filename = self.get_param('name', missing, 'content-disposition')
         if filename is missing:
+            filename = self.get_param('name', missing, 'content-type')
+        if filename is missing:
             return failobj
+        if self.garbage_filename_decoding:
+            filename = decode_header(filename)[0][0]
         return utils.collapse_rfc2231_value(filename).strip()

     def get_boundary(self, failobj=None):
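
The opt-in behaviour can also be sketched without touching `Message` at all. What follows is a rough Python 3 illustration under stated assumptions, not the attached patch: `get_filename_lenient` and its `decode_rfc2047` keyword are hypothetical names standing in for the proposed instance flag, and the sample message is invented.

```python
from email import message_from_string
from email.header import decode_header, make_header


def get_filename_lenient(msg, decode_rfc2047=False, failobj=None):
    """Return the attachment filename, decoding RFC 2047 words only on request.

    `decode_rfc2047` is a hypothetical stand-in for the instance flag in the
    patch; like that flag, it is off by default.
    """
    filename = msg.get_filename(failobj)
    if filename is failobj or not decode_rfc2047:
        return filename
    # make_header() joins the (bytes, charset) pairs into one Header object,
    # and str() turns that into ordinary text.
    return str(make_header(decode_header(filename)))


# Hypothetical message with a non-standard, RFC 2047 encoded filename.
RAW = (
    "Content-Type: video/x-ms-wmv\n"
    'Content-Disposition: attachment; filename="=?ISO-8859-1?Q?P=F4nei.wmv?="\n'
    "\n"
    "body\n"
)
msg = message_from_string(RAW)

strict_name = get_filename_lenient(msg)
lenient_name = get_filename_lenient(msg, decode_rfc2047=True)
print(repr(strict_name))
print(repr(lenient_name))
```

Keeping the switch as an explicit argument means the caller, rather than the library, gives the permission Stephen asked for.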

Re: Patch: Improve recognition of attachment file name, with encodings

OK, I get question number 2 now. My question was:

2) Is there some flaw in decode_header()? Something that Thunderbird displays as "Eduardo & Mônica" is being decoded with the wrong character in place of the ô:

    repr(decode_header(m["subject"])[0][0])
    'Eduardo & M\xf4nica'

The header being tested is:

    Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=

In case we are again doing the Right Thing, then why does Thunderbird display it the way it was intended?

The answer is I have to use codecs.decode():

    import codecs

    In [20]: [(s, encoding)] = decode_header("=?iso-8859-1?Q?P=F4nei?=")

    In [21]: s
    Out[21]: 'P\xf4nei'

    In [22]: encoding
    Out[22]: 'iso-8859-1'

    In [23]: print codecs.decode(s, encoding)
    Pônei

Well, that just makes it even harder to use the return value of the decode_header() function. And instead of encapsulating all that complexity in the email library, you are forcing every user of the library to find all this out by himself, just as I had to. This is very un-MartinFowler-like, if you pardon that expression :p

I understand Stephen Turnbull's point that it is useful to map the Message class to RFC 2822, because some users need that. However, that is not what *I* need - I want a high-level email library, and I am sure many others do too.

Other mail libraries have faced the challenges of encodings before. I don't really see why we in Python should hide from that can o'worms (as Hans-Peter Jansen put it). It is a dirty job, but someone gotta do it!

"How would you handle a mixture of say: big5, euc_jp, koi8_r _and_ utf-8 encodings?"

Well I don't know what the flabbergast you are talking about, but: Are you scared? Why should the application developer have to deal with something that you e-mail experts are much more qualified to implement? What is it, are you afraid of having a module accused of being "buggy"? (If so, you know very well that this is not the free software way.) What about code reuse? Did you see how much I had to do just in order to print a Subject header?

I do think that a Message subclass (HighLevelMessage?) could play this role nicely - a high-level interface. Has anyone done this before? (It is a very obvious idea.) Is anybody else interested at all? Most of the vibes I get here are like "don't do this, don't do that"...

Thanks to Mark Shapiro for showing me a way to do what I want.

Nando Florestan
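
For readers arriving at this thread later: Python 3 versions of the email package expose a higher-level interface through policies, which covers the Subject case above. The sketch below assumes Python 3.6 or newer and reuses the header from the question; the variable names are only illustrative.

```python
from email import message_from_string, policy

RAW = "Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=\n\nbody\n"

# Default parsing keeps the header exactly as transmitted (compat32 policy).
legacy = message_from_string(RAW)
# policy.default asks the parser to decode encoded-words when a header is read.
modern = message_from_string(RAW, policy=policy.default)

legacy_subject = legacy["subject"]
modern_subject = str(modern["subject"])

print(repr(legacy_subject))
print(repr(modern_subject))
```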

Re: Patch: Improve recognition of attachment file name, with encodings

You have succeeded in confusing me to the point I don't know whether my own proposed patch is useful anymore. It would seem to mean a change in philosophy, from "just implement the standards closely" to "give the application developer what he needs". But this goal would only be accomplished with many alterations to the codebase.

So if this is true:

Stephen J. Turnbull wrote:
> > 1) Why does not it currently return the *decoded* header?
>
> Because Message is an implementation of RFC 2822, which says nothing
> about decoding headers. It is very helpful to model your programs
> directly on the standards they claim to conform to.

I cannot see any documentation stating that the way of this project is to have each class implement one RFC. If there *were* a writeup on this somewhere, maybe I wouldn't have annoyed you with so many questions and absurd propositions.

But why would you lie to me? So it must be true... Then I don't agree with my own patch anymore.

Anyway, the necessity of a high-level interface remains and nobody has answered my question: whither?

Nando Florestan

Re: Patch: Improve recognition of attachment file name, with encodings

Nando writes:

> > Because Message is an implementation of RFC 2822, which says nothing
> > about decoding headers. It is very helpful to model your programs
> > directly on the standards they claim to conform to.

> I cannot see any documentation stating that the way of this project is
> to have each class implement one RFC. If there *were* a writeup on this
> somewhere, maybe I wouldn't have annoyed you with so many questions and
> absurd propositions.

But I didn't say that it was a general principle. I said that Message is an implementation of RFC 2822. For historical reasons it lives in the rfc822 module. Its main docstring says:

    RFC 2822 message manipulation.

    Note: This is only a very rough sketch of a full RFC-822 parser; in particular the tokenizing of addresses does not adhere to all the quoting rules.

    Note: RFC 2822 is a long awaited update to RFC 822. This module should conform to RFC 2822, and is thus mis-named (it's not worth renaming it). Some effort at RFC 2822 updates have been made, but a thorough audit has not been performed. Consider any RFC 2822 non-conformance to be a bug.

> Anyway, the necessity of a high-level interface remains and nobody has
> answered my question: whither?

As I wrote earlier, I think writing a *general* high-level interface is going to be very hard. I don't have time to devote to it. If you have a set of use cases that have a lot of common needs, then you can and should write a module to serve those needs. You'll probably find other people with similar needs, and some of them will help.

Re: Patch: Improve recognition of attachment file name, with encodings

Nando wrote:
> OK, I get question number 2 now. My question was:
>
> 2) Is there some flaw in decode_header()? Something that Thunderbird
> displays as "Eduardo & Mônica" is being decoded with the wrong character
> in place of the ô:
> repr(decode_header(m["subject"])[0][0])
> 'Eduardo & M\xf4nica'
> The header being tested is:
> Subject: =?iso-8859-1?Q?Eduardo_&_M=F4nica?=
> In case we are again doing the Right Thing, then why does Thunderbird
> display it the way it was intended?
>
>
> The answer is I have to use codecs.decode():
>
> import codecs
>
> In [20]: [(s, encoding)] = decode_header("=?iso-8859-1?Q?P=F4nei?=")
>
> In [21]: s
> Out[21]: 'P\xf4nei'
>
> In [22]: encoding
> Out[22]: 'iso-8859-1'
>
> In [23]: print codecs.decode(s, encoding)
> Pônei
>
> Well, that just makes it even harder to use the return value of the
> decode_header() function. And instead of encapsulating all that
> complexity in the email library, you are forcing every user of the
> library to find all this out by himself, just as I had to.

convert email addresses between their ASCII and Unicode representations can be found at http://stuartbishop.net/Software/EmailAddress/ (Barry - we should discuss getting code to do this into the standard library again. I think I opened a bug on this soon after I wrote it - in 2004!)

It is a bit of a learning curve, and I suspect that most users of the library have written the same or similar helpers, possibly several times, e.g. the nearly mandatory header decoder:

    import email.Header  # needed for email.Header.decode_header below

    def decode_header(s):
        '''Decode an RFC2047 email header into a Unicode string.'''
        s = email.Header.decode_header(s)
        # Each item is a (raw string, charset) pair; charset is None for plain text.
        s = [b[0].decode(b[1] or 'ascii') for b in s]
        return u''.join(s)

--
Stuart Bishop <stuart@...>
http://www.stuartbishop.net/
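
The helper above is written for the Python 2 of the time. A rough Python 3 equivalent might look like the following; this is a sketch under the assumption that `email.header.decode_header` returns `bytes` for encoded words and may return `str` for headers with no encoded words, and `decode_header_to_str` is a hypothetical name chosen to avoid shadowing the library function.

```python
from email.header import decode_header


def decode_header_to_str(raw):
    """Decode an RFC 2047 header value into a single str (Python 3)."""
    chunks = []
    for value, charset in decode_header(raw):
        # Encoded words come back as bytes; plain text may already be str.
        if isinstance(value, bytes):
            value = value.decode(charset or "ascii", errors="replace")
        chunks.append(value)
    return "".join(chunks)


print(decode_header_to_str("=?iso-8859-1?Q?Eduardo_&_M=F4nica?="))
```

Note that an unknown charset name would still raise `LookupError` here; a real helper would have to decide how to handle that case.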