Re: fixing the current email module
On Oct 3, 2009, at 9:26 AM, Timothy Farrell wrote:

> Back in June, David Murray posted the message below about fixing the
> email module. I have an interest in helping with this due to a
> personal project I'm working on. However, my ability to help is
> severely limited by my understanding of email and MIME RFCs.
>
> David asked the question of whether or not passing strings to the
> feedparser is a needed behavior. I don't claim to have enough
> knowledge to answer the question yes or no, but I would urge us all
> to consider that if no answer shows up that David's patch should be
> put in for no better reason than that it's better than what we
> currently have. in the meantime.

I firmly believe we need parallel feedparser APIs, one for feeding it strings and one for feeding it bytes. In all the tentative attempts at Python3-ification I've done I just keep coming back to that assessment. I don't think it's a terrible burden either since I also firmly believe that /internally/ the email package should be bytes-oriented.

So the basic model is: accept strings or bytes at the edges, process everything internally as bytes, output strings and bytes at the edges.

-Barry

_______________________________________________
Email-SIG mailing list
Email-SIG@...
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com
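For readers following along later: the "strings or bytes at the edges" model can be illustrated with the two feed parsers that later Python 3 releases ship in `email.feedparser` (`BytesFeedParser` arrived in Python 3.2, after this thread). This is only a sketch of the parallel-API idea, not the patch under discussion; the sample message and the helper names `parse_str` and `parse_bytes` are made up for illustration.

```python
from email.feedparser import BytesFeedParser, FeedParser

# Hypothetical sample message, used for both parsers.
RAW_TEXT = "Subject: hello\r\nFrom: a@example.com\r\n\r\nbody text\r\n"


def parse_str(text: str):
    """Feed a str to the string-oriented parser and return the message."""
    parser = FeedParser()
    parser.feed(text)
    return parser.close()


def parse_bytes(data: bytes):
    """Feed bytes to the bytes-oriented parser and return the message."""
    parser = BytesFeedParser()
    parser.feed(data)
    return parser.close()


msg_from_str = parse_str(RAW_TEXT)
msg_from_bytes = parse_bytes(RAW_TEXT.encode("ascii"))
print(msg_from_str["Subject"], msg_from_bytes["Subject"])
```

Both entry points hand back the same kind of message object, which is the property Barry is asking for at the edges.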
|
|
Re: fixing the current email module
Barry Warsaw writes:

> So the basic model is: accept strings or bytes at the edges,
> process everything internally as bytes, output strings and bytes at
> the edges.

In a certain pedantic sense, that can't be right, because bytes alone can't represent strings.

Practically, you are going to need to say how a bytes or bytearray is to be interpreted as a string, and that is going to be one big mess. (MIME?)

Going the other way around you have no such problem, or rather the trivial embedding works fine, except that you have to do a range check at some point before you convert to bytes.
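A small sketch of what the "trivial embedding" and the range check could look like, assuming the embedding meant here is the ISO 8859/1 (Latin-1) one, where byte N maps to code point N. The function names are hypothetical.

```python
def bytes_to_str(data: bytes) -> str:
    """Embed each byte 0-255 as the code point with the same number."""
    return data.decode("latin-1")


def str_to_bytes(text: str) -> bytes:
    """Reverse the embedding; the range check rejects code points above 255."""
    for index, char in enumerate(text):
        if ord(char) > 255:
            raise ValueError(
                f"code point {ord(char):#x} at index {index} is not a byte"
            )
    return text.encode("latin-1")


every_byte = bytes(range(256))
assert str_to_bytes(bytes_to_str(every_byte)) == every_byte
```

Going from bytes to str can never fail under this embedding; only the str-to-bytes direction needs the check.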
|
|
|
|
|
Re: fixing the current email module
On approximately 10/3/2009 10:09 AM, came the following characters from
the keyboard of Timothy Farrell: > I agree with Barry insofar as accepting bytes or strings on the input with internal processing in bytes and output bytes or strings depending on the content parsed. > > Forgive my ignorance...why does converting bytes to strings have to be a mess? Rather than having two Feedparsers, can't we just pass a default encoding when instantiating a feedparser and have it read from the MIME headers otherwise? If not encoding is passed and one can't be determined, simply output as bytes or try a default and raise an exception if it fails. > > If providing the default encoding, no such range check is needed. > > ----- Original Message ----- > From: "Stephen J. Turnbull" <stephen@...> > To: "Barry Warsaw" <barry@...> > Cc: "Timothy Farrell" <tfarrell@...>, email-sig@... > Sent: Saturday, October 3, 2009 10:41:48 AM GMT -06:00 US/Canada Central > Subject: Re: [Email-SIG] fixing the current email module > > Barry Warsaw writes: > > > So the basic model is: accept strings or bytes at the edges, > > process everything internally as bytes, output strings and bytes at > > the edges. > > In a certain pedantic sense, that can't be right, because bytes alone > can't represent strings. > > Practically, you are going need to say how a bytes or bytearray is to > be interpreted as a string, and that is going to be one big mess. > (MIME?) > > Going the other way around you have no such problem, or rather the > trivial embedding works fine, except that you have to do a range check > at some point before you convert to bytes. Email messages are bytes. Usually restricted to bytes in the range 32-127, but sometimes permitted to be 0-255 (8bit encoding). Email messages carry sufficient information to convert bytes to strings (usually; and sufficient defaults to cover the other cases adequately, even if not with 100% certainty). 
So if Barry is considering that the internal form is bytes, particularly bytes encoded via email RFCs, then I can't argue with that being a reasonable internal form.... except for one problem, 2 paragraphs below. The only mess that I can see Stephen referring to is the fact that the email RFCs define rather messy encoding formats and character set specifications. There isn't much cure for this, AFAICS, other than perhaps keeping the bytes in segmented structures, with cached metadata to speed repeated references. Using any other format than email format, means knowing how to translate that format to/from email format, and to/from API format... this means coding two translation routines instead of one. The choice of email RFC byte formats for the internal form makes it quick and easy to produce a complete message when called for, and to defer interpretation when a message is fed in.... sometimes, and herein lies the catch.... One problem with storing messages in bytes format: it seems to me that the choice of which of several legal email bytes formats to represent various email parts (texts and attachments) is problematical for using email format bytes as the internal storage format. An unsophisticated email library could assume that the transfer encoding is always 7bit, and that should be acceptable in all circumstances. A more sophisticated email library would provide support for either 7bit or 8bit transfer encodings.... but the choice of the bytes formats, and MIME type encodings of various message parts to support that difference would be significant. It seems that the present email lib provides only a way to create only a 7bit or 8bit message (and apparently not binary encoding), meaning that the whole message assembly process has to be done after initiating a connection with the SMTP server, to determine whether it supports 8bit (or binary) encoding or not. 
A more abstract internal format could defer that choice to the generate step, keeping items as str or binary blobs prior to that step.

IIUC, 7bit requires that text and binary be encoded to remove "difficult" byte values from the byte stream, so choosing quopri or base64 is appropriate at MIME part definition time to make that choice (although an optimal sized choice could be made based on the data), in the event that generate requests 7bit.

However, 8bit has no such requirement; it declares that there are no difficult characters except NUL, CR and LF. However, because no 8bit encodings are defined, the (inefficient, 7-bit) quopri or base64 may still have to be used to avoid lines that are too long, and to encode NUL, CR and LF. 8-bit and UTF-8 text containing no NUL characters and no long lines would qualify without encoding.

Finally, binary declares that there are no difficult characters at all. Therefore, the quopri or base64 choice could be ignored, and the raw data passed through.

Choosing a particular Content-Transfer-Encoding as the internal storage format forces transcoding to the other Content-Transfer-Encoding values on the fly after connecting to the SMTP server (using an apparently non-existent parameter to the generate method); not supporting on-the-fly transcoding would force the user to choose a particular Content-Transfer-Encoding up front, requiring the connection to the SMTP server even earlier in the process.

I observe that most of my SMTP providers do not support binary transport, but it seems that MS Exchange does.

I observe that binary transport is more efficient than 7bit or 8bit.

I observe that even with binary transport, the MIME headers must still be in US-ASCII, by definition, so the headers need not be generated differently for different transports... only the Content-Transfer-Encoding, and the content itself, would be affected by deferring that choice to generate time.
Perhaps binary transport, with meta-data indicating whether the user prefers quopri or base64 for parts that must be encoded for 7bit or 8bit transport, would be an appropriate storage format for the email library. This would allow the quopri or base64 encodings to be performed on-the-fly, only if needed, by adding a new parameter to generate, that specifies the Content-Transfer-Encoding (which should default to 7bit for maximal server compatibility, or 8bit if the user specified that along the way so that backwards compatibility is preserved). N.B. I note that the documentation for 2.6.3 section 19.1.3 MIMEtext function (reproduced below) is confusing: /class /email.mime.text.MIMEText(/_text/[, /_subtype/[, /_charset/]])¶ <http://docs.python.org/library/email.mime.html#email.mime.text.MIMEText> Module: email.mime.text A subclass of MIMENonMultipart <http://docs.python.org/library/email.mime.html#email.mime.nonmultipart.MIMENonMultipart>, the MIMEText <http://docs.python.org/library/email.mime.html#email.mime.text.MIMEText> class is used to create MIME objects of major type /text/. /_text/ is the string for the payload. /_subtype/ is the minor type and defaults to /plain/. /_charset/ is the character set of the text and is passed as a parameter to the MIMENonMultipart <http://docs.python.org/library/email.mime.html#email.mime.nonmultipart.MIMENonMultipart> constructor; it defaults to us-ascii. No guessing or encoding is performed on the text data. Changed in version 2.4: The previously deprecated /_encoding/ argument has been removed. Encoding happens implicitly based on the /_charset/ argument. The confusion is that it states there is no encoding performed, and then it states that encoding is implicit. It is not clear what it actually does, if anything. The 3.2a0 documentation further muddies the water by removing the last paragraph. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
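To make the "defer the transfer encoding to generate time" idea concrete, here is a rough sketch. It is not an existing email package API: `encode_for_transport` and its helpers are hypothetical names, and the 998-octet line limit is the one RFC 2045 gives for 7bit and 8bit data. The part is kept as raw bytes and only encoded if the transport in use cannot carry it as-is.

```python
import base64
import quopri

MAX_LINE = 998  # RFC 2045 limit on octets between CRLF pairs for 7bit/8bit


def _lines_ok(data: bytes) -> bool:
    """True if data has no NUL, no bare CR or LF, and no overlong line."""
    if b"\x00" in data:
        return False
    without_crlf = data.replace(b"\r\n", b"")
    if b"\r" in without_crlf or b"\n" in without_crlf:
        return False
    return all(len(line) <= MAX_LINE for line in data.split(b"\r\n"))


def fits_7bit(data: bytes) -> bool:
    return _lines_ok(data) and all(byte < 128 for byte in data)


def fits_8bit(data: bytes) -> bool:
    return _lines_ok(data)


def encode_for_transport(data: bytes, transport: str, prefer: str = "base64"):
    """Return (Content-Transfer-Encoding, body) for the given transport.

    transport is "7bit", "8bit" or "binary"; prefer is the user's stored
    preference ("base64" or "quoted-printable") for parts that need encoding.
    """
    if transport == "binary":
        return "binary", data  # raw data passes through untouched
    if fits_7bit(data):
        return "7bit", data
    if transport == "8bit" and fits_8bit(data):
        return "8bit", data
    if prefer == "quoted-printable":
        return "quoted-printable", quopri.encodestring(data)
    return "base64", base64.encodebytes(data)


print(encode_for_transport("h\xe9llo\r\n".encode("latin-1"), "8bit")[0])
```

The stored form never changes; only the generate step looks at what the server accepts, which is the design choice being argued for above.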
|
|
Re: fixing the current email module
In the following I use Python 3 terminology: strings are Python
Unicode objects, and bytes are Python bytes objects.

Glenn Linderman writes:

> Email messages are bytes. Usually restricted to bytes in the range
> 32-127, but sometimes permitted to be 0-255 (8bit encoding).

This is irrelevant to our internal representation. It is both trivial and efficient to convert the wire format (bytes) to a string internally (at least for email messages up to say 5MB).

Which internal representation makes the most sense depends on what we are going to do with that internal representation. At this point I'm not sure that strings are better than bytes, but I'm quite sure that I've seen no convincing argument that bytes are TOOWTDI.

Nor is it at all obvious to me that messages should be stored in wire format.

> Using any other format than email format, means knowing how to
> translate that format to/from email format, and to/from API
> format... this means coding two translation routines instead of
> one.

That sounds reasonable, but it's a false economy. The formats you're talking about here are the transfer encodings, and we need to be able to decode all of them, and produce all of them. Internally, they can be represented by a single format, so you need internal-to-transfer and transfer-to-internal for about six of them (7bit, 8bit, binary == Python bytes, BASE64, quoted-printable, Python string).

As for runtime economy, if conversion is done once at parse time and once at generate time it is not a big burden, not as compared to the overhead of the Python language itself.

> The choice of email RFC byte formats

By "byte format", do you mean "wire format"?

> for the internal form makes it quick and easy to produce a complete
> message when called for,

Only for certain kinds of messages, such as automated forwards and signed MIME parts, and cron's messages. For those, there are great advantages to spewing things verbatim as you got them off the wire or the disk.
But even there, as long as we use the natural embedding of bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not particularly inefficient to use strings. For anything else, storing in wire format is going to require checking format (of the stored data if the format is variable, and always of the requesting API) on all attribute accesses, and conversions on many, even most attribute accesses. > One problem with storing messages in bytes format: it seems to me that > the choice of which of several legal email bytes formats None of them are very happy. The email module needs to be able to both read and produce all of 7bit, 8bit, and binary, and they are in fact pretty well trivial to do. So the question to me is "what are the primary use cases for the email module, and how do they affect the choice of internal representation?" I can't claim special expertise on "how", I'll leave that up to Barry. Here are some use cases I can think of. 1. Debugging programs using the email module. Maybe that's a +1 for internally storing textual data in string form. 2. MUA #1: Composition. Input will be strings and multimedia file names, output will be bytes. Will attributes of message objects be manipulated? Not in a conventional MUA, but an email-based MUA might find uses for that. 3. MUA #2: Reading. Input will often be bytes (spool files, IMAP data). Could be strings, though, depending on the internal format of folders. Output will be strings and multimedia objects. Lots of string processing, especially generating folder directory displays from message headers. 4. Mailing list processor. Message input will be bytes. Configuration input, including heading and footer texts that may be added are likely to be strings. Header manipulation (adding topics, sequence numbers, RFC 2369 headers) most conveniently done with strings. Output will be bytes. 5. Mailing list archiver. 
Input will be bytes or message objects, output will be strings (typically HTML documents or XML fragments).

6. Spam/virus detection. Input may be bytes or message objects. Lots of internal string processing; in most cases the text/* parts need to be converted to strings before grepping; in some cases even images or executables may be reconstituted to look for malware signatures. Output may be a flag or signal, or the message itself may be edited (typically to provide headers recording degree of spamminess, trace headers, maybe a body heading; in some cases, a new message may be generated with the suspected spam as a message/rfc822 MIME body part).
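Several of the use cases above (the MUA folder display in 3, the archiver in 5) come down to turning a header from its wire form into a string. A short sketch with the standard `email.header` helpers; the wrapper name `header_to_str` and the sample header value are made up for illustration.

```python
from email.header import decode_header, make_header


def header_to_str(raw_value: str) -> str:
    """Decode RFC 2047 encoded-words in a header value into a plain str."""
    return str(make_header(decode_header(raw_value)))


# Hypothetical Subject value as it might appear on the wire.
print(header_to_str("=?utf-8?q?caf=C3=A9_menu?="))
```

Whether that decoding happens at parse time or lazily on attribute access is exactly the design question the use cases are meant to inform.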
|
|
Re: fixing the current email module
On approximately 10/6/2009 7:18 AM, came the following characters from
the keyboard of Stephen J. Turnbull: > In the following I use Python 3 terminology: strings are Python > Unicode objects, and bytes are Python bytes objects. > > Glenn Linderman writes: > > > Email messages are bytes. Usually restricted to bytes in the range > > 32-127, but sometimes permitted to be 0-255 (8bit encoding). > > This is irrelevant to our internal representation. It is both trivial > and efficient to convert the wire format (bytes) to a string > internally (at least for email messages up to say 5MB). > > Which internal representation makes the most sense depends on what we > are going to do with that internal representation. At this point I'm > not sure that strings are better than bytes, but I'm quite sure that > I've seen no convincing argument that bytes are TOOWTDI. > > Nor is it at all obvious to me that should be stored in wire format. > Yes, I interpreted, possibly misinterpreted, Barry's comment about storing things as bytes, as that he was figuring to store them in wire format. > > Using any other format than email format, means knowing how to > > translate that format to/from email format, and to/from API > > format... this means coding two translation routines instead of > > one. > > That sound reasonable, but it's a false economy. And this was actually the point I was trying to make. > The formats you're > talking about here are the transfer encodings, and we need to be able > to decode all of them, and produce all of them. Internally, they can > be represented by a single format, so you need internal-to-transfer > and transfer-to-internal for about six of them (7bit, 8bit, binary == > Python bytes, BASE64, quoted-printable, Python string) > Not all formats apply to all MIME types, but I think you've enumerated the list. > As for runtime economy, if conversion is done once at parse time and > once at generate time it is not a big burden, not as compared to the > overhead of the Python language itself. 
> I would tend to agree with that, except that if something is received/provided in a particular format, it might want to stay in that format until such time it is needed in a different format... and then the appropriate set of conversions (current format => internal format => needed format) applied as needed, avoiding all conversions when it is already in the needed format. > > The choice of email RFC byte formats > > By "byte format", do you mean "wire format"? > Sure, RFC byte formats == wire format. > > for the internal form makes it quick and easy to produce a complete > > message when called for, > > Only for certain kinds of messages, such as automated forwards and > signed MIME parts, and cron's messages. For those, there are great > advantages to spewing things verbatim as you got them off the wire or > the disk. But even there, as long as we use the natural embedding of > bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not > particularly inefficient to use strings. > two conversions are slower than none, and use 2-4 times the space in string format. > For anything else, storing in wire format is going to require checking > format (of the stored data if the format is variable, and always of > the requesting API) on all attribute accesses, and conversions on > many, even most attribute accesses. > One has to write the conversion code anyway; it is just a matter of where it is called. Once converted, meta data could be retained in its natural format. > > One problem with storing messages in bytes format: it seems to me that > > the choice of which of several legal email bytes formats > > None of them are very happy. The email module needs to be able to > both read and produce all of 7bit, 8bit, and binary, and they are in > fact pretty well trivial to do. > > So the question to me is "what are the primary use cases for the email > module, and how do they affect the choice of internal representation?" 
> I can't claim special expertise on "how", I'll leave that up to > Barry. Here are some use cases I can think of. > Yes this is a good question. > 1. Debugging programs using the email module. Maybe that's a +1 for > internally storing textual data in string form. > > 2. MUA #1: Composition. Input will be strings and multimedia file > names, output will be bytes. Will attributes of message objects > be manipulated? Not in a conventional MUA, but an email-based MUA > might find uses for that. > I'm not sure what an email-based MUA is.... seems to me even a conventional MUA is "email-based"??? > 3. MUA #2: Reading. Input will often be bytes (spool files, IMAP > data). Could be strings, though, depending on the internal format > of folders. Output will be strings and multimedia objects. Lots > of string processing, especially generating folder directory > displays from message headers. > > 4. Mailing list processor. Message input will be bytes. > Configuration input, including heading and footer texts that may > be added are likely to be strings. Header manipulation (adding > topics, sequence numbers, RFC 2369 headers) most conveniently done > with strings. Output will be bytes. > But the bulk of the message parts, received in wire format, may not need to be altered to be sent along in the same wire format. Headers must be manipulated somehow, I'd think it would be convenient as strings too. Heading and footing texts are configured boilerplate, and could be cached in a variety of formats to avoid the need to convert them for each message, and could then be obtained from the cache in the appropriate format for this particular message, and prepended or appended as appropriate. > 5. Mailing list archiver. Input will be bytes or message objects, > output will be strings (typically HTML documents or XML > fragments). > An archiver could archive wire format, and do the conversions to *ML on the fly for those messages that might be accessed that way. 
Depends on the expectation of the usage of the archiver... to retrieve the archived messages via email, wire format could be extremely efficient; to retrieve via HTTP, one should note that there is very little difference between .eml format (another name for wire format) and .mhtml format (which is a format IE and Opera will display natively, support in other browsers varies, mostly via addons and conversion utilities). So I'm not at all sure that this use case requires string output, although some implementations might prefer it.

> 6. Spam/virus detection. Input may be bytes or message objects.
> Lots of internal string processing; in most cases the text/* parts
> need to be converted to strings before grepping; in some cases
> even images or executables may be reconstituted to look for
> malware signatures. Output may be a flag or signal, or the
> message itself may be edited (typically to provide headers
> recording degree of spamminess, trace headers, maybe a body
> heading; in some cases, a new message may be generated with the
> suspected spam as a message/rfc822 MIME body part).

So it seems to me that storing the data in the format provided, and converting it to native format when requested and caching that result, and then when generating wire format, if the needed format was not provided or cached, then converting as necessary, would be optimal to minimize conversion (time) costs. This technique would also maximally preserve the original format for use cases 3 and 5, which, for use case 3, at least, seems to be important to this list from past discussion.

To minimize memory (space) costs, the caching could be avoided (causing reconversion costs), or, at the expense of not preserving the original format, once converted, retain only the native format of the item (which is generally the smallest, for binary objects, and which is most easily manipulated, but not necessarily smallest, for text objects).
So I'd design the internal format with meta data like

    MIMEpart
    formatFlag
    metaData
    7bitData
    8bitData
    binaryData
    nativeText
    nativeBLOB

where the metaData would consist of a variety of pertinent items, obtained by decoding provided wireData or supplied along with provided nativeData. Generate could use 7bitData, 8bitData, or binaryData directly if it exists, or cache it there if it didn't already exist. binaryData would differ from nativeBLOB only by containing the appropriate MIMEheaders... perhaps as a space optimization, it would contain only the appropriate MIMEheaders, with the binaryData being placed in nativeBLOB directly (since this is not a costly conversion, just a choice of where to store the bytes).

It could also be possible that a complete, provided, wire format message would be retained as a single BLOB, and the appropriate format data items simply be offsets and lengths within that BLOB, although with cached metaData.

Of course, there is already a design within the existing code, and the cost of wholesale redesign may be more than can be afforded.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
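A stripped-down sketch of the "store what was provided, convert and cache on demand" layout described above. It models only two of the listed slots, and it assumes the 7bit form is base64; the class and attribute names are hypothetical and do not correspond to anything in the existing email package.

```python
import base64
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MimePart:
    """Keeps a part in whichever form it arrived in; converts lazily."""

    meta_data: dict = field(default_factory=dict)
    seven_bit_data: Optional[bytes] = None  # base64 body, as received or cached
    native_blob: Optional[bytes] = None     # decoded binary content

    def get_native_blob(self) -> bytes:
        """Return the decoded content, decoding and caching it if needed."""
        if self.native_blob is None:
            if self.seven_bit_data is None:
                raise ValueError("part has no data in any format")
            self.native_blob = base64.decodebytes(self.seven_bit_data)
        return self.native_blob

    def get_seven_bit_data(self) -> bytes:
        """Return the base64 form, encoding and caching it if needed."""
        if self.seven_bit_data is None:
            if self.native_blob is None:
                raise ValueError("part has no data in any format")
            self.seven_bit_data = base64.encodebytes(self.native_blob)
        return self.seven_bit_data


wire = base64.encodebytes(b"\x00\x01binary")
part = MimePart({"content-type": "application/octet-stream"}, seven_bit_data=wire)
assert part.get_seven_bit_data() is wire  # no conversion when already in that form
```

A part that arrives in wire form and leaves in wire form is never converted, which is the time saving being claimed; the space cost is that both forms may end up cached.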
|
|
Re: fixing the current email module
Glenn Linderman writes:

> Yes, I interpreted, possibly misinterpreted, Barry's comment about
> storing things as bytes, as that he was figuring to store them in wire
> format.

What that means is unclear, though. Does a "header in wire format" mean before or after MIME encoding? Probably after, but that's pretty useless for the purpose of editing the header. Does it include the tag (the part before the colon) or not? Etc.

> I would tend to agree with that, except that if something is
> received/provided in a particular format, it might want to stay in that
> format until such time it is needed in a different format... and then
> the appropriate set of conversions (current format => internal format =>
> needed format) applied as needed, avoiding all conversions when it is
> already in the needed format.

If you mean that the email module will keep track of what form the object is currently represented by, that will eventually result in "UnicodeError: octet out of range: 161, ascii".

> two conversions are slower than none, and use 2-4 times the space in
> string format.

Let's get this correct, *then* optimize, please.

> One has to write the conversion code anyway; it is just a matter of
> where it is called. Once converted, meta data could be retained in its
> natural format.

Meta data for what? Why would you convert meta data?

> > 2. MUA #1: Composition. Input will be strings and multimedia file
> > names, output will be bytes. Will attributes of message objects
> > be manipulated? Not in a conventional MUA, but an email-based MUA
> > might find uses for that.
>
> I'm not sure what an email-based MUA is.... seems to me even a
> conventional MUA is "email-based"???

Only if it's written using the Python email module.

> > 4. Mailing list processor. Message input will be bytes.
> > Configuration input, including heading and footer texts that may
> > be added are likely to be strings. Header manipulation (adding
> > topics, sequence numbers, RFC 2369 headers) most conveniently done
> > with strings. Output will be bytes.
>
> But the bulk of the message parts, received in wire format, may not need
> to be altered to be sent along in the same wire format.

That depends. For example, multimedia parts may simply be discarded, in which case it makes sense to not convert them. However, most Mailman lists do add a footer, and because of crappy Windows MUAs that don't implement MIME correctly, it's preferred to add that by concatenating as text. That simply cannot be done correctly in wire format for any character set except ISO 8859/1.

> Heading and footing texts are configured boilerplate, and could be
> cached in a variety of formats to avoid the need to convert them for
> each message,

Premature optimization is the root of all error.

> An archiver could archive wire format,

Are you suggesting that the email module should mandate that? We have a severe tail-dog inversion problem here.
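To illustrate the footer point: a minimal sketch, assuming the body has already been transfer-decoded and its charset is known. `add_footer` is a hypothetical helper, not Mailman's code. Appending as text and re-encoding works for any charset, while gluing Latin-1 footer bytes onto a UTF-8 body produces bytes that no longer decode.

```python
def add_footer(body: bytes, charset: str, footer: str) -> bytes:
    """Decode the body, append the footer as text, re-encode in the same charset."""
    return (body.decode(charset) + footer).encode(charset)


body = "\u65e5\u672c\u8a9e\n".encode("utf-8")   # sample UTF-8 body
footer = "caf\u00e9 list\n"                        # footer configured as a string

good = add_footer(body, "utf-8", footer)
assert good.decode("utf-8").endswith(footer)

# Naive byte-level concatenation with the footer in a different charset:
naive = body + footer.encode("latin-1")
try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False
print(naive_is_valid)
```

The sketch leaves out transfer encodings entirely; with base64 or quoted-printable bodies the decode and re-encode steps would wrap this function.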
|
|
Re: fixing the current email module
Timothy Farrell <tfarrell@...> wrote:

> Back in June, David Murray posted the message below about fixing the
> email module. I have an interest in helping with this due to a
> personal project I'm working on. However, my ability to help is
> severely limited by my understanding of email and MIME RFCs.

Tim, familiarity with email and MIME RFCs would be a big help if you want to help with the email module. Even for writing test cases.

Bill
|
|
Re: fixing the current email module
On approximately 10/6/2009 5:30 PM, came the following characters from
the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > > Yes, I interpreted, possibly misinterpreted, Barry's comment about > > storing things as bytes, as that he was figuring to store them in wire > > format. > > What that means is unclear, though. Does a "header in wire format" > mean before or after MIME encoding? Probably after, but that's pretty > useless for the purpose of editing the header. Does it include the > tag (the part before the colon) or not? Etc. > > > I would tend to agree with that, except that if something is > > received/provided in a particular format, it might want to stay in that > > format until such time it is needed in a different format... and then > > the appropriate set of conversions (current format => internal format => > > needed format) applied as needed, avoiding all conversions when it is > > already in the needed format. > > If you mean that the email module will keep track of what form the > object is currently represented by, that will eventually result in > "UnicodeError: octet out of range: 161, ascii". > The above sentence does not communicate your meaning to me... or any meaning, actually. Can you explain? If conversions are avoided, then octets are unlikely to be out of range? And the email module must be aware of the form of the data in order to manipulate it in any format other than wire format, but fortunately, wire format declares the format of the data (not to say there is not buggy wire format data -- but that is an issue best avoided by avoiding as many conversions as possible). > > two conversions are slower than none, and use 2-4 times the space in > > string format. > > Let's get this correct, *then* optimize, please. > That's a nice platitude... I could have used it on you when you said > As for runtime economy, if conversion is done once at parse time and > once at generate time it is not a big burden, not as compared to the > overhead of the Python language itself. but I didn't. 
You can't design things totally ignoring the reality of time and space performance, and expect to get an efficient result. I agree one can spend too much time on premature optimization issues, and I have that tendency, but if you totally ignore time and space issues, you wind up with Vista. > > One has to write the conversion code anyway; it is just a matter of > > where it is called. Once converted, meta data could be retained in its > > natural format. > > Meta data for what? Why would you convert meta data? > Meta data for the email message... how many MIME parts, their Content-Types, etc. This is small amounts of data, but reasonably likely to referenced multiple times during the message parsing or creation and generation process. So once it is converted from wire format, it should be kept in a useful format, as well as wire format. > > > 2. MUA #1: Composition. Input will be strings and multimedia file > > > names, output will be bytes. Will attributes of message objects > > > be manipulated? Not in a conventional MUA, but an email-based MUA > > > might find uses for that. > > > > I'm not sure what an email-based MUA is.... seems to me even a > > conventional MUA is "email-based"??? > > Only if it's written using the Python email module. > Um. Aren't we talking about use cases for the Python email module? I was trying to interpret what you were saying in that light. Sure, what a conventional (not written using the Python email module) MUA does, is mostly irrelevant, except so far as it shows use cases that might be applied to email-based (written using the Python email module) MUAs. > > > 4. Mailing list processor. Message input will be bytes. > > > Configuration input, including heading and footer texts that may > > > be added are likely to be strings. Header manipulation (adding > > > topics, sequence numbers, RFC 2369 headers) most conveniently done > > > with strings. Output will be bytes. 
> > > > > > > But the bulk of the message parts, received in wire format, may not need > > to be altered to be sent along in the same wire format. > > That depends. For example, multimedia parts may simply be discarded, > in which case it makes sense to not convert them. However, most > Mailman lists do add a footer, and because of crappy Windows MUAs that > don't implement MIME correctly, it's preferred to add that by > concatenating as text. That simply cannot be done correctly in wire > format for any character set except ISO 8859/1. > Huh? First off, which "crappy Windows MUAs" don't implement MIME correctly, and what do they do wrong? When I look at wire format emails, I'm mostly appalled by the stuff generated by Apple Mail. I have seen a few doozies from Outlook 2000, but they seem to be fixed in newer versions. Adding a header or trailer does require knowledge of the character set and encoding of the message. Given that, you can decode to str, add the header or trailer and encode back to MIME. So that's the inefficient proof of concept. In the identity or quopri encodings, it is possible to add similarly encoded headers and trailers correctly to text/plain parts through normal concatenation. Adding headers to base64 encoding requires that the encoded header be an exact number of base64 lines, or at least a multiple of 3 characters and that you shuffle the line layout through the whole base64 body... it is not clear that this is worth the work. Adding trailers to base64 encoding requires decoding the final partial encoding, noticing how much room is left on that last line, and the encoding from there on... so it is not possible to cache an encoded base64 footer, although it would be possible to cache 3 of them, and only have to tweak the merge and choose the right one of the three and then reshuffle. 
So since text/plain is seldom encoded in base64, and base64 is so complex to concatenate to in wire format, I'd think it would be a better choice to decode and reencode to concatenate headers or footers to base64 encoded MIME parts.... unless immense base64 encoded MIME parts are expected to be common enough to develop the optimized logic. text/html is trickier, whether encoded or not. You have to parse past any stuff that precedes <body>, and place the header after that, and then you have to find the </body> and place the trailer before that. And unless you run the HTML through a validity checker, you can't be sure that the trailer will even show up, much less actually at the bottom, due to the possibility of unclosed tags within the body. To parse even quopri encoded HTML gets tricky, and basically impossible for base64 encoded HTML. So the first text/html part likely will need to be decoded for adding headers and trailers, if it is an alternative to the text/plain part, or there is no text/plain part. I've seen some systems add an additional MIME part to place a trailer in, and that can be pretty effective for MUAs that will show multiple parts in-line, but there are so many MUAs out there, that it is extremely difficult to make any certain declarations regarding what the user sees as a result. And, ISO 8859/1 is an 8-bit character set, so would require encoding on a 7bit transfer. But it is not unique; if you know how to do ISO 8859/1 concatenation in wire format, then you can do the whole class of ASCII+128 more character sets in the same manner. Not to mention that ASCII itself works fine in wire format. And so does UTF-8. It is just a matter of matching the character set and the encoding. > > Heading and footing texts are configured boilerplate, and could be > > cached in a variety of formats to avoid the need to convert them for > > each message, > > Premature optimization is the root of all error. > Yeah, yeah. I said "could", not "must". 
I was pushing back from your declaration that:

> Configuration input, including heading and footer texts that may
> be added are likely to be strings.

Such configuration texts are likely to be provided as strings, but there is nothing to prevent them from being converted to other formats. Premature optimization may or may not be the root of all error, but discarding perfectly valid design possibilities based on how the input might be supplied seems a similar error. I'm not declaring which design is best, just that there are alternatives.

> > An archiver could archive wire format,
>
> Are you suggesting that the email module should mandate that? We have
> a severe tail-dog inversion problem here.

Absolutely not. I said "could", not "must". The archiver can do what it wants. The email library should provide access to the message data in all useful formats, so that the archiver can do what it wants. The archiver needs to choose its design and optimizations appropriate for its expected use cases. I was pushing back from your declaration that an archiver would always want string output.... you said:

> 5. Mailing list archiver. Input will be bytes or message objects,
> output will be strings (typically HTML documents or XML
> fragments).

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
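Glenn's point in this message about appending a trailer to a base64-encoded part can be illustrated with a small sketch. It is not code from the thread: `append_footer_b64` is a hypothetical helper name, and the sample body and footer are made up. It shows the "decode, concatenate, re-encode" route he calls the inefficient proof of concept, and why simply gluing two independently encoded chunks together leaves padding in the middle of the stream.

```python
import base64


def append_footer_b64(encoded_body: bytes, footer: bytes) -> bytes:
    """Decode a base64 body, append the raw footer, and re-encode.

    Hypothetical helper for illustration only.
    """
    raw = base64.b64decode(encoded_body)
    # encodebytes() re-wraps the output into lines of at most 76 characters.
    return base64.encodebytes(raw + footer)


body = b"Hello, list!\n"          # 13 bytes, not a multiple of 3
footer = b"-- \nlist footer\n"

encoded_body = base64.encodebytes(body)

# Naive wire-format concatenation: each chunk carries its own '=' padding,
# so padding ends up in the middle of the combined stream.
naive = encoded_body + base64.encodebytes(footer)
padding_in_middle = b"=" in naive.rstrip(b"=\n")

# Decode, append, re-encode: a single well-formed base64 stream.
correct = append_footer_b64(encoded_body, footer)
round_trip_ok = base64.b64decode(correct) == body + footer
```

The naive form only happens to be safe when the body length is a multiple of 3 bytes and the line layout is reshuffled afterwards, which is the bookkeeping Glenn describes.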
|
|
Re: fixing the current email module
Glenn Linderman writes:
> > If you mean that the email module will keep track of what form the
> > object is currently represented by, that will eventually result in
> > "UnicodeError: octet out of range: 161, ascii".
>
> The above sentence does not communicate your meaning to me... or any
> meaning, actually. Can you explain?

Yes, that Unicode error is one that took years for Mailman to work around. If we are going to be converting different objects at different times, I'm sure we'll get to see it again in the future. Oh, joy.

> If conversions are avoided, then octets are unlikely to be out of
> range?

Haven't looked in your spam bucket recently, I guess. Spammers regularly put 8 bit characters into headers (and into bodies in messages without a Content-Type header), for one thing.

> And the email module must be aware of the form of the data in
> order to manipulate it in any format other than wire format, but
> fortunately, wire format declares the format of the data (not to say
> there is not buggy wire format data -- but that is an issue best avoided
> by avoiding as many conversions as possible).

"Best" I can't speak to; you obviously are willing to accept a much higher error rate than I am. "Robust" handling of buggy wire format data means that the email module must do something sane with it before giving it to the application. Maybe it's reasonable to do that lazily, and/or cache the result, but access to bogus data (that the email module can determine is bogus or suspicious) must not be allowed unless the client says "hit me with your best shot" explicitly. Most clients are simply not going to be prepared for the kind of crap I see in /var/mail/turnbull every day.

> I was pushing back from your declaration that an archiver would
> always want string output

Please don't push back; we won't get anywhere. Use cases are *examples*, not complete specifications of all possible inputs and outputs. Use cases should be simple and clear cut. If you want a different use case, state it.
In fact in the real world, *all* of the archivers I know of produce text formats on disk, either deleting multimedia objects or saving them off and linking to them via URLs in the text. If you know of a different kind of archiver, add it as a use case.
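The failure mode Stephen describes, an octet such as 161 turning up where ASCII was expected, is easy to reproduce in Python 3. The sketch below is only an illustration with a made-up header; it also shows the `surrogateescape` error handler (PEP 383) as one possible way to carry such octets through a str without losing them, which is an assumption about a workable approach rather than anything decided in this thread.

```python
# A header with a raw 8-bit octet: 0xA1 == 161, which is not valid ASCII.
raw_header = b"Subject: \xa1Hola!"

# Strict decoding fails at the first out-of-range octet.
try:
    raw_header.decode("ascii")
    failed = False
    bad_octet = None
except UnicodeDecodeError as exc:
    failed = True
    bad_octet = raw_header[exc.start]   # the offending byte value

# surrogateescape maps each undecodable octet to a lone surrogate code point,
# so the text can be handled as str and later turned back into the same bytes.
text = raw_header.decode("ascii", errors="surrogateescape")
round_tripped = text.encode("ascii", errors="surrogateescape")
```

Here the conversion is lossless, but the resulting str is not something an application can safely print or re-encode as UTF-8, so the library still has to decide when a client is allowed to see it.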
|
|
Re: fixing the current email module
On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote:
> Haven't looked in your spam bucket recently, I guess. Spammers
> regularly put 8 bit characters into headers

Legitimate but stupid programs do this as well. Think of phpBB-like forums written by programmers who never understood how non-ASCII text can be put into the Subject field or filenames: they send amazingly crippled emails.

Oleg.
--
Oleg Broytman http://phd.pp.ru/ phd@...
Programmers don't die, they just GOSUB without RETURN.
|
|
Re: fixing the current email module
Oleg Broytman writes:
> On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote:
> > Haven't looked in your spam bucket recently, I guess. Spammers
> > regularly put 8 bit characters into headers
>
> Legitimate but stupid programs do this as well.

Sure, but Glenn may not be subscribed to any of those. *Everybody* is subscribed to spam, though.

I'll-let-you-decide-what-kind-of-smiley-that-needs-ly y'rs,
|
|
Re: fixing the current email module
On Wed, Oct 7, 2009 at 11:19 PM, Stephen J. Turnbull <stephen@...> wrote:
You'd be amazed how many MUAs shipped by major companies are broken. MS Entourage, anyone?
noone-mention-the-nested-multiparts-with-the-same-boundary-tag-on-both-levels-ly y'rs.
|
|
Re: fixing the current email module
[Stephen J. Turnbull]
> Yes, that Unicode error is one that took years for Mailman to work
> around. If we are going to be converting different objects at
> different times, I'm sure we'll get to see it again in the future.

In my opinion, the email module should never raise an exception as a result of working with a malformed message, though it should certainly make the information that a message was malformed available for the calling program to check.

That is, I think that it's extremely unlikely that the calling program wants to blow up as a result of a malformed message. Very probably, it wants to make what sense of the message that it can.

The number of ways in which a message can be malformed is pretty large and just how (and when, as has been mentioned) any particular error will cause problems for the module is really a matter that's internal to the module. The module's user shouldn't have to say, "Over here I have to trap UnicodeErrors and over there I have to trap IndexErrors".

Regards,
Matt
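For readers wondering what "make the information available for the calling program to check" can look like, the standard library's email package records parse problems in a `defects` list on the message object instead of raising. The sketch below uses a made-up message that claims to be multipart but declares no boundary; it is an illustration under the default (lenient) parsing behaviour in current Python 3, not a statement about what the 2009 code did.

```python
import email
from email import errors

# Claims to be multipart, but no boundary parameter is given.
raw = (
    "From: a@example.com\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "no boundary anywhere\n"
)

# Default policy: the parser does not raise, it records what went wrong.
msg = email.message_from_string(raw)

# Each defect is an instance of a class from email.errors.
defect_names = [type(d).__name__ for d in msg.defects]
has_boundary_defect = any(
    isinstance(d, errors.NoBoundaryInMultipartDefect) for d in msg.defects
)

# The body is still available to the caller as plain text.
payload = msg.get_payload()
```

The caller can inspect `msg.defects` and decide for itself whether the message is usable, which is the behaviour Matt argues for.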
|
|
Re: fixing the current email module
On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote:
> In my opinion, the email module should never raise an exception as a
> result of working with a malformed message. Though it should
> certainly make the information that a message was malformed available
> for the calling program to check.

I disagree. The email package is not a user agent, and exceptions are *the* way to indicate there are problems.

> That is, I think that it's extremely unlikely that the calling
> program wants to blow up as a result of a malformed message.

Then the calling program must catch all exceptions and process them in a way that is reasonable for that particular application. But the email package certainly must not dictate which ways are reasonable; they are too application-specific.

> Very
> probably, it wants to make what sense of the message that it can.

Yes, if the email package can parse a message in some way, fine. You can help by creating more intelligent parser(s). But if a parser stumbles upon an unparseable block, it must raise an exception.

Oleg.
--
Oleg Broytman http://phd.pp.ru/ phd@...
Programmers don't die, they just GOSUB without RETURN.
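Oleg's preference, raising on anything the parser cannot make sense of, corresponds to what current Python 3 offers as an opt-in: `email.policy.strict`, which sets `raise_on_defect=True`. The sketch below is an illustration only; `parse_strict` is a hypothetical wrapper name and the messages are made up.

```python
import email
from email import errors, policy


def parse_strict(text):
    """Parse with the strict policy; return (message, defect).

    Hypothetical wrapper for illustration: exactly one of the two
    returned values is None.
    """
    try:
        return email.message_from_string(text, policy=policy.strict), None
    except errors.MessageDefect as defect:
        # With raise_on_defect=True the parser raises the defect object itself.
        return None, defect


# Claims to be multipart but declares no boundary: rejected under strict.
malformed = (
    "From: a@example.com\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "no boundary anywhere\n"
)
bad_msg, bad_defect = parse_strict(malformed)

# A well-formed message parses normally under the same policy.
good_msg, good_defect = parse_strict("Subject: hi\n\nbody\n")
```

Because the strictness lives in a policy object chosen by the caller, the application rather than the library decides whether a malformed message is fatal.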
|
|
Re: fixing the current email module
On approximately 10/7/2009 3:33 AM, came the following characters from
the keyboard of Stephen J. Turnbull: > Glenn Linderman writes: > > > > If you mean that the email module will keep track of what form the > > > object is currently represented by, that will eventually result in > > > "UnicodeError: octet out of range: 161, ascii". > > > > The above sentence does not communicate your meaning to me... or any > > meaning, actually. Can you explain? > > Yes, that Unicode error is one that took years for Mailman to work > around. If we are going to be converting different objects at > different times, I'm sure we'll get to see it agin in the future. Oh, > joy. > Ah, a historical remark! So that's why it was lost on me, I'm new to the Python world (but programming since 1975...) > > If conversions are avoided, then octets are unlikely to be out of > > range? > > Haven't looked in your spam bucket recently, I guess. Spammers > regularly put 8 bit characters into headers (and into bodies in > messages without a Content-Type header), for one thing. > I'm aware of that, but if conversions are not done, octets are unlikely to be _reported_ to be out of range.... > > And the email module must be aware of the form of the data in > > order to manipulate it in any format other than wire format, but > > fortunately, wire format declares the format of the data (not to say > > there is not buggy wire format data -- but that is an issue best avoided > > by avoiding as many conversions as possible). > > "Best" I can't speak to; you obviously are willing to accept a much > higher error rate than I am. "Robust" handling of buggy wire format > data means that the email module must do something sane with it before > giving it to the application. Maybe it's reasonable to do that > lazily, and/or cache the result, but access to bogus data (that the > email module can determine is bogus or suspicious) must not be allowed > unless the client says "hit me with your best shot" explicitly. 
Most > clients are simply not going to be prepared for the kind of crap I see > in /var/mail/turnbull every day. > Are you referring to most email clients, or most Python-email-library-using clients? It seems like most email clients are being hit with the same stuff you are seeing... every day... and are handling it somehow... although anti-spam filters do eliminate some of it before the end user's MUA sees it, depending on the ISP, etc. Is it your point of view, then, that incorrectly formed email should be mostly treated as SPAM? Your paragraph above could be interpreted that way. Oleg's point is also valid though, so it seems that isn't your point of view. Your "hit me with your best shot" comment indicates that you want a failure code or exception when the data is bad, and then a way to "retry accepting errors"? > > I was pushing back from your declaration that an archiver would > > always want string output > > Please don't push back; we won't get anywhere. Use cases are > *examples*, not complete specifications of all possible inputs and > outputs. Use cases should be simple and clear cut. If you want a > different use case, state it. In fact in the real world, *all* of the > archivers I know of produce text formats on disk, either deleting > multimedia objects or saving them off and linking to them via URLs in > the text. If you know of a different kind of archiver, add it as a > use case. > I misunderstood the purpose of your list. Sure, everything in your list is a good example of real world uses. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking _______________________________________________ Email-SIG mailing list Email-SIG@... Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com |
|
|
Re: fixing the current email module
[me]
> In my opinion, the email module should never raise an exception as a
> result of working with a malformed message.

[Oleg Broytman]
> I disagree. email package is not a user agent, and exceptions are
> *the* way to indicate there are problems.

We may have to agree to disagree. If the email package gives up because a message is malformed, I don't know what exactly it's for. It's certainly not for parsing what arrives in my mailbox.

> Then the calling program must catch all exceptions and process them
> in a reasonable (for this particular application) way.

Then the module's documentation would need to include a list of all exceptions that it might raise and the times that it might raise them. Otherwise the application developer is proceeding in the dark.

Regards,
Matt
|
|
Re: fixing the current email module
On Wed, Oct 07, 2009 at 03:05:35PM -0500, Matthew Dixon Cowles wrote:
> If the email package gives up
> because a message is malformed, I don't know what exactly it's for.
> It's certainly not for parsing what arrives in my mailbox.

Then it is *your* task to enhance the code. A flow of patches with tests would be the best contribution.

> > Then the calling program must catch all exceptions and process them
> > in a reasonable (for this particular application) way.
>
> Then the module's documentation would need to include a list of all
> exceptions that it might raise and the times that it might raise
> them.

You are also welcome to provide patches for documentation.

Oleg.
--
Oleg Broytman http://phd.pp.ru/ phd@...
Programmers don't die, they just GOSUB without RETURN.
|
|
Re: fixing the current email module
On Oct 3, 2009, at 11:41 AM, Stephen J. Turnbull wrote:
> Barry Warsaw writes:
>
>> So the basic model is: accept strings or bytes at the edges,
>> process everything internally as bytes, output strings and bytes at
>> the edges.
>
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
>
> Practically, you are going to need to say how a bytes or bytearray is to
> be interpreted as a string, and that is going to be one big mess.
> (MIME?)
>
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.

[...] package to Python 3, once using bytes internally and another time using strings internally. Neither one was completely satisfying (to say the least). I've also heard convincing arguments from folks in the Python community in both camps: "using anything other than strings internally is insane; no, using anything other than bytes internally is insane."

As for the internal representational format, I'll amend my previous statement and say that I'll keep an open mind, but one thing that seems very clear is that we have to be able to accept strings and bytes at the incoming edges, and produce strings and bytes at the outgoing edges.

In a future message, Stephen outlines some excellent use cases, to which I'll follow up when I get there. But I think he generally hits the nail on the head and proves that we'll have both types at the edges. That makes for very interesting API design!

There's "internal" and then there's the low-level representation that the model exposes. Here I have more confidence that we need to make things much more consistent. The trick is to do that while still making things convenient. For example, we currently represent header values as 8-bit strings or Header instances. The latter can contain triples of the individual chunks, e.g. (content, language, charset).
I think we need to represent header values as instances in all cases because the type checking is error prone, but even then, it makes for difficult API choices. Still, if the fundamental atom of header values in the model is the Header, and we define both byte and string APIs for headers, then the internal representation matters less since only the email package implementers need to care.

But note that even in this limited case, neither bytes nor strings really works. The internal representation is that triple (and in the current model an implicit triple where charset=us-ascii). So internally the charset is carried along for the ride, as it must be. If the internal representation were just strings or bytes, we wouldn't know how to generate the other format, at least not idempotently (or as close as we can get).

Just to ramble a little longer, it's been argued that we should give up on idempotency, but I'm not convinced. I think people want to see an email message they throw into the system come out the other end as closely as possible (well, /exactly/ for well-formed messages).

-Barry
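Barry's two points, both types at the edges and a charset carried alongside each header chunk, can be seen in the API that current Python 3 ended up with. The sketch below is only an illustration with made-up sample data; it is not the design under discussion in this 2009 thread, and note that `decode_header` yields (bytes, charset) pairs rather than the triples mentioned above.

```python
import email
from email.header import Header, decode_header, make_header

# Edges: the same simple message can enter and leave as bytes or as str.
raw_bytes = b"Subject: hi\n\nbody\n"
msg_from_bytes = email.message_from_bytes(raw_bytes)
msg_from_str = email.message_from_string(raw_bytes.decode("ascii"))

bytes_out = msg_from_bytes.as_bytes()   # wire format out
str_out = msg_from_str.as_string()      # text out

# Header values: the charset travels with the content.
header = Header("Grüße", charset="utf-8")
wire = header.encode()                  # RFC 2047 encoded-word, ASCII-only str
chunks = decode_header(wire)            # list of (bytes, charset) pairs
text = str(make_header(chunks))         # back to the original text
```

For this well-formed input both round trips reproduce what went in, which is the idempotency Barry wants; without the charset stored next to each chunk there would be no way to regenerate the wire form from the text.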