Re: fixing the current email module

by Timothy Farrell-2

Back in June, David Murray posted the message below about fixing the email module.  I have an interest in helping with this due to a personal project I'm working on.  However, my ability to help is severely limited by my understanding of email and MIME RFCs.

David asked whether passing strings to the feedparser is a needed behavior.  I don't claim to have enough knowledge to answer yes or no, but I would urge us all to consider that, if no answer shows up, David's patch should be put in for no better reason than that it's better than what we currently have.

David, if you would send it to me, I might be able to fix up some of the test cases.

Thanks,
-tim


--------
So, designing a new interface is one thing.  Making the current
interface usable in py3k is another.  I presume that the latter
is desirable?

I'm porting a small application that uses the email module to py3k.
I've run into two problems, one of which was already reported, the other
of which was not:

     http://bugs.python.org/issue4661
     http://bugs.python.org/issue6302

(Then there are all the string issues relating to email and unicode
organized under Issue1685453, but I'm going to ignore those for the
moment.)

I'd like to try fixing these, but there are design issues involved.
The fundamental one is, what format should 'message' be handling message
data in?  4661 addresses this obliquely, and we've talked about this
somewhat at the higher design level.  But the question before me is,
how to fix feedparser, message, and decode_header so that I can actually
parse a message and display it correctly.

I need to be able to feed bytes to feedparser, that much is clear.
I've implemented a proof-of-concept fix that has feedparser handle all
its input as bytes, has message decode headers and values using the
ASCII codec if handed bytes, and has decode_header expect strings and
consistently return bytes.
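
For readers following along, here is a minimal sketch of the bytes-in flow described above.  It is not the proof-of-concept patch itself.  It assumes a later Python 3 (3.2 or newer), where email.feedparser.BytesFeedParser exists; the sample message, the 16-byte chunk size and the helper names are made up for illustration.

```python
from email.feedparser import BytesFeedParser
from email.header import decode_header

# Hypothetical wire data, as it might arrive from a socket or spool file.
RAW = (
    b"From: sender@example.com\r\n"
    b"Subject: =?iso-8859-1?q?caf=E9?=\r\n"
    b"\r\n"
    b"Just a test body.\r\n"
)


def parse_in_chunks(data: bytes, chunk_size: int = 16):
    """Feed bytes to the parser piece by piece and return the message."""
    parser = BytesFeedParser()
    for start in range(0, len(data), chunk_size):
        parser.feed(data[start:start + chunk_size])
    return parser.close()


def decode_subject(msg) -> str:
    """Turn an RFC 2047 encoded Subject header into a str."""
    pieces = []
    for value, charset in decode_header(msg["Subject"]):
        # decode_header yields bytes for encoded words, str otherwise.
        if isinstance(value, bytes):
            value = value.decode(charset or "ascii")
        pieces.append(value)
    return "".join(pieces)


msg = parse_in_chunks(RAW)
subject = decode_subject(msg)
```

The parser never sees a str here; decoding to text happens only when a header is about to be displayed.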

With this fix in place my application works.  But of course, the
email module tests do not pass, and I don't know what other use
cases I have broken.

My specific question, as posted in issue4661, is: is there any
use case for passing strings to feedparser that is not a design
error waiting to trap the programmer?

--David
_______________________________________________
Email-SIG mailing list
Email-SIG@...
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com

Re: fixing the current email module

by Barry Warsaw

On Oct 3, 2009, at 9:26 AM, Timothy Farrell wrote:

> Back in June, David Murray posted the message below about fixing the  
> email module.  I have an interest in helping with this due to a  
> personal project I'm working on.  However, my ability to help is  
> severely limited by my understanding of email and MIME RFCs.
>
> David asked the question of whether or not passing strings to the  
> feedparser is a needed behavior.  I don't claim to have enough  
> knowledge to answer the question yes or no, but I would urge us all  
> to consider that if no answer shows up that David's patch should be  
> put in for no better reason than that it's better than what we  
> currently have.
I expect RDM to have some follow ups soon, but I'll put this forward  
in the meantime.

I firmly believe we need parallel feedparser APIs, one for feeding it  
strings and one for feeding it bytes.  In all the tentative attempts  
at Python3-ification I've done I just keep coming back to that  
assessment.  I don't think it's a terrible burden either since I also  
firmly believe that /internally/ the email package should be bytes-
oriented.  So the basic model is: accept strings or bytes at the  
edges, process everything internally as bytes, output strings and  
bytes at the edges.
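
As an illustration of "strings or bytes at the edges", here is a small sketch using entry points that later Python 3 releases provide (email.message_from_bytes, and Message.as_bytes from 3.4 on).  It shows only the edges and says nothing about how the internals are represented; the sample message is invented.

```python
import email

# Hypothetical message, using bare LF line endings for brevity.
WIRE = b"From: a@example.com\nSubject: hello\n\nbody text\n"

# Edge in: the same message accepted either as bytes or as str.
from_bytes = email.message_from_bytes(WIRE)
from_text = email.message_from_string(WIRE.decode("ascii"))

# Edge out: the same message rendered either as str or as bytes.
as_text = from_bytes.as_string()
as_wire = from_bytes.as_bytes()
```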

-Barry




Re: fixing the current email module

by Stephen J. Turnbull

Barry Warsaw writes:

 > So the basic model is: accept strings or bytes at the edges,
 > process everything internally as bytes, output strings and bytes at
 > the edges.

In a certain pedantic sense, that can't be right, because bytes alone
can't represent strings.

Practically, you are going to need to say how a bytes or bytearray is to
be interpreted as a string, and that is going to be one big mess.
(MIME?)

Going the other way around you have no such problem, or rather the
trivial embedding works fine, except that you have to do a range check
at some point before you convert to bytes.
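
A minimal sketch of that trivial embedding, assuming it means mapping each byte value 0-255 to the code point with the same number (which is what the latin-1 codec does).  The function names are hypothetical.

```python
def embed_bytes(data: bytes) -> str:
    """Map each byte 0-255 to the code point with the same number."""
    return data.decode("latin-1")


def extract_bytes(text: str) -> bytes:
    """Inverse mapping, with the explicit range check done first."""
    for index, ch in enumerate(text):
        if ord(ch) > 255:
            raise ValueError(
                f"code point {ord(ch)} at index {index} does not fit in a byte"
            )
    return text.encode("latin-1")


every_byte = bytes(range(256))
round_tripped = extract_bytes(embed_bytes(every_byte))
```

Every byte string survives the round trip; only strings containing code points above 255 are rejected on the way back.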


Re: fixing the current email module

by Timothy Farrell-2

I agree with Barry insofar as accepting bytes or strings on the input with internal processing in bytes and output bytes or strings depending on the content parsed.

Forgive my ignorance...why does converting bytes to strings have to be a mess?  Rather than having two Feedparsers, can't we just pass a default encoding when instantiating a feedparser and have it read from the MIME headers otherwise?  If no encoding is passed and one can't be determined, simply output as bytes or try a default and raise an exception if it fails.

If providing the default encoding, no such range check is needed.
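
A rough sketch of that idea, applied at the point where a text part is read rather than inside the feedparser.  It assumes a later Python 3 with email.message_from_bytes; text_of and default_encoding are hypothetical names, and the sample messages are invented.

```python
import email


def text_of(msg, default_encoding: str = "ascii") -> str:
    """Decode a text part using its declared charset, else a default.

    Raises UnicodeDecodeError if the bytes do not fit the chosen charset.
    """
    charset = msg.get_content_charset() or default_encoding
    raw = msg.get_payload(decode=True)  # undoes base64/quoted-printable
    return raw.decode(charset)


# A part that declares its charset in the MIME headers.
declared = email.message_from_bytes(
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xc3\xa9\n"
)

# A part with no charset information at all.
undeclared = email.message_from_bytes(b"Subject: x\n\nplain text\n")

declared_text = text_of(declared)
undeclared_text = text_of(undeclared)
```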



Re: fixing the current email module

by Glenn Linderman-3

On approximately 10/3/2009 10:09 AM, came the following characters from
the keyboard of Timothy Farrell:

> I agree with Barry insofar as accepting bytes or strings on the input with internal processing in bytes and output bytes or strings depending on the content parsed.
>
> Forgive my ignorance...why does converting bytes to strings have to be a mess?  Rather than having two Feedparsers, can't we just pass a default encoding when instantiating a feedparser and have it read from the MIME headers otherwise?  If no encoding is passed and one can't be determined, simply output as bytes or try a default and raise an exception if it fails.
>
> If providing the default encoding, no such range check is needed.
>
> ----- Original Message -----
> From: "Stephen J. Turnbull" <stephen@...>
> To: "Barry Warsaw" <barry@...>
> Cc: "Timothy Farrell" <tfarrell@...>, email-sig@...
> Sent: Saturday, October 3, 2009 10:41:48 AM GMT -06:00 US/Canada Central
> Subject: Re: [Email-SIG] fixing the current email module
>
> Barry Warsaw writes:
>
>  > So the basic model is: accept strings or bytes at the edges,
>  > process everything internally as bytes, output strings and bytes at
>  > the edges.
>
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
>
> Practically, you are going to need to say how a bytes or bytearray is to
> be interpreted as a string, and that is going to be one big mess.
> (MIME?)
>
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.

Email messages are bytes.  Usually restricted to bytes in the range
32-127, but sometimes permitted to be 0-255 (8bit encoding).

Email messages carry sufficient information to convert bytes to strings
(usually; and sufficient defaults to cover the other cases adequately,
even if not with 100% certainty).

So if Barry is considering that the internal form is bytes, particularly
bytes encoded via email RFCs, then I can't argue with that being a
reasonable internal form.... except for one problem, 2 paragraphs below.

The only mess that I can see Stephen referring to is the fact that the
email RFCs define rather messy encoding formats and character set
specifications.  There isn't much cure for this, AFAICS, other than
perhaps keeping the bytes in segmented structures, with cached metadata
to speed repeated references.  Using any other format than email format,
means knowing how to translate that format to/from email format, and
to/from API format... this means coding two translation routines instead
of one.

The choice of email RFC byte formats for the internal form makes it
quick and easy to produce a complete message when called for, and to
defer interpretation when a message is fed in.... sometimes, and herein
lies the catch....

One problem with storing messages in bytes format: it seems to me that
the choice of which of several legal email bytes formats to represent
various email parts (texts and attachments) is problematical for using
email format bytes as the internal storage format.  An unsophisticated
email library could assume that the transfer encoding is always 7bit,
and that should be acceptable in all circumstances.  A more
sophisticated email library would provide support for either 7bit or
8bit transfer encodings.... but the choice of the bytes formats, and
MIME type encodings of various message parts to support that difference
would be significant.  It seems that the present email lib provides a
way to create only a 7bit or 8bit message (and apparently not binary
encoding), meaning that the whole message assembly process has to be
done after initiating a connection with the SMTP server, to determine
whether it supports 8bit (or binary) encoding or not.  A more abstract
internal format could defer that choice to the generate step, keeping
items as str or binary blobs prior to that step.

IIUC, 7bit requires that text and binary be encoded to remove
"difficult" byte values from the byte stream, so choosing quopri or
base64 is appropriate at MIME part definition time to make that choice
(although an optimal sized choice could be made based on the data), in
the event that generate requests 7bit.

However, 8bit has no such requirement; it declares that there are no
difficult characters except NUL, CR and LF.  However, because no 8bit
encodings are defined, the (inefficient, 7-bit) quopri or base64 may
still have to be used to avoid lines that are too long, and to encode
NUL, CR and LF.  8bit and UTF-8 text containing no NUL characters and
no long lines would qualify without encoding.

Finally, binary declares that there are no difficult characters at all.  
Therefore, the quopri or base64 choice could be ignored, and the raw
data passed through.
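
The three cases above can be sketched as a small decision function.  This is an illustration under assumptions, not anything the email package provides: choose_cte, line_safe and their parameters are hypothetical names, and the 998-octet line limit is the one RFC 2045 gives for 7bit and 8bit data.

```python
MAX_LINE = 998  # octets per line, not counting the trailing CRLF


def line_safe(payload: bytes) -> bool:
    """True if there is no NUL, no bare CR or LF, and no overlong line."""
    if b"\x00" in payload:
        return False
    for line in payload.split(b"\r\n"):
        if b"\r" in line or b"\n" in line or len(line) > MAX_LINE:
            return False
    return True


def choose_cte(payload: bytes, transport: str = "7bit",
               preferred: str = "base64") -> str:
    """Pick a Content-Transfer-Encoding value for one MIME part.

    transport is what the server accepts: "7bit", "8bit" or "binary".
    preferred is the caller's choice of "base64" or "quoted-printable"
    for parts that must be encoded.
    """
    if line_safe(payload):
        if all(byte < 128 for byte in payload):
            return "7bit"
        if transport in ("8bit", "binary"):
            return "8bit"
    elif transport == "binary":
        return "binary"
    return preferred
```

For example, UTF-8 text with short CRLF-terminated lines comes out as 8bit when the transport allows it and as the preferred encoding otherwise, while data containing NUL passes through untouched only on a binary transport.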

Choosing a particular Content-Transfer-Encoding as the internal storage
format forces transcoding to the other Content-Transfer-Encoding values
on the fly after connecting to the SMTP server (using an apparently
non-existent parameter to the generate method); not supporting
on-the-fly transcoding would force the user to choose a particular
Content-Transfer-Encoding up front, requiring the connection to the
SMTP server even earlier in the process.

I observe that most of my SMTP providers do not support binary
transport, but it seems that MS Exchange does.

I observe that binary transport is more efficient than 7bit or 8bit.

I observe that even with binary transport, the MIME headers must still
be in US-ASCII, by definition, so the headers need not be generated
differently for different transports... only the
Content-Transfer-Encoding, and the content itself, would be affected by
deferring that choice to generate time.

Perhaps binary transport, with meta-data indicating whether the user
prefers quopri or base64 for parts that must be encoded for 7bit or 8bit
transport, would be an appropriate storage format for the email
library.  This would allow the quopri or base64 encodings to be
performed on-the-fly, only if needed, by adding a new parameter to
generate, that specifies the Content-Transfer-Encoding (which should
default to 7bit for maximal server compatibility, or 8bit if the user
specified that along the way so that backwards compatibility is preserved).
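
A sketch of what deferring the choice to generate time might look like.  StoredPart and its generate method are hypothetical and are not the email package's Generator API; for brevity the sketch always encodes for any non-binary transport instead of checking whether the data would already be legal as 7bit or 8bit.

```python
import base64
import quopri
from dataclasses import dataclass


@dataclass
class StoredPart:
    """One MIME part kept as raw bytes until a message is generated."""
    raw: bytes
    preferred: str = "base64"  # or "quoted-printable"

    def generate(self, transport: str = "7bit"):
        """Return (Content-Transfer-Encoding value, body bytes)."""
        if transport == "binary":
            # Nothing is "difficult" on a binary transport: pass through.
            return "binary", self.raw
        if self.preferred == "quoted-printable":
            return "quoted-printable", quopri.encodestring(self.raw)
        return "base64", base64.encodebytes(self.raw)


image = StoredPart(b"\x00\x01\xff binary blob")
note = StoredPart("caf\u00e9\n".encode("utf-8"), preferred="quoted-printable")
```

The stored bytes never change; only the output of generate depends on what the server turns out to support.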


N.B.  I note that the documentation for the MIMEText class in 2.6.3,
section 19.1.3 (reproduced below), is confusing:

class email.mime.text.MIMEText(_text[, _subtype[, _charset]])
<http://docs.python.org/library/email.mime.html#email.mime.text.MIMEText>

    Module: email.mime.text

    A subclass of MIMENonMultipart, the MIMEText class is used to create
    MIME objects of major type text. _text is the string for the payload.
    _subtype is the minor type and defaults to plain. _charset is the
    character set of the text and is passed as a parameter to the
    MIMENonMultipart constructor; it defaults to us-ascii. No guessing or
    encoding is performed on the text data.

    Changed in version 2.4: The previously deprecated _encoding argument
    has been removed. Encoding happens implicitly based on the _charset
    argument.


The confusion is that it states there is no encoding performed, and then
it states that encoding is implicit.  It is not clear what it actually
does, if anything.  The 3.2a0 documentation further muddies the water by
removing the last paragraph.
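
One way to see what the class actually does is to build two parts and inspect their headers.  The sketch below assumes a current Python 3; the 2.6.3 behaviour discussed above may differ.  As far as I can tell, on Python 3 the charset argument does drive an implicit transfer encoding of the payload.

```python
from email.mime.text import MIMEText

# Default charset versus an explicit non-ASCII charset.
ascii_part = MIMEText("plain ASCII text")
utf8_part = MIMEText("caf\u00e9", "plain", "utf-8")

for part in (ascii_part, utf8_part):
    # Show the headers that the constructor filled in.
    print(part["Content-Type"], "|", part["Content-Transfer-Encoding"])

# The stored payload is already transfer-encoded; decode=True undoes that.
original_text = utf8_part.get_payload(decode=True).decode("utf-8")
```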


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: fixing the current email module

by Stephen J. Turnbull

In the following I use Python 3 terminology: strings are Python
Unicode objects, and bytes are Python bytes objects.

Glenn Linderman writes:

 > Email messages are bytes.  Usually restricted to bytes in the range
 > 32-127, but sometimes permitted to be 0-255 (8bit encoding).

This is irrelevant to our internal representation.  It is both trivial
and efficient to convert the wire format (bytes) to a string
internally (at least for email messages up to say 5MB).

Which internal representation makes the most sense depends on what we
are going to do with that internal representation.  At this point I'm
not sure that strings are better than bytes, but I'm quite sure that
I've seen no convincing argument that bytes are TOOWTDI.

Nor is it at all obvious to me that messages should be stored in wire format.

 > Using any other format than email format, means knowing how to
 > translate that format to/from email format, and to/from API
 > format... this means coding two translation routines instead of
 > one.

That sounds reasonable, but it's a false economy.  The formats you're
talking about here are the transfer encodings, and we need to be able
to decode all of them, and produce all of them.  Internally, they can
be represented by a single format, so you need internal-to-transfer
and transfer-to-internal for about six of them (7bit, 8bit, binary ==
Python bytes, BASE64, quoted-printable, Python string)

As for runtime economy, if conversion is done once at parse time and
once at generate time it is not a big burden, not as compared to the
overhead of the Python language itself.
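
The "one internal format, one converter pair per transfer encoding" arrangement can be sketched as a lookup table.  The names below are hypothetical, and the sketch assumes bytes as the single internal format for binary data; converting text parts to Python strings would be a further charset step on top of this.

```python
import base64
import quopri


def _identity(data: bytes) -> bytes:
    return data


# Content-Transfer-Encoding -> (wire-to-internal, internal-to-wire)
CODECS = {
    "7bit": (_identity, _identity),
    "8bit": (_identity, _identity),
    "binary": (_identity, _identity),
    "base64": (base64.decodebytes, base64.encodebytes),
    "quoted-printable": (quopri.decodestring, quopri.encodestring),
}


def to_internal(wire: bytes, cte: str) -> bytes:
    """Decode once, at parse time."""
    return CODECS[cte.lower()][0](wire)


def to_wire(internal: bytes, cte: str) -> bytes:
    """Encode once, at generate time."""
    return CODECS[cte.lower()][1](internal)


def transcode(wire: bytes, source_cte: str, target_cte: str) -> bytes:
    """Go from any transfer encoding to any other via the internal form."""
    return to_wire(to_internal(wire, source_cte), target_cte)
```

With N encodings this needs N converter pairs instead of a separate routine for every source/target combination.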

 > The choice of email RFC byte formats

By "byte format", do you mean "wire format"?

 > for the internal form makes it quick and easy to produce a complete
 > message when called for,

Only for certain kinds of messages, such as automated forwards and
signed MIME parts, and cron's messages.  For those, there are great
advantages to spewing things verbatim as you got them off the wire or
the disk.  But even there, as long as we use the natural embedding of
bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not
particularly inefficient to use strings.

For anything else, storing in wire format is going to require checking
format (of the stored data if the format is variable, and always of
the requesting API) on all attribute accesses, and conversions on
many, even most attribute accesses.

 > One problem with storing messages in bytes format: it seems to me that
 > the choice of which of several legal email bytes formats

None of them are very happy.  The email module needs to be able to
both read and produce all of 7bit, 8bit, and binary, and they are in
fact pretty well trivial to do.

So the question to me is "what are the primary use cases for the email
module, and how do they affect the choice of internal representation?"
I can't claim special expertise on "how", I'll leave that up to
Barry.  Here are some use cases I can think of.

1.  Debugging programs using the email module.  Maybe that's a +1 for
    internally storing textual data in string form.

2.  MUA #1: Composition.  Input will be strings and multimedia file
    names, output will be bytes.  Will attributes of message objects
    be manipulated?  Not in a conventional MUA, but an email-based MUA
    might find uses for that.

3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
    data).  Could be strings, though, depending on the internal format
    of folders.  Output will be strings and multimedia objects.  Lots
    of string processing, especially generating folder directory
    displays from message headers.

4.  Mailing list processor.  Message input will be bytes.
    Configuration input, including heading and footer texts that may
    be added are likely to be strings.  Header manipulation (adding
    topics, sequence numbers, RFC 2369 headers) most conveniently done
    with strings.  Output will be bytes.

5.  Mailing list archiver.  Input will be bytes or message objects,
    output will be strings (typically HTML documents or XML
    fragments).

6.  Spam/virus detection.  Input may be bytes or message objects.
    Lots of internal string processing; in most cases the text/* parts
    need to be converted to strings before grepping; in some cases
    even images or executables may be reconstituted to look for
    malware signatures.  Output may be a flag or signal, or the
    message itself may be edited (typically to provide headers
    recording degree of spamminess, trace headers, maybe a body
    heading; in some cases, a new message may be generated with the
    suspected spam as a message/rfc822 MIME body part).
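
As a concrete instance of the string processing in use case 3, here is a sketch that turns wire-format bytes into one line of a folder listing.  It assumes a later Python 3 with email.message_from_bytes; display_header and index_line are hypothetical helpers and the sample message is invented.

```python
import email
from email.header import decode_header, make_header


def display_header(msg, name: str) -> str:
    """Return a header as display text, decoding RFC 2047 encoded words."""
    raw_value = msg.get(name, "")
    return str(make_header(decode_header(raw_value)))


def index_line(wire: bytes) -> str:
    """Build one folder-listing line from a message in wire format."""
    msg = email.message_from_bytes(wire)
    return f"{display_header(msg, 'From')} | {display_header(msg, 'Subject')}"


WIRE = (
    b"From: a@example.com\n"
    b"Subject: =?utf-8?b?Y2Fmw6k=?=\n"
    b"\n"
    b"body\n"
)
line = index_line(WIRE)
```

Input is bytes and output is a str, which is the shape the use case describes.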


Re: fixing the current email module

by Glenn Linderman-3

On approximately 10/6/2009 7:18 AM, came the following characters from
the keyboard of Stephen J. Turnbull:

> In the following I use Python 3 terminology: strings are Python
> Unicode objects, and bytes are Python bytes objects.
>
> Glenn Linderman writes:
>
>  > Email messages are bytes.  Usually restricted to bytes in the range
>  > 32-127, but sometimes permitted to be 0-255 (8bit encoding).
>
> This is irrelevant to our internal representation.  It is both trivial
> and efficient to convert the wire format (bytes) to a string
> internally (at least for email messages up to say 5MB).
>
> Which internal representation makes the most sense depends on what we
> are going to do with that internal representation.  At this point I'm
> not sure that strings are better than bytes, but I'm quite sure that
> I've seen no convincing argument that bytes are TOOWTDI.
>
> Nor is it at all obvious to me that messages should be stored in wire format.
>  

Yes, I interpreted, possibly misinterpreted, Barry's comment about
storing things as bytes, as that he was figuring to store them in wire
format.


>  > Using any other format than email format, means knowing how to
>  > translate that format to/from email format, and to/from API
>  > format... this means coding two translation routines instead of
>  > one.
>
> That sounds reasonable, but it's a false economy.  

And this was actually the point I was trying to make.


> The formats you're
> talking about here are the transfer encodings, and we need to be able
> to decode all of them, and produce all of them.  Internally, they can
> be represented by a single format, so you need internal-to-transfer
> and transfer-to-internal for about six of them (7bit, 8bit, binary ==
> Python bytes, BASE64, quoted-printable, Python string)
>  

Not all formats apply to all MIME types, but I think you've enumerated
the list.

> As for runtime economy, if conversion is done once at parse time and
> once at generate time it is not a big burden, not as compared to the
> overhead of the Python language itself.
>  

I would tend to agree with that, except that if something is
received/provided in a particular format, it might want to stay in that
format until such time it is needed in a different format... and then
the appropriate set of conversions (current format => internal format =>
needed format) applied as needed, avoiding all conversions when it is
already in the needed format.

>  > The choice of email RFC byte formats
>
> By "byte format", do you mean "wire format"?
>  

Sure, RFC byte formats == wire format.

>  > for the internal form makes it quick and easy to produce a complete
>  > message when called for,
>
> Only for certain kinds of messages, such as automated forwards and
> signed MIME parts, and cron's messages.  For those, there are great
> advantages to spewing things verbatim as you got them off the wire or
> the disk.  But even there, as long as we use the natural embedding of
> bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not
> particularly inefficient to use strings.
>  

Two conversions are slower than none, and use 2-4 times the space in
string format.

> For anything else, storing in wire format is going to require checking
> format (of the stored data if the format is variable, and always of
> the requesting API) on all attribute accesses, and conversions on
> many, even most attribute accesses.
>  

One has to write the conversion code anyway; it is just a matter of
where it is called.  Once converted, meta data could be retained in its
natural format.

>  > One problem with storing messages in bytes format: it seems to me that
>  > the choice of which of several legal email bytes formats
>
> None of them are very happy.  The email module needs to be able to
> both read and produce all of 7bit, 8bit, and binary, and they are in
> fact pretty well trivial to do.
>
> So the question to me is "what are the primary use cases for the email
> module, and how do they affect the choice of internal representation?"
> I can't claim special expertise on "how", I'll leave that up to
> Barry.  Here are some use cases I can think of.
>  

Yes this is a good question.


> 1.  Debugging programs using the email module.  Maybe that's a +1 for
>     internally storing textual data in string form.
>
> 2.  MUA #1: Composition.  Input will be strings and multimedia file
>     names, output will be bytes.  Will attributes of message objects
>     be manipulated?  Not in a conventional MUA, but an email-based MUA
>     might find uses for that.
>  

I'm not sure what an email-based MUA is.... seems to me even a
conventional MUA is "email-based"???


> 3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
>     data).  Could be strings, though, depending on the internal format
>     of folders.  Output will be strings and multimedia objects.  Lots
>     of string processing, especially generating folder directory
>     displays from message headers.
>
> 4.  Mailing list processor.  Message input will be bytes.
>     Configuration input, including heading and footer texts that may
>     be added are likely to be strings.  Header manipulation (adding
>     topics, sequence numbers, RFC 2369 headers) most conveniently done
>     with strings.  Output will be bytes.
>  

But the bulk of the message parts, received in wire format, may not need
to be altered to be sent along in the same wire format.  Headers must be
manipulated somehow, I'd think it would be convenient as strings too.  
Heading and footing texts are configured boilerplate, and could be
cached in a variety of formats to avoid the need to convert them for
each message, and could then be obtained from the cache in the
appropriate format for this particular message, and prepended or
appended as appropriate.

> 5.  Mailing list archiver.  Input will be bytes or message objects,
>     output will be strings (typically HTML documents or XML
>     fragments).
>  

An archiver could archive wire format, and do the conversions to *ML on
the fly for those messages that might be accessed that way.  Depends on
the expectation of the usage of the archiver... to retrieve the archived
messages via email, wire format could be extremely efficient; to
retrieve via HTTP, one should note that there is very little difference
between .eml format (another name for wire format) and .mhtml format
(which is a format IE and Opera will display natively, support in other
browsers varies, mostly via addons and conversion utilities).  So I'm
not at all sure that this use case requires string output, although some
implementations might prefer it.

> 6.  Spam/virus detection.  Input may be bytes or message objects.
>     Lots of internal string processing; in most cases the text/* parts
>     need to be converted to strings before grepping; in some cases
>     even images or executables may be reconstituted to look for
>     malware signatures.  Output may be a flag or signal, or the
>     message itself may be edited (typically to provide headers
>     recording degree of spamminess, trace headers, maybe a body
>     heading; in some cases, a new message may be generated with the
>     suspected spam as a message/rfc822 MIME body part).
>  


So it seems to me that storing the data in the format provided, and
converting it to native format when requested and caching that result,
and then when generating wire format, if the needed format was not
provided or cached, then converting as necessary, would be optimal to
minimize conversion (time) costs.  This technique would also maximally
preserve the original format for use cases 3 and 5, which, for use case
3, at least, seems to be important to this list from past discussion.  
To minimize memory (space) costs, the caching could be avoided (causing
reconversion costs), or, at the expense of not preserving the original
format, once converted, retain only the native format of the item (which
is generally the smallest, for binary objects, and which is most easily
manipulated, but not necessarily smallest, for text objects).

So I'd design the internal format with meta data like

MIMEpart
    formatFlag
    metaData
    7bitData
    8bitData
    binaryData
    nativeText
    nativeBLOB

where the metaData would consist of a variety of pertinent items,
obtained by decoding provided wireData or supplied along with provided
nativeData.

Generate could use 7bitData, 8bitData, or binaryData directly if it
exists, or cache it there if it didn't already exist.
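
A rough Python rendering of that structure, to make the caching behaviour concrete.  It is only a sketch under assumptions: LazyPart and its field names are hypothetical, the per-format slots are collapsed into one dict keyed by transfer encoding, and only "binary" and "base64" are handled.

```python
import base64
from dataclasses import dataclass, field
from typing import Dict, Optional

_DECODERS = {"binary": lambda data: data, "base64": base64.decodebytes}
_ENCODERS = {"binary": lambda data: data, "base64": base64.encodebytes}


@dataclass
class LazyPart:
    meta: Dict[str, str] = field(default_factory=dict)    # metaData
    wire: Dict[str, bytes] = field(default_factory=dict)  # cached wire forms
    native: Optional[bytes] = None                         # nativeBLOB

    def get_native(self) -> bytes:
        """Decode from a stored wire form on first request, then cache."""
        if self.native is None:
            if not self.wire:
                raise ValueError("part holds no data")
            cte, data = next(iter(self.wire.items()))
            self.native = _DECODERS[cte](data)
        return self.native

    def get_wire(self, cte: str) -> bytes:
        """Return a wire form, converting only if it is not already held."""
        if cte not in self.wire:
            self.wire[cte] = _ENCODERS[cte](self.get_native())
        return self.wire[cte]


part = LazyPart(meta={"Content-Type": "application/octet-stream"},
                wire={"base64": b"aGVsbG8=\n"})
```

Asking for the base64 form of this part costs nothing; asking for the binary form triggers one decode, and the result is kept for later calls.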

binaryData would differ from nativeBLOB only by containing the
appropriate MIMEheaders... perhaps as a space optimization, it would
contain only the appropriate MIMEheaders, with the binaryData being
placed in nativeBLOB directly (since this is not a costly conversion,
just a choice of where to store the bytes).

It could also be possible that a complete, provided, wire format message
would be retained as a single BLOB, and the appropriate format data
items simply be offsets and lengths within that BLOB, although with
cached metaData.
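
The single-BLOB variant could look roughly like this.  Span, index_message and view are hypothetical names, and the sketch only separates the header block from the body of a message with CRLF line endings; a real index would also need an entry per MIME part.

```python
from typing import Dict, NamedTuple


class Span(NamedTuple):
    offset: int
    length: int


def index_message(blob: bytes) -> Dict[str, Span]:
    """Locate the header block and the body inside blob without copying."""
    separator = blob.find(b"\r\n\r\n")
    if separator < 0:
        # No blank line: treat everything as headers.
        return {"headers": Span(0, len(blob)), "body": Span(len(blob), 0)}
    body_start = separator + 4
    return {
        "headers": Span(0, separator + 2),  # keep the final header's CRLF
        "body": Span(body_start, len(blob) - body_start),
    }


def view(blob: bytes, span: Span) -> memoryview:
    """Zero-copy window onto one indexed region of the message."""
    return memoryview(blob)[span.offset:span.offset + span.length]


BLOB = b"Subject: hi\r\n\r\nbody\r\n"
spans = index_message(BLOB)
```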

Of course, there is already a design within the existing code, and the
cost of wholesale redesign may be more than can be afforded.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: fixing the current email module

by Stephen J. Turnbull

Glenn Linderman writes:

 > Yes, I interpreted, possibly misinterpreted, Barry's comment about
 > storing things as bytes, as that he was figuring to store them in wire
 > format.

What that means is unclear, though.  Does a "header in wire format"
mean before or after MIME encoding?  Probably after, but that's pretty
useless for the purpose of editing the header.  Does it include the
tag (the part before the colon) or not?  Etc.

 > I would tend to agree with that, except that if something is
 > received/provided in a particular format, it might want to stay in that
 > format until such time it is needed in a different format... and then
 > the appropriate set of conversions (current format => internal format =>
 > needed format) applied as needed, avoiding all conversions when it is
 > already in the needed format.

If you mean that the email module will keep track of what form the
object is currently represented by, that will eventually result in
"UnicodeError: octet out of range: 161, ascii".
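
For anyone who has not run into this, the failure is easy to reproduce: octet 161 is 0xA1, which is outside ASCII.  The exact wording of the exception differs from the paraphrase above, and the sample bytes are invented.

```python
wire = b"\xa1Hola!"  # 0xA1 (161) is an inverted exclamation mark in ISO 8859-1

try:
    wire.decode("ascii")
except UnicodeDecodeError as exc:
    failure = str(exc)  # the codec's own description of the bad octet
else:
    failure = None

# The same bytes decode fine once the right charset is known.
latin1_text = wire.decode("iso-8859-1")
```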

 > two conversions are slower than none, and use 2-4 times the space in
 > string format.

Let's get this correct, *then* optimize, please.

 > One has to write the conversion code anyway; it is just a matter of
 > where it is called.  Once converted, meta data could be retained in its
 > natural format.

Meta data for what?  Why would you convert meta data?

 > > 2.  MUA #1: Composition.  Input will be strings and multimedia file
 > >     names, output will be bytes.  Will attributes of message objects
 > >     be manipulated?  Not in a conventional MUA, but an email-based MUA
 > >     might find uses for that.
 >
 > I'm not sure what an email-based MUA is.... seems to me even a
 > conventional MUA is "email-based"???

Only if it's written using the Python email module.

 > > 4.  Mailing list processor.  Message input will be bytes.
 > >     Configuration input, including heading and footer texts that may
 > >     be added are likely to be strings.  Header manipulation (adding
 > >     topics, sequence numbers, RFC 2369 headers) most conveniently done
 > >     with strings.  Output will be bytes.
 > >  
 >
 > But the bulk of the message parts, received in wire format, may not need
 > to be altered to be sent along in the same wire format.

That depends.  For example, multimedia parts may simply be discarded,
in which case it makes sense to not convert them.  However, most
Mailman lists do add a footer, and because of crappy Windows MUAs that
don't implement MIME correctly, it's preferred to add that by
concatenating as text.  That simply cannot be done correctly in wire
format for any character set except ISO 8859/1.

 > Heading and footing texts are configured boilerplate, and could be
 > cached in a variety of formats to avoid the need to convert them for
 > each message,

Premature optimization is the root of all error.

 > An archiver could archive wire format,

Are you suggesting that the email module should mandate that?  We have
a severe tail-dog inversion problem here.

_______________________________________________
Email-SIG mailing list
Email-SIG@...
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com

Re: fixing the current email module

by jalopyuser

Timothy Farrell <tfarrell@...> wrote:

> Back in June, David Murray posted the message below about fixing the
> email module.  I have an interest in helping with this due to a
> personal project I'm working on.  However, my ability to help is
> severely limited by my understanding of email and MIME RFCs.

Tim, familiarity with email and MIME RFCs would be a big help if you
want to help with the email module.  Even for writing test cases.

Bill

Re: fixing the current email module

by Glenn Linderman-3

On approximately 10/6/2009 5:30 PM, came the following characters from
the keyboard of Stephen J. Turnbull:

> Glenn Linderman writes:
>
>  > Yes, I interpreted, possibly misinterpreted, Barry's comment about
>  > storing things as bytes, as that he was figuring to store them in wire
>  > format.
>
> What that means is unclear, though.  Does a "header in wire format"
> mean before or after MIME encoding?  Probably after, but that's pretty
> useless for the purpose of editing the header.  Does it include the
> tag (the part before the colon) or not?  Etc.
>
>  > I would tend to agree with that, except that if something is
>  > received/provided in a particular format, it might want to stay in that
>  > format until such time it is needed in a different format... and then
>  > the appropriate set of conversions (current format => internal format =>
>  > needed format) applied as needed, avoiding all conversions when it is
>  > already in the needed format.
>
> If you mean that the email module will keep track of what form the
> object is currently represented by, that will eventually result in
> "UnicodeError: octet out of range: 161, ascii".
>  

The above sentence does not communicate your meaning to me... or any
meaning, actually.  Can you explain?
If conversions are avoided, then octets are unlikely to be out of
range?  And the email module must be aware of the form of the data in
order to manipulate it in any format other than wire format, but
fortunately, wire format declares the format of the data (not to say
there is not buggy wire format data -- but that is an issue best avoided
by avoiding as many conversions as possible).

>  > two conversions are slower than none, and use 2-4 times the space in
>  > string format.
>
> Let's get this correct, *then* optimize, please.
>  

That's a nice platitude... I could have used it on you when you said

> As for runtime economy, if conversion is done once at parse time and
> once at generate time it is not a big burden, not as compared to the
> overhead of the Python language itself.

but I didn't.  You can't design things totally ignoring the reality of
time and space performance, and expect to get an efficient result.  I
agree one can spend too much time on premature optimization issues, and
I have that tendency, but if you totally ignore time and space issues,
you wind up with Vista.


>  > One has to write the conversion code anyway; it is just a matter of
>  > where it is called.  Once converted, meta data could be retained in its
>  > natural format.
>
> Meta data for what?  Why would you convert meta data?
>  

Meta data for the email message... how many MIME parts, their
Content-Types, etc.  This is small amounts of data, but reasonably
likely to be referenced multiple times during the message parsing or
creation and generation process.  So once it is converted from wire
format, it should be kept in a useful format, as well as wire format.


>  > > 2.  MUA #1: Composition.  Input will be strings and multimedia file
>  > >     names, output will be bytes.  Will attributes of message objects
>  > >     be manipulated?  Not in a conventional MUA, but an email-based MUA
>  > >     might find uses for that.
>  >
>  > I'm not sure what an email-based MUA is.... seems to me even a
>  > conventional MUA is "email-based"???
>
> Only if it's written using the Python email module.
>  

Um.  Aren't we talking about use cases for the Python email module?  I
was trying to interpret what you were saying in that light.  Sure, what
a conventional (not written using the Python email module) MUA does, is
mostly irrelevant, except so far as it shows use cases that might be
applied to email-based (written using the Python email module) MUAs.


>  > > 4.  Mailing list processor.  Message input will be bytes.
>  > >     Configuration input, including heading and footer texts that may
>  > >     be added are likely to be strings.  Header manipulation (adding
>  > >     topics, sequence numbers, RFC 2369 headers) most conveniently done
>  > >     with strings.  Output will be bytes.
>  > >  
>  >
>  > But the bulk of the message parts, received in wire format, may not need
>  > to be altered to be sent along in the same wire format.
>
> That depends.  For example, multimedia parts may simply be discarded,
> in which case it makes sense to not convert them.  However, most
> Mailman lists do add a footer, and because of crappy Windows MUAs that
> don't implement MIME correctly, it's preferred to add that by
> concatenating as text.  That simply cannot be done correctly in wire
> format for any character set except ISO 8859/1.
>  

Huh?

First off, which "crappy Windows MUAs" don't implement MIME correctly,
and what do they do wrong?  When I look at wire format emails, I'm
mostly appalled by the stuff generated by Apple Mail.  I have seen a few
doozies from Outlook 2000, but they seem to be fixed in newer versions.

Adding a header or trailer does require knowledge of the character set
and encoding of the message.  Given that, you can decode to str, add the
header or trailer and encode back to MIME.  So that's the inefficient
proof of concept.
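A minimal sketch of that decode, append, re-encode proof of concept, written against the policy-based API in current Python 3 releases (which postdates this thread). The helper name `add_footer` and the sample message are invented for illustration, and the sketch assumes a single-part text/plain message whose declared charset is correct:

```python
import email
from email import policy


def add_footer(raw: bytes, footer: str) -> bytes:
    """Decode a single-part text message, append a footer, re-encode it.

    Hypothetical helper: assumes a non-multipart text/plain message whose
    declared charset is correct.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    charset = msg.get_content_charset() or "us-ascii"
    # get_content() undoes the transfer encoding and decodes to str.
    body = msg.get_content()
    # set_content() replaces the payload and the Content-* headers.
    msg.set_content(body + footer, charset=charset)
    return msg.as_bytes()


# Invented sample: quoted-printable body in ISO 8859-1.
raw = (
    b"From: list@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"caf=E9\n"
)
out = add_footer(raw, "_____\nExample list footer\n")
```

The library picks the outgoing transfer encoding itself here, so the result need not use the same encoding as the input.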

In the identity or quopri encodings, it is possible to add similarly
encoded headers and trailers correctly to text/plain parts through
normal concatenation.  Adding headers to base64 encoding requires that
the encoded header be an exact number of base64 lines, or at least a
multiple of 3 characters and that you shuffle the line layout through
the whole base64 body... it is not clear that this is worth the work.  
Adding trailers to base64 encoding requires decoding the final partial
encoding, noticing how much room is left on that last line, and then
encoding from there on... so it is not possible to cache an encoded
base64 footer, although it would be possible to cache 3 of them, and
only have to tweak the merge and choose the right one of the three and
then reshuffle.  So since text/plain is seldom encoded in base64, and
base64 is so complex to concatenate to in wire format, I'd think it
would be a better choice to decode and reencode to concatenate headers
or footers to base64 encoded MIME parts.... unless immense base64
encoded MIME parts are expected to be common enough to develop the
optimized logic.
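A small sketch of why base64 resists concatenation in wire format. It ignores line wrapping entirely (an assumption made to keep it short) and only shows the 3-octet grouping issue; the sample bodies and footer are invented:

```python
import base64

footer = b"\n_____\nExample list footer\n"

body_aligned = b"abcdef"     # 6 octets: a multiple of 3, so no padding
body_unaligned = b"abcdefg"  # 7 octets: the encoding ends in "==" padding


def concat_matches(body: bytes, trailer: bytes) -> bool:
    """True if joining the two encodings equals encoding the joined data."""
    joined_encodings = base64.b64encode(body) + base64.b64encode(trailer)
    return joined_encodings == base64.b64encode(body + trailer)


aligned_ok = concat_matches(body_aligned, footer)
unaligned_ok = concat_matches(body_unaligned, footer)
```

Only the aligned body can take a pre-encoded trailer as is; in the unaligned case the padding lands in the middle of the data and the tail has to be re-encoded.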

text/html is trickier, whether encoded or not.  You have to parse past
any stuff that precedes <body>, and place the header after that, and
then you have to find the </body> and place the trailer before that.  
And unless you run the HTML through a validity checker, you can't be
sure that the trailer will even show up, much less actually at the
bottom, due to the possibility of unclosed tags within the body.  To
parse even quopri encoded HTML gets tricky, and basically impossible for
base64 encoded HTML.  So the first text/html part likely will need to be
decoded for adding headers and trailers, if it is an alternative to the
text/plain part, or there is no text/plain part.

I've seen some systems add an additional MIME part to place a trailer
in, and that can be pretty effective for MUAs that will show multiple
parts in-line, but there are so many MUAs out there, that it is
extremely difficult to make any certain declarations regarding what the
user sees as a result.

And, ISO 8859/1 is an 8-bit character set, so would require encoding on
a 7bit transfer.  But it is not unique; if you know how to do ISO 8859/1
concatenation in wire format, then you can do the whole class of
ASCII+128 more character sets in the same manner.  Not to mention that
ASCII itself works fine in wire format.  And so does UTF-8.  It is just
a matter of matching the character set and the encoding.
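To make the "matching the character set and the encoding" point concrete, here is a sketch for the identity (8bit) transfer encoding only. The strings are invented, and a real message would also need its headers checked:

```python
# Body as received on the wire, declared as ISO 8859-1, identity encoding.
body_wire = "Gr\u00fc\u00dfe\n".encode("iso-8859-1")
footer_text = "Liste f\u00fcr alle\n"

# Footer encoded in the SAME charset: plain byte concatenation is fine.
matched = body_wire + footer_text.encode("iso-8859-1")

# Footer encoded in a DIFFERENT charset: the bytes no longer agree with
# the charset declared for the part, so a reader sees mojibake.
mismatched = body_wire + footer_text.encode("utf-8")

expected = "Gr\u00fc\u00dfe\n" + footer_text
matched_ok = matched.decode("iso-8859-1") == expected
mismatched_ok = mismatched.decode("iso-8859-1") == expected
```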


>  > Heading and footing texts are configured boilerplate, and could be
>  > cached in a variety of formats to avoid the need to convert them for
>  > each message,
>
> Premature optimization is the root of all error.
>  

Yeah, yeah.  I said "could", not "must".  I was pushing back from your
declaration that:

>     Configuration input, including heading and footer texts that may
>     be added are likely to be strings.  

Such configuration texts are likely to be provided as strings, but there is
nothing to prevent them from being converted to other formats.  
Premature optimization may or may not be the root of all error, but
discarding perfectly valid design possibilities based on how the input
might be supplied seems a similar error.  I'm not declaring which design
is best, just that there are alternatives.

>  > An archiver could archive wire format,
>
> Are you suggesting that the email module should mandate that?  We have
> a severe tail-dog inversion problem here.

Absolutely not.  I said "could", not "must".  The archiver can do what
it wants.  The email library should provide access to the message data
in all useful formats, so that the archiver can do what it wants.  The
archiver needs to choose its design and optimizations appropriate for
its expected use cases.  I was pushing back from your declaration that
an archiver would always want string output.... you said:

> 5.  Mailing list archiver.  Input will be bytes or message objects,
>     output will be strings (typically HTML documents or XML
>     fragments).

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: fixing the current email module

by Stephen J. Turnbull

Glenn Linderman writes:

 > > If you mean that the email module will keep track of what form the
 > > object is currently represented by, that will eventually result in
 > > "UnicodeError: octet out of range: 161, ascii".
 >
 > The above sentence does not communicate your meaning to me... or any
 > meaning, actually.  Can you explain?

Yes, that Unicode error is one that took years for Mailman to work
around.  If we are going to be converting different objects at
different times, I'm sure we'll get to see it again in the future.  Oh,
joy.
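For readers who have not met this error, a minimal reproduction in Python 3, where the exception is spelled UnicodeDecodeError and the message wording differs from the older text quoted above. The header bytes are invented:

```python
# A header containing octet 161 (0xA1), which is not valid ASCII.
raw_header = b"Subject: \xa1Hola!"

try:
    raw_header.decode("ascii")
    failed_octet = None
except UnicodeDecodeError as exc:
    # exc.start is the index of the first undecodable byte.
    failed_octet = raw_header[exc.start]

# One lenient alternative: substitute U+FFFD instead of raising.
lenient = raw_header.decode("ascii", errors="replace")
```

The failure only appears at the moment of conversion, which is why deferring conversions to different times moves the error around rather than removing it.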

 > If conversions are avoided, then octets are unlikely to be out of
 > range?

Haven't looked in your spam bucket recently, I guess.  Spammers
regularly put 8 bit characters into headers (and into bodies in
messages without a Content-Type header), for one thing.

 > And the email module must be aware of the form of the data in
 > order to manipulate it in any format other than wire format, but
 > fortunately, wire format declares the format of the data (not to say
 > there is not buggy wire format data -- but that is an issue best avoided
 > by avoiding as many conversions as possible).

"Best" I can't speak to; you obviously are willing to accept a much
higher error rate than I am.  "Robust" handling of buggy wire format
data means that the email module must do something sane with it before
giving it to the application.  Maybe it's reasonable to do that
lazily, and/or cache the result, but access to bogus data (that the
email module can determine is bogus or suspicious) must not be allowed
unless the client says "hit me with your best shot" explicitly.  Most
clients are simply not going to be prepared for the kind of crap I see
in /var/mail/turnbull every day.

 > I was pushing back from your declaration that an archiver would
 > always want string output

Please don't push back; we won't get anywhere.  Use cases are
*examples*, not complete specifications of all possible inputs and
outputs.  Use cases should be simple and clear cut.  If you want a
different use case, state it.  In fact in the real world, *all* of the
archivers I know of produce text formats on disk, either deleting
multimedia objects or saving them off and linking to them via URLs in
the text.  If you know of a different kind of archiver, add it as a
use case.

Re: fixing the current email module

by Oleg Broytmann-2

On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote:
> Haven't looked in your spam bucket recently, I guess.  Spammers
> regularly put 8 bit characters into headers

   Legitimate but stupid programs do this as well. Think of phpbb-like
forums written by programmers who never understood how non-ASCII text can be
put into the Subject field or filenames - they send amazingly crippled emails.

Oleg.
--
     Oleg Broytman            http://phd.pp.ru/            phd@...
           Programmers don't die, they just GOSUB without RETURN.

Re: fixing the current email module

by Stephen J. Turnbull

Oleg Broytman writes:
 > On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote:
 > > Haven't looked in your spam bucket recently, I guess.  Spammers
 > > regularly put 8 bit characters into headers
 >
 >    Legitimate but stupid programs do this as well.

Sure, but Glenn may not be subscribed to any of those.  *Everybody* is
subscribed to spam, though.

I'll-let-you-decide-what-kind-of-smiley-that-needs-ly y'rs,

Re: fixing the current email module

by Anthony Baxter-3

On Wed, Oct 7, 2009 at 11:19 PM, Stephen J. Turnbull <stephen@...> wrote:
> Oleg Broytman writes:
>  > On Wed, Oct 07, 2009 at 07:33:42PM +0900, Stephen J. Turnbull wrote:
>  > > Haven't looked in your spam bucket recently, I guess.  Spammers
>  > > regularly put 8 bit characters into headers
>  >
>  >    Legitimate but stupid programs do this as well.
>
> Sure, but Glenn may not be subscribed to any of those.  *Everybody* is
> subscribed to spam, though.
>
> I'll-let-you-decide-what-kind-of-smiley-that-needs-ly y'rs,

You'd be amazed how many MUAs shipped by major companies are broken. MS Entourage, anyone?

noone-mention-the-nested-multiparts-with-the-same-boundary-tag-on-both-levels-ly y'rs. 


Re: fixing the current email module

by Matthew Dixon Cowles

[Stephen J. Turnbull]
> Yes, that Unicode error is one that took years for Mailman to work
> around.  If we are going to be converting different objects at
> different times, I'm sure we'll get to see it again in the future.

In my opinion, the email module should never raise an exception as a
result of working with a malformed message. Though it should
certainly make the information that a message was malformed available
for the calling program to check.

That is, I think that it's extremely unlikely that the calling
program wants to blow up as a result of a malformed message. Very
probably, it wants to make what sense of the message that it can. The
number of ways in which a message can be malformed is pretty large
and just how (and when, as has been mentioned) any particular error
will cause problems for the module is really a matter that's internal
to the module. The module's user shouldn't have to say, "Over here I
have to trap UnicodeErrors and over there I have to trap IndexErrors".
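The email package has a mechanism along these lines: under its default (compat32) behaviour the parser records problems it detects in a `defects` list on the message instead of raising. A minimal sketch with an invented malformed message, a multipart with no boundary parameter:

```python
import email

# Malformed on purpose: multipart Content-Type without a boundary.
raw = (
    b"From: a@example.com\n"
    b"Content-Type: multipart/mixed\n"
    b"\n"
    b"body text\n"
)

msg = email.message_from_bytes(raw)  # does not raise for this input

# The calling program can inspect what the parser found wrong.
defect_names = [type(defect).__name__ for defect in msg.defects]
is_malformed = bool(msg.defects)
```

This covers defects the parser recognises while parsing; it is not a guarantee that no other code path in the package can raise on bad input.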

Regards,
Matt


Re: fixing the current email module

by Oleg Broytmann-2

On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote:
> In my opinion, the email module should never raise an exception as a
> result of working with a malformed message. Though it should
> certainly make the information that a message was malformed available
> for the calling program to check.

   I disagree. The email package is not a user agent, and exceptions are *the*
way to indicate there are problems.

> That is, I think that it's extremely unlikely that the calling
> program wants to blow up as a result of a malformed message.

   Then the calling program must catch all exceptions and process them in a
reasonable (for this particular application) way. But certainly the email
package must not dictate what ways are reasonable - they are too
application-specific.

> Very
> probably, it wants to make what sense of the message that it can.

   Yes, if email parses a message in some way, OK. You can help by creating
more intelligent parser(s). But if a parser stumbles upon an unparseable
block, it must raise an exception.
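For comparison, a sketch of the raise-on-problem behaviour using `policy.strict` from current Python 3 releases (an API that postdates this thread). The malformed message is invented; detected defects are raised as exceptions instead of being collected:

```python
import email
from email import errors, policy

# Malformed on purpose: multipart Content-Type without a boundary.
raw = (
    b"From: a@example.com\n"
    b"Content-Type: multipart/mixed\n"
    b"\n"
    b"body text\n"
)

try:
    email.message_from_bytes(raw, policy=policy.strict)
    raised = False
except errors.MessageDefect:
    # Defect classes share this common base, so one except clause
    # covers whatever the parser detects.
    raised = True
```

The application then decides what is reasonable in its own except block.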

Oleg.
--
     Oleg Broytman            http://phd.pp.ru/            phd@...
           Programmers don't die, they just GOSUB without RETURN.

Re: fixing the current email module

by Glenn Linderman-3

On approximately 10/7/2009 3:33 AM, came the following characters from
the keyboard of Stephen J. Turnbull:

> Glenn Linderman writes:
>
>  > > If you mean that the email module will keep track of what form the
>  > > object is currently represented by, that will eventually result in
>  > > "UnicodeError: octet out of range: 161, ascii".
>  >
>  > The above sentence does not communicate your meaning to me... or any
>  > meaning, actually.  Can you explain?
>
> Yes, that Unicode error is one that took years for Mailman to work
> around.  If we are going to be converting different objects at
> different times, I'm sure we'll get to see it again in the future.  Oh,
> joy.
>  

Ah, a historical remark!  So that's why it was lost on me, I'm new to
the Python world (but programming since 1975...)


>  > If conversions are avoided, then octets are unlikely to be out of
>  > range?
>
> Haven't looked in your spam bucket recently, I guess.  Spammers
> regularly put 8 bit characters into headers (and into bodies in
> messages without a Content-Type header), for one thing.
>  

I'm aware of that, but if conversions are not done, octets are unlikely
to be _reported_ to be out of range....


>  > And the email module must be aware of the form of the data in
>  > order to manipulate it in any format other than wire format, but
>  > fortunately, wire format declares the format of the data (not to say
>  > there is not buggy wire format data -- but that is an issue best avoided
>  > by avoiding as many conversions as possible).
>
> "Best" I can't speak to; you obviously are willing to accept a much
> higher error rate than I am.  "Robust" handling of buggy wire format
> data means that the email module must do something sane with it before
> giving it to the application.  Maybe it's reasonable to do that
> lazily, and/or cache the result, but access to bogus data (that the
> email module can determine is bogus or suspicious) must not be allowed
> unless the client says "hit me with your best shot" explicitly.  Most
> clients are simply not going to be prepared for the kind of crap I see
> in /var/mail/turnbull every day.
>  

Are you referring to most email clients, or most
Python-email-library-using clients?  It seems like most email clients
are being hit with the same stuff you are seeing... every day... and are
handling it somehow... although anti-spam filters do eliminate some of
it before the end user's MUA sees it, depending on the ISP, etc.

Is it your point of view, then, that incorrectly formed email should be
mostly treated as SPAM?  Your paragraph above could be interpreted that
way.  Oleg's point is also valid though, so it seems that isn't your
point of view.

Your "hit me with your best shot" comment indicates that you want a
failure code or exception when the data is bad, and then a way to "retry
accepting errors"?


>  > I was pushing back from your declaration that an archiver would
>  > always want string output
>
> Please don't push back; we won't get anywhere.  Use cases are
> *examples*, not complete specifications of all possible inputs and
> outputs.  Use cases should be simple and clear cut.  If you want a
> different use case, state it.  In fact in the real world, *all* of the
> archivers I know of produce text formats on disk, either deleting
> multimedia objects or saving them off and linking to them via URLs in
> the text.  If you know of a different kind of archiver, add it as a
> use case.
>  

I misunderstood the purpose of your list.  Sure, everything in your list
is a good example of real world uses.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: fixing the current email module

by Matthew Dixon Cowles

[me]
> In my opinion, the email module should never raise an exception as a
> result of working with a malformed message.

[Oleg Broytman]
> I disagree. email package is not a user agent, and exceptions are
> *the* way to indicate there are problems.

We may have to agree to disagree. If the email package gives up
because a message is malformed, I don't know what exactly it's for.
It's certainly not for parsing what arrives in my mailbox.

> Then the calling program must catch all exceptions and process them
> in a reasonable (for this particular application) way.

Then the module's documentation would need to include a list of all
exceptions that it might raise and the times that it might raise
them. Otherwise the application developer is proceeding in the dark.

Regards,
Matt


Re: fixing the current email module

by Oleg Broytmann-2

On Wed, Oct 07, 2009 at 03:05:35PM -0500, Matthew Dixon Cowles wrote:
> If the email package gives up
> because a message is malformed, I don't know what exactly it's for.
> It's certainly not for parsing what arrives in my mailbox.

   Then it is *your* task to enhance the code. A flow of patches with tests
would be the best contribution.

> > Then the calling program must catch all exceptions and process them
> > in a reasonable (for this particular application) way.
>
> Then the module's documentation would need to include a list of all
> exceptions that it might raise and the times that it might raise
> them.

   You are also welcome to provide patches for documentation.

Oleg.
--
     Oleg Broytman            http://phd.pp.ru/            phd@...
           Programmers don't die, they just GOSUB without RETURN.

Re: fixing the current email module

by Barry Warsaw

On Oct 3, 2009, at 11:41 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> So the basic model is: accept strings or bytes at the edges,
>> process everything internally as bytes, output strings and bytes at
>> the edges.
>
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
>
> Practically, you are going to need to say how a bytes or bytearray is to
> be interpreted as a string, and that is going to be one big mess.
> (MIME?)
>
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.

So, I've taken at least two abortive attempts at updating the email  
package to Python 3, once using bytes internally and another time  
using strings internally.  Neither one was completely satisfying (to  
say the least).  I've also heard convincing arguments from folks in  
the Python community in both camps: "using anything other than strings  
internally is insane; no, using anything other than bytes internally  
is insane."

As for the internal representational format, I'll amend my previous  
statement and say that I'll keep an open mind, but one thing that  
seems very clear is that we have to be able to accept strings and  
bytes at the incoming edges, and produce strings and bytes at the  
outgoing edges.  In a future message, Stephen outlines some excellent  
use cases, to which I'll follow up when I get there.  But I think he  
generally hits the nail on the head and proves that we'll have both  
types at the edges.  That makes for very interesting API design!

There's "internal" and then there's the low-level representation that  
the model exposes.  Here I have more confidence that we need to make  
things much more consistent.  The trick is to do that while still  
making things convenient.

For example, we currently represent header values as 8-bit strings or  
Header instances. The latter can contain triples of the individual  
chunks, e.g. (content, language, charset).  I think we need to represent  
header values as instances in all cases because the type checking is  
error prone, but even then, it makes for difficult API choices.  
Still, if the fundamental atom of header values in the model is the  
Header, and we define both byte and string APIs for headers, then the  
internal representation matters less since only the email package  
implementers need to care.
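A brief sketch of how the charset travels with each chunk, using the public helpers in `email.header` as they behave in Python 3. Note these helpers expose (bytes, charset) pairs, not the triples mentioned above, and the sample value is invented:

```python
from email.header import Header, decode_header, make_header

encoded = "=?iso-8859-1?q?p=F6stal?="

# Each chunk keeps its charset alongside the raw bytes.
chunks = decode_header(encoded)

# With the charset known, the string form can be produced...
as_text = str(make_header(chunks))

# ...and so can the wire form, starting from a str plus a charset.
as_wire = Header("p\u00f6stal", charset="iso-8859-1").encode()
```

Drop the charset from either direction and the other form can no longer be generated reliably, which is the point about neither bare bytes nor bare strings being enough.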

But note that even in this limited case, neither bytes nor strings  
really works.  The internal representation is that triple (and in the  
current model an implicit triple where charset=us-ascii).  So  
internally the charset is carried along for the ride, as it must be.  
If the internal representation were just strings or bytes, we wouldn't  
know how to generate the other format, at least not idempotently (or  
as close as we can get).

Just to ramble a little longer, it's been argued that we should give  
up on idempotency, but I'm not convinced.  I think people want to see  
an email message they throw into the system come out the other end as  
closely as possible (well, /exactly/ for well-formed messages).
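As a sketch of what that round trip looks like in current Python 3 releases, assuming a small, well-formed, invented message with "\n" line endings:

```python
import email

raw = (
    b"From: a@example.com\n"
    b"Subject: hello\n"
    b"\n"
    b"body line\n"
)

# Parse from bytes, then serialise again without modifying anything.
msg = email.message_from_bytes(raw)
round_tripped = msg.as_bytes()

unchanged = round_tripped == raw
```

This only demonstrates the simple case; malformed or unusually folded input is where exact reproduction gets hard.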

-Barry


