Re: [Python-Dev] Dropping bytes "support" in json


Re: [Python-Dev] Dropping bytes "support" in json

by Barry Warsaw

On Apr 9, 2009, at 11:55 AM, Daniel Stutzbach wrote:

> On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw <barry@...> wrote:
> Anyway, aside from that decision, I haven't come up with an elegant  
> way to allow /output/ in both bytes and strings (input is I think  
> theoretically easier by sniffing the arguments).
>
> Won't this work? (assuming dumps() always returns a string)
>
> def dumpb(obj, encoding='utf-8', *args, **kw):
>     s = dumps(obj, *args, **kw)
>     return s.encode(encoding)

So, what I'm really asking is this.  Let's say you agree that there  
are use cases for accessing a header value as either the raw encoded  
bytes or the decoded unicode.  What should this return:

 >>> message['Subject']

The raw bytes or the decoded unicode?

Okay, so you've picked one.  Now how do you spell the other way?

The Message class probably has these explicit methods:

 >>> Message.get_header_bytes('Subject')
 >>> Message.get_header_string('Subject')

(or better names... it's late and I'm tired ;).  One of those maps to  
message['Subject'] but which is the more obvious choice?

Now, setting headers.  Sometimes you have some unicode thing and  
sometimes you have some bytes.  You need to end up with bytes in the  
ASCII range and you'd like to leave the header value unencoded if so.  
But in both cases, you might have bytes or characters outside that  
range, so you need an explicit encoding, defaulting to utf-8 probably.

 >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
 >>> Message.set_header('Subject', b'Some bytes')

One of those maps to

 >>> message['Subject'] = ???

I'm open to any suggestions here!
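
A minimal sketch of the set_header idea above, assuming the standard library's email.header.Header does the RFC 2047 work. The helper name encode_header_value and its signature are hypothetical and not part of the email package: bytes are decoded with the caller's encoding, ASCII-only values are left unencoded, and anything else is turned into encoded-words.

```python
from email.header import Header, decode_header, make_header


def encode_header_value(value, encoding='utf-8'):
    """Return an ASCII-only wire value for a header (hypothetical helper).

    `value` may be str or bytes; bytes are decoded with `encoding` first.
    """
    if isinstance(value, bytes):
        value = value.decode(encoding)
    try:
        # Pure-ASCII values are left unencoded.
        value.encode('ascii')
        return value
    except UnicodeEncodeError:
        # Anything else becomes one or more RFC 2047 encoded-words.
        return Header(value, encoding).encode()


plain = encode_header_value('Some text')
fancy = encode_header_value('Gödel'.encode('utf-8'), encoding='utf-8')
# Decoding the wire value again should give back the original text.
round_trip = str(make_header(decode_header(fancy)))
```

This does not settle which spelling message['Subject'] = ... should map to; it only shows that one function can accept both kinds of input when the encoding is explicit.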
-Barry



_______________________________________________
Email-SIG mailing list
Email-SIG@...
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com

Re: [Python-Dev] Dropping bytes "support" in json

by Barry Warsaw

On Apr 9, 2009, at 10:52 PM, Aahz wrote:

> On Thu, Apr 09, 2009, Barry Warsaw wrote:
>>
>> So, what I'm really asking is this.  Let's say you agree that there are
>> use cases for accessing a header value as either the raw encoded bytes or
>> the decoded unicode.  What should this return:
>>
>> >>> message['Subject']
>>
>> The raw bytes or the decoded unicode?
>
> Let's make that the raw bytes by default -- we can add a parameter to
> Message() to specify that the default where possible is unicode for
> returned values, if that isn't too painful.

I don't know whether the parameter thing will work or not, but you're  
probably right that we need to get the bytes-everywhere API first.

-Barry




Re: [Python-Dev] Dropping bytes "support" in json

by Barry Warsaw

On Apr 9, 2009, at 11:21 PM, Nick Coghlan wrote:

> Barry Warsaw wrote:
>> I don't know whether the parameter thing will work or not, but you're
>> probably right that we need to get the bytes-everywhere API first.
>
> Given that json is a wire protocol, that sounds like the right approach
> for json as well. Once bytes-everywhere works, then a text API can be
> built on top of it, but it is difficult to build a bytes API on top of a
> text one.

Agreed!

> So I guess the IO library *is* the right model: bytes at the bottom of
> the stack, with text as a wrapper around it (mediated by codecs).

Yes, that's a very interesting (and proven?) model.  I don't quite see  
how we could apply that to email and json, but it seems like there's a  
good idea there. ;)
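
For reference, the layering Nick describes can be seen directly in the io module. This is only a sketch of the model (a bytes object at the bottom, a codec-backed text wrapper on top), not a proposal for how email or json should use it; the header line written here is an arbitrary example.

```python
import io

# Bytes at the bottom of the stack ...
raw = io.BytesIO()
# ... with text as a wrapper around it, mediated by a codec.
# newline='' disables newline translation so '\r\n' is written as-is.
text = io.TextIOWrapper(raw, encoding='utf-8', newline='')

text.write('Subject: Gödel\r\n')
text.flush()  # push the encoded text down to the bytes layer

wire = raw.getvalue()
```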

-Barry




Re: [Python-Dev] Dropping bytes "support" in json

by Tony Nelson

At 22:38 -0400 04/09/2009, Barry Warsaw wrote:
 ...
>So, what I'm really asking is this.  Let's say you agree that there
>are use cases for accessing a header value as either the raw encoded
>bytes or the decoded unicode.  What should this return:
>
> >>> message['Subject']
>
>The raw bytes or the decoded unicode?

That's an easy one:  Subject: is an unstructured header, so it must be
text, thus Unicode.  We're looking at a high-level representation of an
email message, with parsed header fields and a MIME message tree.


>Okay, so you've picked one.  Now how do you spell the other way?

message.get_header_bytes('Subject')

Oh, I see that's what you picked.

>The Message class probably has these explicit methods:
>
> >>> Message.get_header_bytes('Subject')
> >>> Message.get_header_string('Subject')
>
>(or better names... it's late and I'm tired ;).  One of those maps to
>message['Subject'] but which is the more obvious choice?

Structured header fields are more of a problem.  Any header with addresses
should return a list of addresses.  I think the default return type should
depend on the data type.  To get an explicit bytes or string or list of
addresses, be explicit; otherwise, for convenience, return the appropriate
type for the particular header field name.


>Now, setting headers.  Sometimes you have some unicode thing and
>sometimes you have some bytes.  You need to end up with bytes in the
>ASCII range and you'd like to leave the header value unencoded if so.
>But in both cases, you might have bytes or characters outside that
>range, so you need an explicit encoding, defaulting to utf-8 probably.

Never for header fields.  The default is always RFC 2047, unless it isn't,
say for params.

The Message class should create an object of the appropriate subclass of
Header based on the name (or use the existing object, see other
discussion), and that should inspect its argument and DTRT or complain.

>
> >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
> >>> Message.set_header('Subject', b'Some bytes')
>
>One of those maps to
>
> >>> message['Subject'] = ???

The expected data type should depend on the header field.  For Subject:, it
should be bytes to be parsed or verbatim text.  For To:, it should be a
list of addresses or bytes or text to be parsed.

The email package should be pythonic, and not require deep understanding of
dozens of RFCs to use properly.  Users don't need to know about the raw
bytes; that's the whole point of MIME and any email package.  It should be
easy to set header fields with their natural data types, and doing it with
bad data should produce an error.  This may require a bit more care in the
message parser, to always produce a parsed message with defects.
--
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson@...>
      '                              <http://www.georgeanelson.com/>

Re: [Python-Dev] Dropping bytes "support" in json

by Barry Warsaw

On Apr 10, 2009, at 1:19 AM, glyph@... wrote:

> On 02:38 am, barry@... wrote:
>> So, what I'm really asking is this.  Let's say you agree that there  
>> are use cases for accessing a header value as either the raw  
>> encoded bytes or the decoded unicode.  What should this return:
>>
>> >>> message['Subject']
>>
>> The raw bytes or the decoded unicode?
>
> My personal preference would be to just deprecate this API, and  
> get rid of it, replacing it with a slightly more explicit one.
>
>   message.headers['Subject']
>   message.bytes_headers['Subject']

This is pretty darn clever Glyph.  Stop that! :)

I'm not 100% sure I like the name .bytes_headers or that .headers  
should be the decoded header (rather than have .headers return the  
bytes thingie and say .decoded_headers return the decoded thingies),  
but I do like the general approach.

>> Now, setting headers.  Sometimes you have some unicode thing and  
>> sometimes you have some bytes.  You need to end up with bytes in  
>> the ASCII range and you'd like to leave the header value unencoded  
>> if so. But in both cases, you might have bytes or characters  
>> outside that range, so you need an explicit encoding, defaulting to  
>> utf-8 probably.
>
>   message.headers['Subject'] = 'Some text'
>
> should be equivalent to
>
>   message.headers['Subject'] = Header('Some text')

Yes, absolutely.  I think we're all in general agreement that header  
values should be instances of Header, or subclasses thereof.

> My preference would be that
>
>   message.headers['Subject'] = b'Some Bytes'
>
> would simply raise an exception.  If you've got some bytes, you  
> should instead do
>
>   message.bytes_headers['Subject'] = b'Some Bytes'
>
> or
>
>   message.headers['Subject'] = Header(bytes=b'Some Bytes',  
> encoding='utf-8')
>
> Explicit is better than implicit, right?

Yes.

Again, I really like the general idea, if I might quibble about some  
of the details.  Thanks for a great suggestion.
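
A rough sketch of Glyph's two-view idea, under loud assumptions: the class names are hypothetical, nothing here is the email package's API, and a plain UTF-8 codec stands in for the RFC 2047 handling a real implementation would need. It only illustrates two mappings sharing one bytes store, with the text view rejecting bytes.

```python
class _TextView:
    """Text-only view over a shared bytes store (hypothetical)."""

    def __init__(self, store, encoding='utf-8'):
        self._store = store
        self._encoding = encoding

    def __getitem__(self, name):
        return self._store[name.lower()].decode(self._encoding)

    def __setitem__(self, name, value):
        if not isinstance(value, str):
            raise TypeError('text headers must be str; use bytes_headers')
        self._store[name.lower()] = value.encode(self._encoding)


class _BytesView:
    """Bytes-only view over the same store (hypothetical)."""

    def __init__(self, store):
        self._store = store

    def __getitem__(self, name):
        return self._store[name.lower()]

    def __setitem__(self, name, value):
        if not isinstance(value, bytes):
            raise TypeError('bytes headers must be bytes; use headers')
        self._store[name.lower()] = value


class SketchMessage:
    def __init__(self):
        store = {}  # single source of truth: raw bytes per header name
        self.headers = _TextView(store)
        self.bytes_headers = _BytesView(store)


message = SketchMessage()
message.headers['Subject'] = 'Some text'
as_bytes = message.bytes_headers['Subject']

try:
    message.headers['Subject'] = b'Some Bytes'
    rejected = False
except TypeError:
    rejected = True
```

Swapping the names (.headers for bytes, .decoded_headers for text) would not change the structure, only which view gets the shorter spelling.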

-Barry




Re: [Python-Dev] Dropping bytes "support" in json

by Barry Warsaw

On Apr 9, 2009, at 11:41 PM, Tony Nelson wrote:

> At 22:38 -0400 04/09/2009, Barry Warsaw wrote:
> ...
>> So, what I'm really asking is this.  Let's say you agree that there
>> are use cases for accessing a header value as either the raw encoded
>> bytes or the decoded unicode.  What should this return:
>>
>> >>> message['Subject']
>>
>> The raw bytes or the decoded unicode?
>
> That's an easy one:  Subject: is an unstructured header, so it must be
> text, thus Unicode.  We're looking at a high-level representation of an
> email message, with parsed header fields and a MIME message tree.

I'm liking Glyph's suggestion here.  We'll probably have to support  
the message['Subject'] API for backward compatibility, but in that  
case it really should be a bytes API.

>> (or better names... it's late and I'm tired ;).  One of those maps to
>> message['Subject'] but which is the more obvious choice?
>
> Structured header fields are more of a problem.  Any header with addresses
> should return a list of addresses.  I think the default return type should
> depend on the data type.  To get an explicit bytes or string or list of
> addresses, be explicit; otherwise, for convenience, return the appropriate
> type for the particular header field name.

Yes, structured headers are trickier.  In a separate message, James
Knight makes some excellent points, which I agree with.  However the
email package obviously cannot support every type of structured header
possible.  It must support this through extensibility.

The obvious way is through inheritance (i.e. subclasses of Header),  
but in my experience, using inheritance of the Message class really  
doesn't work very well.  You need to pass around factories to parsing  
functions and your application tends to have its own hierarchy of  
subclasses for whatever extra things it needs.  ISTM that subclassing  
is simply not the right pattern to support extensibility in the  
Message objects or Header objects.  Yes, this leads me to think that  
all the MIME* subclasses are essentially /wrong/.

Having said all that, the email package must support structured  
headers.  Look at the insanity which is the current folding whitespace  
splitting and the impossibility of the current code to do the right  
thing for say Subject headers and Received headers, and you begin to  
see why it must be possible to extend this stuff.

>> Now, setting headers.  Sometimes you have some unicode thing and
>> sometimes you have some bytes.  You need to end up with bytes in the
>> ASCII range and you'd like to leave the header value unencoded if so.
>> But in both cases, you might have bytes or characters outside that
>> range, so you need an explicit encoding, defaulting to utf-8 probably.
>
> Never for header fields.  The default is always RFC 2047, unless it isn't,
> say for params.
>
> The Message class should create an object of the appropriate subclass of
> Header based on the name (or use the existing object, see other
> discussion), and that should inspect its argument and DTRT or complain.

>> >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
>> >>> Message.set_header('Subject', b'Some bytes')
>>
>> One of those maps to
>>
>> >>> message['Subject'] = ???
>
> The expected data type should depend on the header field.  For Subject:, it
> should be bytes to be parsed or verbatim text.  For To:, it should be a
> list of addresses or bytes or text to be parsed.

At a higher level, yes.  At the low level, it has to be bytes.

> The email package should be pythonic, and not require deep understanding of
> dozens of RFCs to use properly.  Users don't need to know about the raw
> bytes; that's the whole point of MIME and any email package.  It should be
> easy to set header fields with their natural data types, and doing it with
> bad data should produce an error.  This may require a bit more care in the
> message parser, to always produce a parsed message with defects.

I agree that we should have some higher level APIs that make it easy  
to compose email messages, and probably easy-ish to parse a byte  
stream into an email message tree.  But we can't build those without  
the lower level raw support.  I'm also convinced that this lower level  
will be the domain of those crazy enough to have the RFCs tattooed to  
the back of their eyelids.

-Barry




Re: [Python-Dev] Dropping bytes "support" in json

by Glenn Linderman

On approximately 4/10/2009 9:56 AM, came the following characters from
the keyboard of Barry Warsaw:

> On Apr 10, 2009, at 1:19 AM, glyph@... wrote:
>> On 02:38 am, barry@... wrote:
>>> So, what I'm really asking is this.  Let's say you agree that there
>>> are use cases for accessing a header value as either the raw encoded
>>> bytes or the decoded unicode.  What should this return:
>>>
>>> >>> message['Subject']
>>>
>>> The raw bytes or the decoded unicode?
>>
>> My personal preference would be to just deprecate this API, and
>> get rid of it, replacing it with a slightly more explicit one.
>>
>>   message.headers['Subject']
>>   message.bytes_headers['Subject']
>
> This is pretty darn clever Glyph.  Stop that! :)
>
> I'm not 100% sure I like the name .bytes_headers or that .headers
> should be the decoded header (rather than have .headers return the
> bytes thingie and say .decoded_headers return the decoded thingies),
> but I do like the general approach.

If one name has to be longer than the other, it should be the bytes
version.  Real user code is more likely to want to use the text version,
and hopefully there will be more of that type of code than
implementations using bytes.

Of course, one could use message.header and message.bythdr and they'd be
the same length.


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] Dropping bytes "support" in json

by Barry Warsaw

On Apr 10, 2009, at 2:00 PM, Glenn Linderman wrote:

> If one name has to be longer than the other, it should be the bytes  
> version.  Real user code is more likely to want to use the text  
> version, and hopefully there will be more of that type of code than  
> implementations using bytes.

I'm not sure we know that yet, actually.  Nothing written for Python 2  
counts, and email is too broken in 3 for any sane person to be writing  
such code for Python 3.

> Of course, one could use message.header and message.bythdr and  
> they'd be the same length.

I was trying to figure out what a 'thdr' was that we'd want to index  
'by' it. :)

-Barry




Re: [Python-Dev] Dropping bytes "support" in json

by Barry Warsaw

On Apr 10, 2009, at 2:06 PM, Michael Foord wrote:

> Shouldn't headers always be text?

/me weeps




Re: [Python-Dev] Dropping bytes "support" in json

by Barry Warsaw

On Apr 10, 2009, at 11:08 AM, James Y Knight wrote:

> Until you write a parser for every header, you simply cannot decode  
> to unicode. The only sane choices are:
> 1) raw bytes
> 2) parsed structured data

The email package does not need a parser for every header, but it  
should provide a framework that applications (or third party  
libraries) can use to extend the built-in header parsers.  A bare  
minimum for functionality requires a Content-Type parser.  I think the  
email package should also include an address header (Originator,  
Destination) parser, and a Message-ID header parser.  Possibly  
others.  The default would probably be some unstructured parser for  
headers like Subject.
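
One way such a framework could look, sketched with a plain registry rather than subclassing. Everything named here (register_parser, parse_header, the parser functions) is hypothetical; only email.utils.getaddresses is a real standard-library call. Unregistered headers fall back to an unstructured parser, as suggested above.

```python
from email.utils import getaddresses

# Registry mapping lower-cased header names to parser callables.
_PARSERS = {}


def register_parser(*names):
    """Decorator: register a parser for one or more header names."""
    def decorator(func):
        for name in names:
            _PARSERS[name.lower()] = func
        return func
    return decorator


def parse_unstructured(value):
    # Default parser: treat the value as free-form text.
    return value.strip()


@register_parser('From', 'To', 'Cc')
def parse_addresses(value):
    # getaddresses() returns a list of (realname, address) pairs.
    return getaddresses([value])


def parse_header(name, value):
    """Dispatch to the registered parser, or the unstructured default."""
    return _PARSERS.get(name.lower(), parse_unstructured)(value)


to = parse_header('To', 'Alice <alice@example.com>, bob@example.com')
subject = parse_header('Subject', ' Hello ')
```

An application or third-party library would extend this by decorating its own function with register_parser('X-Whatever'), without touching Message or Header classes.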

-Barry




Re: [Python-Dev] Dropping bytes "support" in json

by Barry Warsaw

On Apr 10, 2009, at 2:00 PM, Glenn Linderman wrote:

> If one name has to be longer than the other, it should be the bytes  
> version.  Real user code is more likely to want to use the text  
> version, and hopefully there will be more of that type of code than  
> implementations using bytes.
>
> Of course, one could use message.header and message.bythdr and  
> they'd be the same length.

Actually, thinking about this over the weekend, it's much better for  
message['subject'] to return a Header instance in all cases.  Use  
bytes(header) to get the raw bytes.

A good API for getting the parsed and decoded header values needs to  
take into account that it won't always be a string.  For unstructured  
headers like Subject, str(header) would work just fine.  For an  
Originator or Destination address, what does str(header) return?  And  
what would be the API for getting the set of realname/addresses out of  
the header?
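
A small sketch of the "always a Header instance" idea for the unstructured case, assuming the stored value is the raw ASCII wire form. SketchHeader is a hypothetical class, not the existing email.header.Header; it leaves the structured-header questions above unanswered.

```python
from email.header import decode_header, make_header


class SketchHeader:
    """Hypothetical header value wrapping the raw (ASCII) wire bytes."""

    def __init__(self, raw):
        self._raw = raw

    def __bytes__(self):
        # The raw bytes, exactly as stored.
        return self._raw

    def __str__(self):
        # Decode any RFC 2047 encoded-words into text.
        return str(make_header(decode_header(self._raw.decode('ascii'))))


header = SketchHeader(b'=?utf-8?q?Caf=C3=A9?=')
raw, text = bytes(header), str(header)
```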

-Barry




Re: [Python-Dev] Dropping bytes "support" in json

by Barry Warsaw

On Apr 10, 2009, at 3:22 PM, Stephen J. Turnbull wrote:

> Robert Brewer writes:
>
>> Syntactically, there's no sense in providing:
>>
>>    Message.set_header('Subject', 'Some text', encoding='utf-16')
>>
>> ...since you could more clearly write the same as:
>>
>>    Message.set_header('Subject', 'Some text'.encode('utf-16'))
>
> Which you now must *parse* and guess the encoding to determine how to
> RFC-2047-encode the binary mush.  I think the encoding parameter is
> necessary here.

Agreed!  In fact, it's redundant to explicitly encode the string.  So  
the first spelling is preferred.

>> But it would be far easier to do all the encoding at once in an
>> output() or serialize() method. Do different headers need different
>> encodings?
>
> You can have multiple encodings within a single header (and a naïve
> algorithm might very well encode "The price of Gödel-Escher-Bach is
> €25" as "The price of =?ISO-8859-1?Q?G=F6del-Escher-Bach?= is
> =?ISO-8859-15?Q?=A425?=").

Isn't email just wonderful?  Please, spam and Facebook, kill it off  
once and for all, won't you?
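
To see the mixed-encoding case concretely, Stephen's example can be fed to the standard library's decode_header and make_header. This is a sketch of decoding only; it says nothing about how an encoder should choose charsets.

```python
from email.header import decode_header, make_header

raw = ('The price of =?ISO-8859-1?Q?G=F6del-Escher-Bach?= is '
       '=?ISO-8859-15?Q?=A425?=')

# Each part is a (value, charset) pair; charset is None for plain runs.
parts = decode_header(raw)
charsets = {charset for _, charset in parts}

# make_header() decodes every part with its own charset.
text = str(make_header(parts))
```

The set of charsets contains two different ISO-8859 variants plus None, which is why a single per-header encoding parameter cannot describe every header found in the wild.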

-Barry




Re: [Python-Dev] headers api for email package

by Barry Warsaw

On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:

> Barry Warsaw wrote:
>> >>> message['Subject']
>> The raw bytes or the decoded unicode?
>
> A header object.

Yep.  You got there before I did. :)

>> Okay, so you've picked one.  Now how do you spell the other way?
>
> str(message['Subject'])

Yes for unstructured headers like Subject.  For structured headers...  
hmm.

> bytes(message['Subject'])

Yes.

>> Now, setting headers.  Sometimes you have some unicode thing and  
>> sometimes you have some bytes.  You need to end up with bytes in  
>> the ASCII range and you'd like to leave the header value unencoded  
>> if so.  But in both cases, you might have bytes or characters  
>> outside that range, so you need an explicit encoding, defaulting to  
>> utf-8 probably.
>> >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
>> >>> Message.set_header('Subject', b'Some bytes')
>
> Where you just want "a damned valid email and stop making my life  
> hard!":
>
> Message['Subject']='Some text'

Yes.  In which case I propose we guess the encoding as 1) ascii, 2)  
utf-8, 3) wtf?

> Where you care about what encoding is used:
>
> Message['Subject']=Header('Some text',encoding='utf-8')

Yes.

> If you have bytes, for whatever reason:
>
> Message['Subject']=b'some bytes'.decode('utf-8')
>
> ...because only you know what encoding those bytes use!

So you're saying that __setitem__() should not accept raw bytes?

-Barry




Re: [Python-Dev] headers api for email package

by R. David Murray

On Mon, 13 Apr 2009 at 10:28, Barry Warsaw wrote:
> On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:
>
>> Barry Warsaw wrote:
>> > >>> message['Subject']
>> > The raw bytes or the decoded unicode?
>>
>> A header object.
>
> Yep.  You got there before I did. :)

+1

>> > Okay, so you've picked one.  Now how do you spell the other way?
>>
>> str(message['Subject'])
>
> Yes for unstructured headers like Subject.  For structured headers... hmm.

Some "reasonable" printable interpretation that has no semantic meaning?

>> bytes(message['Subject'])
>
> Yes.
>
>> > Now, setting headers.  Sometimes you have some unicode thing and
>> > sometimes you have some bytes.  You need to end up with bytes in the
>> > ASCII range and you'd like to leave the header value unencoded if so.
>> > But in both cases, you might have bytes or characters outside that range,
>> > so you need an explicit encoding, defaulting to utf-8 probably.
>> > > > >  Message.set_header('Subject', 'Some text', encoding='utf-8')
>> > > > >  Message.set_header('Subject', b'Some bytes')
>>
>> Where you just want "a damned valid email and stop making my life hard!":
>>
>> Message['Subject']='Some text'
>
> Yes.  In which case I propose we guess the encoding as 1) ascii, 2) utf-8, 3)
> wtf?

Given some usenet postings I've just dealt with, (3) appears to
sometimes be spelled 'x-unknown' and sometimes (in the most recent case)
'unknown-8bit'.  A quick google turns up a hit on RFC1428 for the latter,
and a bunch of trouble tickets for the former...so I think 'wtf' is
correctly spelled 'unknown-8bit'.

However, it's not supposed to be used by mail composers, who are
expected to know the encoding.  It's for mail gateways that are
transforming something and don't know the encoding.  I'm not
sure what this means for the email module, which certainly
will be used in mail gateways... maybe it's the responsibility
of the application code to explicitly say 'unknown encoding'?
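
If application code did have to say "unknown encoding" explicitly, one possible shape is sketched below. The helper label_unknown_bytes is hypothetical; the only things taken as given are the RFC 2047 encoded-word syntax and the 'unknown-8bit' charset label from RFC 1428. The sketch ignores the RFC 2047 limit of 75 characters per encoded-word, so it is only suitable for short values.

```python
import base64


def label_unknown_bytes(raw):
    """Wrap undeclared 8-bit data in an encoded-word (hypothetical helper).

    The payload uses base64 ("B" encoding), so the result is pure ASCII
    and the original bytes can be recovered unchanged.
    """
    payload = base64.b64encode(raw).decode('ascii')
    return '=?unknown-8bit?b?' + payload + '?='


word = label_unknown_bytes(b'caf\xe9')
```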

>> Where you care about what encoding is used:
>>
>> Message['Subject']=Header('Some text',encoding='utf-8')
>
> Yes.
>
>> If you have bytes, for whatever reason:
>>
>> Message['Subject']=b'some bytes'.decode('utf-8')
>>
>> ...because only you know what encoding those bytes use!
>
> So you're saying that __setitem__() should not accept raw bytes?

If I'm understanding things correctly, if it did accept bytes the
person using that interface would need to do whatever encoding (eg:
encoded-word) was needed, so the interface should check that the byte
string is 8 bit clean.  But having some sort of 'setraw' method on Header
might be better for that case.

--David

Re: [Python-Dev] Dropping bytes "support" in json

by Tony Nelson

At 10:11 -0400 04/13/2009, Barry Warsaw wrote:

>On Apr 10, 2009, at 11:08 AM, James Y Knight wrote:
>
>> Until you write a parser for every header, you simply cannot decode
>> to unicode. The only sane choices are:
>> 1) raw bytes
>> 2) parsed structured data
>
>The email package does not need a parser for every header, but it
>should provide a framework that applications (or third party
>libraries) can use to extend the built-in header parsers.  A bare
>minimum for functionality requires a Content-Type parser.  I think the
>email package should also include an address header (Originator,
>Destination) parser, and a Message-ID header parser.  Possibly
>others.  The default would probably be some unstructured parser for
>headers like Subject.

I think the email package should have a parser for every header.  All the
headers defined in normal mail RFCs should have their own parser, and there
would be a default parser for unhandled headers, probably the Unstructured
parser.  Users could add their own, probably by importing some module
that knew how to add its parsing to the email package parsers.
--
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson@...>
      '                              <http://www.georgeanelson.com/>

Re: [Python-Dev] Dropping bytes "support" in json

by Tony Nelson

At 10:14 -0400 04/13/2009, Barry Warsaw wrote:
 ...
>Actually, thinking about this over the weekend, it's much better for
>message['subject'] to return a Header instance in all cases.  Use
>bytes(header) to get the raw bytes.

I don't agree.  I'd want it to return the appropriate type for that header:
string for Subject:, a list of addresses for To:, and so on.  Either the
user knows what to expect, or they'll learn immediately.  If they get a
Header, they have to then extract the appropriate data from it, based on
its type (but they only know the name).

OK, Header instances could have a .useful field that returned the useful
data in all instances.  But in any case, the email package should guide
users in the correct usage, rather than leaving every choice seeming equal,
when only one choice is correct.


>A good API for getting the parsed and decoded header values needs to
>take into account that it won't always be a string.  For unstructured
>headers like Subject, str(header) would work just fine.  For an
>Originator or Destination address, what does str(header) return?  And
>what would be the API for getting the set of realname/addresses out of
>the header?

msg[<headername>] would be the preferred way.

msg.get_header(<headername>).useful would return the useful data form of
any header.

msg.get_header(<headername>).addresses would return the address list from
any address Header, and raise AttributeError with other Headers.
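
A sketch of how the .useful and .addresses attributes might behave. The two classes are hypothetical stand-ins for Header subclasses; email.utils.getaddresses is the only real library call. Because UnstructuredHeader simply does not define .addresses, accessing it raises AttributeError as described.

```python
from email.utils import getaddresses


class UnstructuredHeader:
    """Hypothetical header type for free-form text such as Subject."""

    def __init__(self, value):
        self._value = value

    @property
    def useful(self):
        # The natural data type for unstructured headers is a string.
        return self._value


class AddressHeader:
    """Hypothetical header type for To, From, Cc and friends."""

    def __init__(self, value):
        self._value = value

    @property
    def addresses(self):
        # List of (realname, address) pairs.
        return getaddresses([self._value])

    @property
    def useful(self):
        # The natural data type for address headers is the address list.
        return self.addresses


to = AddressHeader('Alice <alice@example.com>')
subject = UnstructuredHeader('Hello')
has_addresses = hasattr(subject, 'addresses')
```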
--
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson@...>
      '                              <http://www.georgeanelson.com/>

Re: [Python-Dev] headers api for email package

by Stephen J. Turnbull

Barry Warsaw writes:
 > On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:
 >
 > > Barry Warsaw wrote:
 > >> >>> message['Subject']
 > >> The raw bytes or the decoded unicode?
 > >
 > > A header object.
 >
 > Yep.  You got there before I did. :)
 >
 > >> Okay, so you've picked one.  Now how do you spell the other way?
 > >
 > > str(message['Subject'])
 >
 > Yes for unstructured headers like Subject.  For structured headers...  
 > hmm.

Well, suppose we get really radical here.  *People* see email as
(rich-)text.  So ... message['Subject'] returns an object, partly to
be consistent with more complex headers' APIs, but partly to remind us
that nothing in email is as simple as it seems.  Now,
str(message['Subject']) is really for presentation to the user, right?
OK, so let's make it a presentation function!  Decode the MIME-words,
optionally unfold folded lines, optionally compress spaces, etc.  This
by default returns the subject field as a single, possibly quite long,
line.  Then a higher-level API can rewrap it, add fonts etc, for fancy
presentation.  This also suggests that we don't want the field tag (ie,
"Subject") to be part of this value.

Of course a *really* smart higher-level API would access structured
headers based on their structure, not on the one-size-fits-all str()
conversion.
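
A sketch of str() as a presentation function along those lines. The function name present and its two flags are hypothetical; decoding relies on the standard library's decode_header and make_header, and the unfolding is a simple regular expression rather than a full RFC 5322 implementation.

```python
import re
from email.header import decode_header, make_header


def present(value, unfold=True, compress=True):
    """Return a single-line, human-readable form of a header value."""
    if unfold:
        # Drop line breaks that are followed by folding whitespace.
        value = re.sub(r'\r?\n(?=[ \t])', '', value)
    # Decode the MIME encoded-words into text.
    text = str(make_header(decode_header(value)))
    if compress:
        # Collapse runs of spaces and tabs into one space.
        text = re.sub(r'[ \t]+', ' ', text).strip()
    return text


shown = present('Re: a\r\n  long   =?utf-8?q?caf=C3=A9?= subject')
```

The field name is deliberately not an argument, matching the suggestion that the tag should not be part of the value.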

Then MTAs see email as a string of octets.  So guess what:

 > > bytes(message['Subject'])

gives wire format.  Yow!  I think I'm just joking.  Right?

 > >> Now, setting headers.  Sometimes you have some unicode thing and  
 > >> sometimes you have some bytes.  You need to end up with bytes in  
 > >> the ASCII range and you'd like to leave the header value unencoded  
 > >> if so.  But in both cases, you might have bytes or characters  
 > >> outside that range, so you need an explicit encoding, defaulting to  
 > >> utf-8 probably.
 > >> >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
 > >> >>> Message.set_header('Subject', b'Some bytes')
 > >
 > > Where you just want "a damned valid email and stop making my life  
 > > hard!":

-1  I mean, yeah, Brother, I feel your pain but it just isn't that
easy.  If that were feasible, it would be *criminal* to have a
.set_header() method at all!  In fact,

 > > Message['Subject']='Some text'

is going to (a) need to take *only* unicodes, or (b) raise Exceptions
at the slightest provocation when handed bytes.

And things only get worse if you try to provide this interface for say
"From" (let alone "Content-Type").  Is it really worth doing the
mapping interface if it's only usable with free-form headers (ie, only
Subject among the commonly used headers)?

 > Yes.  In which case I propose we guess the encoding as 1) ascii, 2)  
 > utf-8, 3) wtf?

Uh, what guessing?  If you don't know what you have but you believe it
to be a valid header field, then presumably you got it off the wire
and it's still in bytes and you just spit it out on the wire without
trying to decode or encode it.  But as I already said, I think that's
a bad idea.  Otherwise, you should have a unicode, and you simply look
at the range of the string.  If it fits in ASCII, Bob's your uncle.
If not, Bob's your aunt (and you use UTF-8).
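
A small sketch of that rule, assuming the value is already a unicode
string.  The function name is made up for illustration; Header is the
existing email.header.Header class.

```python
from email.header import Header


def encode_header_value(text):
    """Choose the encoding by looking at the range of the string."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # Outside ASCII: RFC 2047-encode the value as UTF-8.
        return Header(text, "utf-8").encode()
    # Fits in ASCII: leave the value unencoded.
    return text
```

No guessing is involved: the only decision is whether every character
fits in ASCII.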

 > > Where you care about what encoding is used:
 > >
 > > Message['Subject']=Header('Some text',encoding='utf-8')
 >
 > Yes.
 >
 > > If you have bytes, for whatever reason:
 > >
 > > Message['Subject']=b'some bytes'.decode('utf-8')
 > >
 > > ...because only you know what encoding those bytes use!
 >
 > So you're saying that __setitem__() should not accept raw bytes?

How do you distinguish "raw" bytes from "encoded bytes"?
__setitem__() shouldn't accept bytes at all.  There should be an API
which sets a .formatted_for_the_wire member, and it should have a
"validate" option (ie, when true the API attempts to parse the header
and raises an exception if it fails to do so; when false, it assumes
you know what you're doing and will send out the bytes verbatim).
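
As a rough illustration only: the class and method names below are
hypothetical, and the "parse" is a deliberately crude stand-in for a real
RFC 5322 header parser.

```python
class WireHeader:
    """Hypothetical holder for a header supplied directly in wire format."""

    def __init__(self):
        self.formatted_for_the_wire = None

    def set_wire_format(self, data, validate=True):
        if not isinstance(data, bytes):
            raise TypeError("wire format must be bytes")
        if validate:
            self._check(data)
        # With validate=False the bytes are trusted and stored verbatim.
        self.formatted_for_the_wire = data

    @staticmethod
    def _check(data):
        # Crude check: ASCII only, and a printable field name before ':'.
        try:
            text = data.decode("ascii")
        except UnicodeDecodeError:
            raise ValueError("header contains non-ASCII bytes") from None
        name, sep, _ = text.partition(":")
        if not sep or not name or any(c <= " " or c == "\x7f" for c in name):
            raise ValueError("no valid field name before ':'")
```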

API for Header objects [was: Dropping bytes "support" in json]

by Stephen J. Turnbull

Tony Nelson writes:

 > OK, Header instances could have a .useful field that returned the useful
 > data in all instances.  But in any case, the email package should guide
 > users in the correct usage, rather than leaving every choice seeming equal,
 > when only one choice is correct.

What do you mean by "only one choice is correct?"  For example, a
Destination field might be used for presentation (in which case the
display names are needed), or to compose a list of recipients (when
they should be discarded).  Some applications might prefer to receive
the combination as the original string (although that often is not
valid RFC-any), others might prefer it parsed into a pair of display
name and mailbox.

Quoth Barry Warsaw:

 > >A good API for getting the parsed and decoded header values needs to
 > >take into account that it won't always be a string.  For unstructured
 > >headers like Subject, str(header) would work just fine.  For an
 > >Originator or Destination address, what does str(header) return?

A string (not folded) of comma-separated addresses in "Display Name"
<po@...> form.

 > >And what would be the API for getting the set of
 > >realname/addresses out of the header?

Does there need to be one?  An AddressHeader object could support
indexing: message['To'][0] returns the first displayname,mailbox pair.
If you really want a list, what's wrong with list(header)?  (Yes, I
recall that you (Barry) said you don't think subclassing worked very
well, but I wonder if maybe we can't get it righter this time around.)
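
A sketch of what such an object might look like.  AddressHeader here is a
hypothetical class, not part of the email package; getaddresses() and
formataddr() are the existing email.utils functions.

```python
from email.utils import formataddr, getaddresses


class AddressHeader:
    """Hypothetical address header holding (displayname, mailbox) pairs."""

    def __init__(self, raw_value):
        self._pairs = getaddresses([raw_value])

    def __getitem__(self, index):
        # message['To'][0] would return the first (displayname, mailbox) pair.
        return self._pairs[index]

    def __len__(self):
        return len(self._pairs)

    def __str__(self):
        # Unfolded, comma-separated addresses for presentation.
        return ", ".join(formataddr(pair) for pair in self._pairs)
```

Because the class supports indexing and len(), list(header) works without
any further API.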

 > msg[<headername>] would be the preferred way.

This goes against the principle that this returns a Header object.
For one thing, I really think that there need to be some common
methods all Header objects support, like str() and to_wire_format().
Also, if this returns a list for 'To', then str(msg['To']) won't work
right: it will return the list enclosed in square brackets and the
mailbox portions will be quoted, which isn't useful.

 > msg.get_header(<headername>).useful would return the useful data form of
 > any header.

Er, shouldn't we just throw away the data that is never useful?<wink>

 > msg.get_header(<headername>).addresses would return the address list from
 > any address Header, and raise AttributeError with other Headers.

Yes, but a list of what?  Strings?  Bytes?  Displayname/mailbox pairs?

Re: API for Header objects [was: Dropping bytes "support" in json]

by Tony Nelson

At 03:38 +0900 04/14/2009, Stephen J. Turnbull wrote:

>Tony Nelson writes:
>
> > OK, Header instances could have a .useful field that returned the useful
> > data in all instances.  But in any case, the email package should guide
> > users in the correct usage, rather than leaving every choice seeming equal,
> > when only one choice is correct.
>
>What do you mean by "only one choice is correct?"  For example, a
>Destination field might be used for presentation (in which case the
>display names are needed), or to compose a list of recipients (when
>they should be discarded).  Some applications might prefer to receive
>the combination as the original string (although that often is not
>valid RFC-any), others might prefer it parsed into a pair of display
>name and mailbox.
 ...

Assuming that by "Destination" you mean a class of Address header fields
(there is no Destination: header field), such header fields contain
addresses.  These can be considered to contain (as the email package
does) a list of (name, email address) pairs or, at a lower level, to also
have Comments.  Either way there is indeed only one correct choice, which
is the one the email package currently provides the diligent user.  I wish
it to be the one obvious choice, so that less study is needed to properly
use the email package.

Any use that wishes to discard the email addresses in favor of the friendly
names can do so most easily from the parsed [(name, address)], not from the
bytes.  Parsing Address header fields is hard.  Note that Address headers
are not Text, as only certain tokens -- not part of the email addresses --
can be RFC 2047-encoded.
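
A short illustration using the existing email.utils.getaddresses(); the
addresses are made up.  It shows why working from the parsed pairs is
easier than working from the raw string: a comma inside a quoted name
defeats a naive split.

```python
from email.utils import getaddresses

raw = '"Nelson, Tony" <tony@example.com>, bob@example.com'

# Parsed [(name, address)] pairs, as the email package provides them.
pairs = getaddresses([raw])

# Friendly names for presentation (fall back to the address if no name).
names = [name or address for name, address in pairs]

# Bare addresses for composing a list of recipients.
recipients = [address for _, address in pairs]

# Naive splitting breaks the quoted display name into two pieces.
naive = [part.strip() for part in raw.split(",")]
```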
--
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson@...>
      '                              <http://www.georgeanelson.com/>

Re: [Python-Dev] headers api for email package

by Stephen J. Turnbull

Removing Python-Dev from the addressees.

Steven D'Aprano writes:
 > On Tue, 14 Apr 2009 03:15:20 am Stephen J. Turnbull wrote:
 >
 > > *People* see email as (rich-)text.
 >
 > We do?

Yup.  You don't see the email, you see a *presentation* of that email.
That presentation is usually text, plus possibly some other stuff
(fonts, highlighting, active links, images).  Thus the "(rich-)".

 > It's not clear what you actually mean by "(rich-)text".

I mean presentation.  I mean "human readable".  I mean Unicode.  I
mean "Do Not Feed The Program" (not for machine processing -- so your
associations with virii are completely off the mark).

 > rich-text. I guess you mean Unicode characters. Am I right?

No.  I mean presentation, which for Python purposes includes but is
not limited to Unicode.

 > Now, correct me if I'm wrong, but I don't think mail headers can
 > actually be anything *but* bytes.

On the wire.  email's Headers have applications other than putting
bytes on the wire.

 > If you're proposing converting those bytes into characters, that's all
 > very well and good, but what's your strategy for dealing with the
 > inevitable wrongly-formatted headers?

Whatever you want it to be.  There are a number of such strategies,
some of which should be among the batteries we include.
Header.__str__() will need to know how to find out which is in effect,
of course.

 > If the header can't be correctly decoded into text, there still
 > needs to be a way to get to the raw bytes.

Sure.  That's what Header.__bytes__() will do.  Specifically, if you
have a Header that was parsed out of a message received over the wire,
it will return a verbatim copy of the header as received, folding
whitespace, CRLFs, and all.  If the Header was constructed (including
editing a received header), then __bytes__ will construct the wire
format, and optionally cache it as if it were a received header.  (But
this has some gotchas, see below.)
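
A sketch of that behaviour.  SketchHeader is a hypothetical class, and
the constructed wire format is simplified: it assumes an ASCII-only
value and does no folding or RFC 2047 encoding.

```python
class SketchHeader:
    """Hypothetical header that remembers its received wire bytes."""

    def __init__(self, name, value, wire=None):
        self.name = name
        self._value = value
        self._wire = wire  # verbatim bytes as received, or a cached copy

    @classmethod
    def from_wire(cls, data):
        # Keep the received bytes verbatim alongside a decoded value.
        text = data.decode("ascii", "surrogateescape")
        name, _, value = text.partition(":")
        return cls(name, value.strip(), wire=data)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._value = new_value
        self._wire = None  # editing invalidates the received or cached bytes

    def __bytes__(self):
        if self._wire is not None:
            return self._wire
        # Constructed header: build the wire format and cache it.
        wire = ("%s: %s\r\n" % (self.name, self._value)).encode("ascii")
        self._wire = wire
        return wire
```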

 > >  > > bytes(message['Subject'])
 > >
 > > gives wire format.  Yow!  I think I'm just joking.  Right?
 >
 > Er, I'm not sure. Are you joking? I hope not, because it is important to
 > be able to get to the raw, unmodified bytes that the MTA sees, without
 > all the fancy processing you suggest.

Er, I'm not suggesting any processing in particular.  I'm suggesting
an API in which str(header) produces a text/plain rendering of the
field contents, with no folding, MIME words, or other wire format
detritus, suitable for human viewing, more or less (specifically, it
might be a rather long line).  bytes(header) produces the wire format,
either verbatim as received or as constructed based on client input.

Note that an issue here is that a received header may be bogus, in
which case you *don't* want bytes(header) to simply return the
original and then spew over the wire.  Should it raise an Exception or
"fix up" the bytes?  I don't know, and thus I wonder if this proposed
API might just be a joke, not something you can dare use in a
production application.

Of course, str() and bytes() as proposed here are not necessarily what
you want.  So there will need to be ways to access the internal
representation of Header directly (or via further specialized
formatter functions if string or bytes format is preferred to
structured objects).

 > Again, correct me if I'm wrong, but *all* valid mail headers must fit in
 > ASCII.

Of course, that's true on the wire.  I've assumed that everybody here
is assuming STD 11 (currently RFC 822 according to rfc-editor.org)
folding of long header lines and RFC 2047 encoding of characters
outside of the restricted-ASCII repertoire (RFC 5322 at least doesn't
permit all the ASCII control characters) before putting it on the
wire.  This is basically a solved problem, though, so I didn't bother
mentioning it.  Sorry for the confusion.

But what we're talking about *here* are email APIs that may or may not
be directly connected to a display or wire.  There is no reason why
headers *must* be represented as bytes, strings, or anything else in a
Header, and no reason why the bytes or str format *must* be RFC
compatible.

I think it's quite sensible to specify "bytes(header) will be RFC
5322-conforming", but we need to specify how to handle bogus headers
that we have received and not edited.  Should we ever raise an
Exception, and if so, in what contexts?  Should we "fix up" the
bogosity somehow?  Should we delete the offensive header?  Should we
pass it on verbatim, and leave it to a higher level to verify(!) and
decide what to do about it?  Do the RFCs say anything about all this
(eg, with broken trace headers I think it's implied that we pass them
on verbatim)?
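
To make the options concrete, here is a sketch in which the choice is a
parameter.  The function, the strategy names, and the crude "looks OK"
test are all hypothetical; nothing here answers which strategy is right.

```python
def emit_received(raw, on_bogus="raise"):
    """Return the bytes to put on the wire for a received, unedited header."""
    # Crude plausibility test: contains a colon and only 7-bit bytes.
    looks_ok = b":" in raw and all(b < 128 for b in raw)
    if looks_ok or on_bogus == "pass":
        # Pass on verbatim; a higher level may verify and decide.
        return raw
    if on_bogus == "raise":
        raise ValueError("bogus header: %r" % (raw,))
    if on_bogus == "fix":
        # Crude fix-up: replace every 8-bit byte with '?'.
        return bytes(b if b < 128 else ord("?") for b in raw)
    if on_bogus == "drop":
        # Delete the offending header entirely.
        return b""
    raise ValueError("unknown strategy: %r" % (on_bogus,))
```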