email package Bytes vs Unicode (was Re: [Python-Dev] Dropping bytes "support" in json)



by Tony Nelson

(email-sig added)

At 08:07 -0400 04/09/2009, Steve Holden wrote:
>Barry Warsaw wrote:
 ...

>> This is an interesting question, and something I'm struggling with for
>> the email package for 3.x.  It turns out to be pretty convenient to have
>> both a bytes and a string API, both for input and output, but I think
>> email really wants to be represented internally as bytes.  Maybe.  Or
>> maybe just for content bodies and not headers, or maybe both.  Anyway,
>> aside from that decision, I haven't come up with an elegant way to allow
>> /output/ in both bytes and strings (input is I think theoretically
>> easier by sniffing the arguments).
>>
>The real problem I came across in storing email in a relational database
>was the inability to store messages as Unicode. Some messages have a
>body in one encoding and an attachment in another, so the only ways to
>store the messages are either as a monolithic bytes string that gets
>parsed when the individual components are required or as a sequence of
>components in the database's preferred encoding (if you want to keep the
>original encoding most relational databases won't be able to help unless
>you store the components as bytes).
 ...

I found it confusing myself, and did it wrong for a while.  Now, I
understand that messages come over the wire as bytes, either 7-bit US-ASCII
or 8-bit whatever, and are parsed at the receiver.  I think of the database
as a wire to the future, and store the data as bytes (a BLOB), letting the
future receiver parse them as it did the first time, when I cleaned the
message.  Data I care to query is extracted into fields (in UTF-8, what I
usually use for char fields).  I have no need to store messages as Unicode,
and they aren't Unicode anyway.  I have no need ever to flatten a message
to Unicode, only to US-ASCII or, for messages (spam) that are corrupt, raw
8-bit data.
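
A minimal sketch of the scheme described above, assuming the present-day standard library (`email.message_from_bytes` was added in Python 3.2, after this thread) and using `sqlite3` only as a convenient stand-in for whatever database is in use. The table layout, the sample message, and the helper names `store_message` and `load_message` are hypothetical.

```python
import sqlite3
from email import message_from_bytes
from email.header import decode_header, make_header

# Hypothetical raw message as it might arrive over the wire: bytes, not str.
RAW = (
    b"From: sender@example.com\r\n"
    b"Subject: =?utf-8?q?Caf=C3=A9_menu?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Cr\xe8me br\xfbl\xe9e\r\n"
)


def store_message(conn, raw):
    """Store the untouched bytes as a BLOB plus a decoded, queryable subject."""
    msg = message_from_bytes(raw)
    # Decode any RFC 2047 encoded words so the subject column holds real text.
    subject = str(make_header(decode_header(str(msg.get("Subject", "")))))
    cur = conn.execute(
        "INSERT INTO messages (subject, raw) VALUES (?, ?)", (subject, raw)
    )
    return cur.lastrowid


def load_message(conn, rowid):
    """Return the stored bytes and a fresh parse of them."""
    row = conn.execute(
        "SELECT raw FROM messages WHERE id = ?", (rowid,)
    ).fetchone()
    return row[0], message_from_bytes(row[0])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT, raw BLOB)"
)
rowid = store_message(conn, RAW)
stored_raw, reparsed = load_message(conn, rowid)
```

The BLOB column round-trips the message byte for byte, so the future receiver parses exactly what the original receiver saw; only the extracted subject is ever treated as text.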

If you need the data from the message, by all means extract it and store it
in whatever form is useful to the purpose of the database.  If you need the
entire message, store it intact in the database, as the bytes it is.  Email
isn't Unicode any more than a JPEG or other image types (often payloads in
a message) are Unicode.
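
To illustrate why the whole message cannot be treated as one Unicode string while its parts can still be extracted as text, here is a small sketch, again assuming the modern `email` package. The two-part sample message and the helper name `text_parts` are made up for the example; each text part is decoded with the charset it declares for itself.

```python
from email import message_from_bytes

# Hypothetical message whose two text parts use different encodings.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"na\xefve\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"na\xc3\xafve\r\n"
    b"--XX--\r\n"
)


def text_parts(raw):
    """Return (charset, text) for every text/* part of a raw message."""
    msg = message_from_bytes(raw)
    out = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip the multipart container and any binary payloads
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        charset = part.get_content_charset() or "us-ascii"
        out.append((charset, payload.decode(charset, errors="replace")))
    return out


parts = text_parts(RAW)
```

Both parts come out as the same word once decoded per part, but `RAW.decode("utf-8")` on the whole message raises `UnicodeDecodeError`, because the first part's 0xEF byte is not valid UTF-8 in that position.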
--
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson@...>
      '                              <http://www.georgeanelson.com/>
_______________________________________________
Email-SIG mailing list
Email-SIG@...
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com

Re: [Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes "support" in json)

by Barry Warsaw

On Apr 9, 2009, at 12:20 PM, Steve Holden wrote:

> PostgreSQL strongly encourages you to store text as encoded columns.
> Because emails lack an encoding it turns out this is a most inconvenient
> storage type for it. Sadly BLOBs are such a pain in PostgreSQL that it's
> easier to store the messages in external files and just use the
> relational database to index those files to retrieve content, so that's
> what I ended up doing.

That's not insane for other reasons.  Do you really want to store 10MB  
of mp3 data in your database?

Which of course reminds me that I want to add an interface, probably  
to the parser and message class, to allow an application to store  
message payloads in other than memory.  Parsing and holding onto  
messages with huge payloads can kill some applications, when you might  
not care too much about the actual payload content.

Barry
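
The external-file approach quoted above (raw messages on disk, the database holding only an index) could look roughly like the sketch below. It is not Steve's actual code: `sqlite3` stands in for PostgreSQL so the example runs with the standard library alone, and the `mail_index` table, the content-hash file naming, and the helper names are assumptions made for illustration.

```python
import hashlib
import sqlite3
import tempfile
from email import message_from_bytes
from pathlib import Path


def save_message(conn, spool, raw):
    """Write the raw bytes to a file named by content hash and index it."""
    digest = hashlib.sha256(raw).hexdigest()
    path = Path(spool) / (digest + ".eml")
    path.write_bytes(raw)  # the message stays on disk exactly as received
    msg = message_from_bytes(raw)
    conn.execute(
        "INSERT OR REPLACE INTO mail_index (digest, sender, path) "
        "VALUES (?, ?, ?)",
        (digest, str(msg.get("From", "")), str(path)),
    )
    return digest


def load_message(conn, digest):
    """Look up the file path in the index and return the stored bytes."""
    (path,) = conn.execute(
        "SELECT path FROM mail_index WHERE digest = ?", (digest,)
    ).fetchone()
    return Path(path).read_bytes()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mail_index (digest TEXT PRIMARY KEY, sender TEXT, path TEXT)"
)
spool = tempfile.mkdtemp()  # stand-in for a real spool directory
```

The database only ever sees short text fields and a path, so a 10MB attachment never passes through it; the cost is that the files and the index must be kept in step by the application.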


