XML processing

XML processing

by Neil Munro

Hello all,

I'm a final-year university student in the UK, and I've chosen Python as the language for implementing my project, but I'm having trouble getting XML working. I've done a lot of reading and discovered lots of different libraries and ways of doing things, and now I'm unsure which one to use and how.

I'm most familiar with DOM and SAX now, or rather with an overview of them. From what I understand, DOM loads the whole document into memory and retains it until the user removes it; SAX, by contrast, holds only tiny bits in memory at once. My impression is that DOM is great for doing a lot of XML editing, whereas SAX is great for reading a file in and extracting information from it as you go.
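
The contrast described above can be sketched with the standard library alone. This is a minimal illustration, not a recommendation: the preferences document, its element and attribute names, and the helper names are all invented for the example, and it assumes Python 3 with `xml.dom.minidom` and `xml.sax`.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical preferences document, invented for this sketch.
PREFS_XML = b"""<preferences>
  <option name="theme">dark</option>
  <option name="font_size">12</option>
</preferences>"""


def read_with_dom(data):
    """DOM: build the whole tree in memory, then walk it."""
    doc = minidom.parseString(data)
    prefs = {}
    for node in doc.getElementsByTagName("option"):
        # Each <option> holds a single text node in this sample.
        prefs[node.getAttribute("name")] = node.firstChild.data
    doc.unlink()  # release the tree explicitly once done
    return prefs


class OptionHandler(xml.sax.ContentHandler):
    """SAX: react to events as they stream past; no tree is kept."""

    def __init__(self):
        super().__init__()
        self.prefs = {}
        self._name = None   # name of the <option> currently open, if any
        self._chunks = []   # text pieces seen inside that <option>

    def startElement(self, name, attrs):
        if name == "option":
            self._name = attrs["name"]
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self._name is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "option":
            self.prefs[self._name] = "".join(self._chunks)
            self._name = None


def read_with_sax(data):
    handler = OptionHandler()
    xml.sax.parseString(data, handler)
    return handler.prefs
```

Both functions return the same dictionary for this input; the DOM version holds the full document while it works, and the SAX version only ever holds the option it is currently reading.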

So for my initial XML usage (my program is going to deal with a lot of XML) I have used SAX to write a basic preferences file (which is part of my program's design), and that works. Now I am at a loss to understand how to read XML data back in: there are too many libraries to choose from, and different examples show different things, so it's hard to fully understand how things work. If someone can point me to tutorials and answer questions on them, I'd appreciate it.

Many thanks
Neil Munro

_______________________________________________
XML-SIG maillist  -  XML-SIG@...
http://mail.python.org/mailman/listinfo/xml-sig

Re: XML processing

by John W. Shipman

On Fri, 13 Feb 2009, Neil Munro wrote:

+--
| I'm a final year university student in the UK and I've chosen
| python as my language of choice for implementing my desired solution...
|
| ...if someone can point me to tutorials and answer questions on them
+--

Here's some documentation for my Python XML tool of choice.  I'll
be happy to answer questions.

     http://www.nmt.edu/tcc/help/pubs/pylxml/

I avoided this approach, Fredrik Lundh's ElementTree model, for
some time because of its differences from DOM, but what won me
over was its screaming speed.  After I slurped a 500KB file into
memory in about 300msec, I was a convert.

The last section of the above document contains my adaptation of
Fredrik Lundh's builder.py module, which makes the construction
of XML so very easy.  I use it for all my new dynamic Web and
other XML generation tasks now.

This document does not discuss the lxml package's toolset for the
equivalent of SAX (when the document doesn't fit in memory); for
that, refer to the main site:

     http://codespeak.net/lxml/
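
For the case where the document does not fit in memory, one common pattern is incremental parsing. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse` rather than lxml's own toolset; the `<log>`/`<record>` document and the function name are invented for illustration, and it assumes the records are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag="record"):
    """Count matching elements without keeping the full tree in memory."""
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            # Drop the children parsed so far so memory use stays flat.
            root.clear()
    return count


# Hypothetical input; a real use would pass a filename or an open file.
sample = io.BytesIO(b"<log><record id='1'/><record id='2'/><record id='3'/></log>")
print(count_records(sample))
```

The important step is clearing the root after each record is handled; without it, iterparse still builds the complete tree behind the scenes.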

Best regards,
John Shipman (john@...), Applications Specialist, NM Tech Computer Center,
Speare 119, Socorro, NM 87801, (505) 835-5950, http://www.nmt.edu/~john
   ``Let's go outside and commiserate with nature.''  --Dave Farber

Re: XML processing

by Josh English

I started with DOM processing, but the computer I had back then
couldn't handle XML files larger than 60K or so. (Serious memory
limitations.) So I learned SAX to extract data from my XML, but had to
revert to DOM to put data back in. Since these worked differently, I
ended up creating a DOM element, then passing the XML text back to a
SAX parser. It was ugly, but I got around the file size.

I've replaced everything with ElementTree. I'm on a computer without
memory limitations (well, I haven't tried to parse a file too large
yet), and ElementTree does a good job of extracting information as
well as writing human-readable XML.

I think the ElementTree page has a good tutorial. It's good enough for
me, at least.
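
As a rough idea of what extracting and writing looks like with ElementTree, here is a minimal round-trip sketch for a small preferences file. The `<preferences>`/`<option>` layout and the function names are assumptions made up for this example, and it uses the `xml.etree.ElementTree` module shipped with Python 3.

```python
import xml.etree.ElementTree as ET


def prefs_to_xml(prefs):
    """Serialise a dict as <preferences><option name="...">value</option>..."""
    root = ET.Element("preferences")
    for name, value in prefs.items():
        opt = ET.SubElement(root, "option", name=name)
        opt.text = str(value)
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


def xml_to_prefs(text):
    """Read the same layout back into a dict (all values come back as str)."""
    root = ET.fromstring(text)
    return {opt.get("name"): opt.text for opt in root.findall("option")}


xml_text = prefs_to_xml({"theme": "dark", "font_size": 12})
print(xml_text)
print(xml_to_prefs(xml_text))
```

To work with files instead of strings, `ET.parse(path).getroot()` and `ET.ElementTree(root).write(path)` take the place of `fromstring` and `tostring`.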

The only disadvantage I've run into is that my implementation spends a lot
of time parsing and overwriting my XML data files, so that overhead
is costly, but the application is small enough, and the network isn't
getting bogged down, so I can get away with it.


I've got some very old examples at this page:
http://www.spiritone.com/~english/code/ims.html

The code isn't well commented, but hey, this is Python  : )


--
Josh English
Joshua.R.English@...
http://joshenglish.livejournal.com

Re: XML processing

by Stefan Behnel

Hi,

Josh English wrote:
> The only disadvantage I've run into is my implementation spends a lot
> of time parsing and overwriting my XML data files, so that overhead
> is costly

Sounds like you should give lxml a try. It's about as fast as cElementTree
on parsing, but several times faster on serialisation.

http://codespeak.net/lxml/performance.html#parsing-and-serialising

Stefan


Re: XML processing

by Bill Kinnersley

John W. Shipman wrote:
>
> This document does not discuss the lxml package's toolset for the
> equivalent of SAX (when the document doesn't fit in memory);

Can anyone add any substance to this remark?  With today's typical
system RAM of 2GB to 3GB, is it even worth consideration any more that a
document might not fit in memory?

Offhand I'd guess the size of the XML file and the size of the DOM tree
would be in the same ballpark.  So unless I've got more than 500MB of
XML to read, I'm clear.  Right or wrong?



Re: XML processing

by Stefan Behnel

Hi,

Bill Kinnersley wrote:
> Can anyone add any substance to this remark?  With today's typical
> system RAM of 2GB to 3GB, is it even worth consideration any more that a
> document might not fit in memory?

It does let you parse pretty large documents. But think of handling more than
one document in parallel. In that case, you'd still want to make sure things
don't hit the swap disk.


> Offhand I'd guess the size of the XML file and the size of the DOM tree
> would be in the same ballpark.  So unless I've got more than 500MB of
> XML to read, I'm clear.  Right or wrong?

Wrong. The stdlib's minidom in particular is terribly memory hungry. Fredrik
has some benchmarks and memory size hints on his cElementTree page.

http://effbot.org/zone/celementtree.htm#benchmarks

Here are some other benchmarks from Ian Bicking on HTML parsers:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Sadly, I do not know of any direct comparison of lxml.etree and
cElementTree regarding memory usage, but my guess is that cET is still a
bit better than lxml.etree (which is impressively memory friendly already).
A quick comparison for a 3.4MB XML file with a lot of text and very short
tag names (the Old Testament in English) gave me almost exactly the same
parsing time for both. When done, I had a 17MB Python interpreter for lxml.etree
and a 10MB interpreter for cET. Depending on your XML, this may shift in
either direction, as the two optimise their time and memory usage very differently.

For minidom, I get about 60MB, where Fredrik got 80MB. That's still about a
factor of 17-23 compared to the serialised XML file, whereas lxml and cET
end up with a factor of 3-5. Your assumption that you can use a system with
3GB of RAM to parse a 500MB XML file into an in-memory tree can easily turn
out wrong for XML files with more tags and shorter text content (say, numbers),
or for documents with non-European languages.
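
The factors quoted above follow directly from the numbers in this message, and the same arithmetic gives a rough feel for the 500MB case. This is only a back-of-the-envelope sketch: it treats all four measurements as being against the 3.4MB file (as the 17-23 range implies), the interpreter sizes include the interpreter's own baseline, and linear scaling to 500MB is an assumption, not a measurement.

```python
FILE_MB = 3.4  # size of the serialised test document

# Interpreter sizes after parsing, in MB, as reported in the message above.
measured_mb = {
    "minidom (Stefan)": 60,
    "minidom (Fredrik)": 80,
    "lxml.etree": 17,
    "cElementTree": 10,
}

# Ratio of in-memory size to file size for each parser.
factors = {name: mb / FILE_MB for name, mb in measured_mb.items()}

for name, factor in factors.items():
    # Naive linear extrapolation to a 500MB document (an assumption).
    projected_gb = 500 * factor / 1000
    print(f"{name}: about {factor:.1f}x the file size, "
          f"roughly {projected_gb:.1f}GB for a 500MB file")
```

Even at the lower minidom factor, the projection lands well above 3GB, while the lxml.etree and cElementTree projections stay below it for this particular kind of document.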

Stefan

Re: XML processing

by Stefan Behnel


Stefan Behnel wrote:
> For minidom, I get about 60MB, where Fredrik got 80MB. That's still about a
> factor of 17-23 compared to the serialised XML file, whereas lxml and cET
> end up with a factor of 3-5. Your assumption that you can use a system with
> 3GB of RAM to parse a 500MB XML file into an in-memory tree can easily turn
> out wrong for XML files with more tags and shorter text content (say, numbers),
> or for documents with non-European languages.

I should add that this was measured on a 32-bit system. 64-bit systems will
require even more memory to store the tree, almost twice as much for each
element.

Stefan
