best xml parser to use


by petera

Hi

I have a particular problem to solve:

I have an XML batch file that contains individual XML invoices. I need to extract these invoices one at a time and
place them on a message queue, i.e. I just need to get all the data between the invoice start and end tags, put it
in a string and place it on a message queue (validation of the invoice itself occurs on the receiver side).

What is likely to be my best approach in terms of performance: DOM (unlikely, I guess), SAX, StAX, or simply writing a Java program using indexOf?


TIA Peter

RE: [xml-dev] best xml parser to use

by Michael Kay

 
>
> What is likely to be my best approach, DOM (unlikely I
> guess), SAX, StAX or simply writing a java program using
> indexOf, in terms of performance ?
>

Your performance, or the machine's performance?

How big is the input file?

Michael Kay
http://www.saxonica.com/


Re: [xml-dev] best xml parser to use

by Bob DuCharme

I don't know much about StAX, but if you'll be processing the data
linearly (i.e. with no need to rearrange it as part of your processing)
SAX should be fine and quick.

Bob

On Wed, September 6, 2006 9:09 am, petera wrote:

>
> Hi
>
> I have a particular problem to solve:
>
> I have an xml batch file that contains individual xml invoices. I need to
> extract these xml invoices one at a time and
> place them on a message queue i.e. I just need to get all the data between
> the invoice start and end tags put it
> in a string and place it on a message queue (validation occurs on the
> invoice itself on the receiver side).
>
> What is likely to be my best approach, DOM (unlikely I guess), SAX, StAX
> or
> simply writing a java program using indexOf, in terms of performance ?
>
>
> TIA Peter
>
> --
> View this message in context:
> http://www.nabble.com/best-xml-parser-to-use-tf2226882.html#a6171113
> Sent from the Xml.org Dev forum at Nabble.com.
>
>


RE: [xml-dev] best xml parser to use

by petera

Michael,

The machine's performance.

The input files could be quite large, over 1 MB, and they arrive from different sources.

Peter


RE: [xml-dev] best xml parser to use

by Michael Kay

>
> the machines performance.
>
> the input files could be quite large over 1mb and arriving
> from different sources.
>

Well, only you know the performance requirements, but for a file as small as
1 MB many people would do the job in XSLT. It's not the fastest option, but
my guess would be that it's probably capable of doing the job, and you'll be
left with something that's easier to maintain.

Michael Kay
http://www.saxonica.com/



RE: [xml-dev] best xml parser to use

by Spies, Brennan

http://java.sun.com/webservices/docs/1.6/tutorial/doc/SJSXP2.html





Re: [xml-dev] best xml parser to use

by K. W. Landry

If you're coding in Java I'd suggest xmlbeans. I've found xmlbeans fast, easy and quick to employ; it has been very handy in about half a dozen projects now.

You need to compile the schema, which generates Java code that lets you reference any element directly. Then simply reference the invoice structure's topmost element and do as you wish: write the XML to the queue as simple text; or create a new XML document (which just provides the XML header at the start of the file), add only this copied element to it and write that to the queue; or strip all or selected parts, etc., and write to the queue. Then iterate to the next invoice or batch file. It could be 20 lines of code, tops.

If you don't have a schema to feed into the schema compiler, there are a couple of tools with which you can build a schema and a couple that infer a schema from sample XML.
 
KWL

 



Re: [xml-dev] best xml parser to use

by justinedelson

I'd go with a DOM-style API (either DOM, JDOM, dom4j, or XOM). The files aren't that big. Using SAX or StAX will require that you build up the extracted string manually, whereas with DOM the process looks like:

1) parse document
2) loop through invoice child elements
3) serialize each child element to a String and post

You could use xmlbeans or another data binding framework, but then you're just substituting a schema-specific data model for the generic DOM data model. It doesn't sound like you care about the internal structure and content of an invoice, so generating the new Java classes required for data binding is unnecessary overhead.
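
The three DOM steps above could be sketched in Java roughly as follows. This is only an illustration built on the JDK's own JAXP classes, not code from the thread: the class name DomInvoiceSplitter, the split method, and the assumption that every element child of the batch root is one invoice are all hypothetical. The batch is passed as a String purely to keep the sketch self-contained; a real version would more likely parse an InputStream.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: the name and the "every child element is an invoice"
// rule are assumptions made for this sketch.
class DomInvoiceSplitter {

    /** Returns each element child of the batch root, serialized as its own XML string. */
    static List<String> split(String batchXml) {
        try {
            // 1) parse the whole document into memory
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(batchXml)));

            // A Transformer created without a stylesheet performs an identity copy,
            // which is used here purely as a serializer.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // 2) loop through the child elements of the root
            List<String> invoices = new ArrayList<>();
            for (Node child = doc.getDocumentElement().getFirstChild();
                    child != null; child = child.getNextSibling()) {
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes, comments, etc.
                }
                // 3) serialize the child element to a String
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(child), new StreamResult(out));
                invoices.add(out.toString());
            }
            return invoices;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not split batch document", e);
        }
    }
}
```

Each returned string would then be posted to the queue by whatever messaging API is in use; that part is left out because the thread does not say which one it is.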





Re: [xml-dev] best xml parser to use

by K. W. Landry

I'd agree with Justin on these points; there is excess that can be avoided. The unused classes generated by xmlbeans would simply sit idly by while the handful of classes that are necessary do all the heavy lifting.

The finer point I would add, however, is that generating the code is a one-time effort. That only holds as long as the schema, or structure, of the XML you're processing never changes.

One point of discussion I'll add is that I believe the coding needed to accomplish the task described would be most efficient with xmlbeans, in that the xmlbeans implementation will do a lot of the heavy lifting for you, without your having to code many details explicitly, as I believe you would need to do with a strictly DOM-focused approach.

A big caveat, however, is that I haven't used XOM, and have used JDOM and dom4j only minimally compared to xmlbeans. I'm sure others out there have more experience and can weigh in on that thought.

Overall, however, I believe the leanest and meanest approach for this in terms of performance and resource consumption is a StAX implementation.
 
KWL
 

 





Re: best xml parser to use

by petera


Thanks to all who replied to my e-mail.

The approach I have decided upon is to try two implementations:

1. XSLT

2. StAX

XSLT is really the simplest, but my bosses might not like the memory requirements, so StAX would be a good alternative.

The URL provided by Brennan is excellent on StAX: http://java.sun.com/webservices/docs/1.6/tutorial/doc/SJSXP2.html 

Peter


Re: [xml-dev] best xml parser to use

by Stefan Tilkov-2

On Sep 6, 2006, at 10:08 PM, K. W. Landry wrote:

If you're coding in Java I'd suggest xmlbeans. I've found xmlbeans fast, easy, quick to employ; very handy in about half a dozen projects now.
 

This suggestion seems to be extreme overkill, as you're not even interested in the contents of the invoices.
In terms of performance, I would expect StAX and SAX to be roughly equal. Something hardcoded that is not XML-aware will be a lot faster, but much more error-prone.
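
To make the "hardcoded, not XML-aware" option concrete, here is a minimal sketch of the indexOf idea from the original question. The class name NaiveInvoiceSplitter and the element name "invoice" are assumptions for illustration; the thread never gives the real tag name. The comments list some of the ways this kind of code goes wrong, which is the error-proneness referred to above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the "invoice" tag name is an assumption.
class NaiveInvoiceSplitter {

    /** Cuts out everything between each "<invoice" and the next "</invoice>". */
    static List<String> split(String batchXml) {
        final String open = "<invoice";
        final String close = "</invoice>";
        List<String> invoices = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = batchXml.indexOf(open, from);
            if (start < 0) {
                break; // no further invoices
            }
            int end = batchXml.indexOf(close, start);
            if (end < 0) {
                break; // unterminated invoice: silently dropped
            }
            end += close.length();
            invoices.add(batchXml.substring(start, end));
            from = end;
        }
        return invoices;
    }

    // Known weaknesses of this text-only approach:
    // - "<invoice" also matches other names such as <invoices> or <invoiceLine>
    // - a "</invoice>" inside a comment or CDATA section ends the match early
    // - prefixed names (<inv:invoice>), "</invoice >" with whitespace, and
    //   empty-element tags (<invoice/>) are not recognized
    // - nested <invoice> elements are cut at the first closing tag
    // - namespace declarations made on the batch root are lost
    // - character encoding must already have been handled when reading the file
}
```

For well-behaved input such as `<batch><invoice>...</invoice><invoice>...</invoice></batch>` the method returns the two invoice strings unchanged, byte for byte, which an XML-aware serializer does not guarantee.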

Stefan




Re: best xml parser to use

by petera


Sorry, I meant "thread", not "e-mail"!





Re: [xml-dev] Re: best xml parser to use

by Tatu Saloranta

--- petera <peter.anderson@...> wrote:

>
>
> Thanks to all who replied to my e-mail.
>
> The approach I have decided upon is to try two
> implementations:
>
> 1. XSLT
>
> 2. StAX

I would agree with these main choices. While you could
use data binding (xmlbeans, jaxb2), that seems a rather
heavyweight route since you don't care about type
mappings etc., so most of the work would be overhead.
SAX could of course be used, but I don't know of many
benefits over StAX for this use case.

>
> XSLT is really the simplest but my bosses might not
> like the memory
> requirements so StAX would a good alternative
>
> The URL provided by Brennan is excellent on StAX:
> http://java.sun.com/webservices/docs/1.6/tutorial/doc/SJSXP2.html

In addition, if you decide to go the StAX route (which may
make sense if you will eventually get bigger files;
1 MB is still fine with just about any solution, unless there's
lots of concurrent processing), you may want to check
out the stax-utils project

https://stax-utils.dev.java.net/

since the 'raw' StAX API is a bit of a PITA to use for many
tasks. For copying XML, though, using the Event API is quite
straightforward.
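
As a rough illustration of copying XML with the Event API, the sketch below uses only the standard javax.xml.stream classes (not stax-utils or StaxMate). The class name StaxInvoiceSplitter, the invoiceName parameter, and the choice to match on the element's local name are assumptions made for the example. One caveat to check against real data: namespace declarations inherited from the batch root are not copied into each fragment by this sketch, so namespaced input may need the output factory's javax.xml.stream.isRepairingNamespaces property or explicit re-declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper; names are assumptions made for this sketch.
class StaxInvoiceSplitter {

    /** Streams through the batch and returns every element named invoiceName as a string. */
    static List<String> split(String batchXml, String invoiceName) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(batchXml));
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            List<String> invoices = new ArrayList<>();
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && event.asStartElement().getName().getLocalPart().equals(invoiceName)) {
                    invoices.add(copySubtree(event, reader, outputFactory));
                }
            }
            reader.close();
            return invoices;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not split batch document", e);
        }
    }

    /** Copies events from the given start element up to and including its matching end element. */
    private static String copySubtree(XMLEvent start, XMLEventReader reader,
            XMLOutputFactory outputFactory) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter writer = outputFactory.createXMLEventWriter(out);
        writer.add(start);
        int depth = 1; // number of currently open elements inside the invoice
        while (depth > 0) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
            writer.add(event);
        }
        writer.flush();
        writer.close();
        return out.toString();
    }
}
```

Only one invoice is held in memory at a time, apart from the result list used here for demonstration; in the queueing scenario each string could be sent as soon as copySubtree returns instead of being collected.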

You could also try out StaxMate, which I wrote
(http://woodstox.codehaus.org/StaxMate). It
supports accessing XML content in a streaming way
while still allowing hierarchical traversal (in the forward
direction).
Documentation is a bit sparse; the best way in may be to read
the entries at
http://www.cowtowncoder.com/blog/blog.html. I should
write sample code for this particular use case, though,
since it seems to be a recurring question
(usually on the stax_builders list), and it should
also be easy to add a sub-tree pass-through copy
operation, so that this particular task would take just
a couple of lines in total.

-+ Tatu +-



Re: [xml-dev] best xml parser to use

by Tatu Saloranta

--- Stefan Tilkov <info@...> wrote:

> On Sep 6, 2006, at 10:08 PM, K. W. Landry wrote:
>
...
> In terms of performance, I would expect StAX and SAX
> to be roughly  
> equal. Something hardcoded that is not XML-aware
> will be a lot  
> faster, but much more error-prone.

Latter yes, former not necessarily. What I found out
was that reading a stream using the JDK's BufferedReader,
line by line, was slightly slower than parsing the
content as XML. Your mileage may vary, but regular
stream parsing is surprisingly fast nowadays (30 - 40
MBps on typical desktop machines).

-+ Tatu +-

