How to grab all content as-is within an element?

Hello,

I need to grab all the content within an element "as is" and was hoping to leverage woodstox to do it. As an example, my XML may look like this:

    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <Notify xmlns="http://docs.oasis-open.org/wsn/b-2"
                xmlns:ns2="http://www.w3.org/2005/08/addressing"
                xmlns:ns3="http://docs.oasis-open.org/wsrf/bf-2"
                xmlns:ns4="http://docs.oasis-open.org/wsn/t-1"
                xmlns:ns5="http://docs.oasis-open.org/wsn/br-2"
                xmlns:ns6="http://docs.oasis-open.org/wsrf/r-2">
          <NotificationMessage>
            <Topic Dialect="http://docs.oasis-open.org/wsn/t-1/TopicExpression/Simple">myTopic</Topic>
            <Message>
              <hello:hello xmlns:hello="http://com.acme" xmlns="http://com.acme">hello world</hello:hello>
            </Message>
          </NotificationMessage>
        </Notify>
      </soap:Body>
    </soap:Envelope>

I want to extract the content as a String from <hello:hello> to </hello:hello> inclusive. The content is required to be XML but may be any XML. I want to grab the content as-is, with the exact namespaces, etc. used in the original content.

I have used woodstox extensively to grab values (i.e. getText()) from my XML but am having trouble figuring out how to grab content when it is in fact XML. Is there a way to tell woodstox to get me everything as-is between this START_ELEMENT and this END_ELEMENT?

Thanks in advance!

JM

---------------------------------------------------------------------
To unsubscribe from this list, please visit: http://xircles.codehaus.org/manage_email

Re: How to grab all content as-is within an element?

On Mon, May 11, 2009 at 10:58 AM, Jon Miller <jhmiller001@...> wrote:
> ...
> I want to extract the content as a String from <hello:hello> to </hello:hello> inclusive. The content is required to be XML but may be any XML. I want to grab the content as-is, with the exact namespaces, etc. used in the original content.

Just to make sure: contents as text, as if it wasn't parsed at all?

> I have used woodstox extensively to grab values (i.e. getText()) from my XML but am having trouble figuring out how to grab content when it is in fact XML. Is there a way to tell woodstox to get me everything as-is between this START_ELEMENT and this END_ELEMENT?

There is no way to do this directly: all content is parsed, and although the underlying text is temporarily stored in buffers for parsing, there is no guarantee that all of it remains available, even for a single event (for example, when an element name is split across a buffer boundary).

One way you could achieve this would be to use Location information (the starting location for <hello:hello> and the ending location for </hello:hello>). Woodstox does keep exact track of locations (minus possible bugs), but there is no functionality that directly supports this use case.

This is one use case where an alternative like VTD-XML might make sense: it treats the xml text as the exact source, and does not do any entity replacement or real parsing (IMO), which would actually be a benefit in this case. Its API is a bit of an acquired taste, but it would allow doing what you want.

-+ Tatu +-
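The Location idea in this reply could start out roughly like the sketch below, which only covers the first pass: recording one offset at the target START_ELEMENT and one at its matching END_ELEMENT. It uses the standard StAX interfaces (javax.xml.stream) rather than anything Woodstox-specific; the class and method names (OffsetScan, findOffsets) are made up for illustration. One assumption to check against the parser actually in use: the StAX API does not pin down whether the reported offset refers to the start or the end of the current event, so the two numbers may need adjusting before they bracket exactly <hello:hello> ... </hello:hello>.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: first pass of the Location-based approach. */
class OffsetScan {

    /**
     * Returns {offset reported at START_ELEMENT, offset reported at the matching
     * END_ELEMENT} for the first element with the given namespace URI and local
     * name, or null if no such element exists.
     */
    static int[] findOffsets(String xml, String nsUri, String localName) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int start = -1; // stays -1 until the target element has been seen
            int depth = 0;  // nesting depth, counted from the target element
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (start >= 0) {
                        depth++; // a descendant of the target
                    } else if (localName.equals(r.getLocalName())
                            && nsUri.equals(r.getNamespaceURI())) {
                        start = r.getLocation().getCharacterOffset();
                        depth = 1;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT && start >= 0) {
                    if (--depth == 0) {
                        int end = r.getLocation().getCharacterOffset();
                        return new int[] { start, end };
                    }
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<m:Message xmlns:m=\"urn:example\">"
                + "<hello:hello xmlns:hello=\"http://com.acme\">hello world</hello:hello>"
                + "</m:Message>";
        int[] offsets = findOffsets(xml, "http://com.acme", "hello");
        // The exact values depend on how the parser in use reports locations.
        System.out.println(offsets[0] + " .. " + offsets[1]);
    }
}
```

Matching on namespace URI and local name, not on the prefix, keeps the lookup namespace-aware even if the sender picks a different prefix.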

Re: How to grab all content as-is within an element?

Thank you for the quick response Tatu.

To answer your question, I would like to process the contents between those tags as text, as though nothing were parsed at all. To get up to that tag however, something like woodstox is perfect.

I had looked into the getCharacterOffset() approach. I was guessing that I would then have to process the stream of data twice, however: once to get the offsets, and a second time to actually pick out the section I was interested in using those offsets. Is that correct?

I just took a peek at VTD-XML. It appears to be licensed under the GPL, which may cause some problems for us. Do you have any pointers to other stream-based parsers that might do what I am looking for? While I could do something simple with strings and regular expressions, I was hoping to use a library that is namespace-aware.

Thanks again for your help!

Best,
JM

----- Original Message ----
From: Tatu Saloranta <tsaloranta@...>
To: user@...
Sent: Monday, May 11, 2009 2:30:51 PM
Subject: Re: [woodstox-user] How to grab all content as-is within an element?
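The second pass of the two-pass idea asked about in this message could look something like the following sketch. It assumes the first pass has already produced character offsets (not byte offsets) that bracket the wanted section, and that the input can be opened a second time as a Reader. The names SliceReader and readSlice are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: second pass, copying out one range of characters. */
class SliceReader {

    /** Reads the characters in [start, end) from the reader, discarding everything before start. */
    static String readSlice(Reader in, long start, long end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid range " + start + ".." + end);
        }
        try {
            // Skip the part of the document that precedes the section.
            long toSkip = start;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    // skip() made no progress: read one character to detect end of input.
                    if (in.read() < 0) {
                        throw new IllegalArgumentException("start offset is past the end of input");
                    }
                    skipped = 1;
                }
                toSkip -= skipped;
            }
            // Copy exactly (end - start) characters.
            StringBuilder out = new StringBuilder();
            char[] buf = new char[4096];
            long remaining = end - start;
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    throw new IllegalArgumentException("end offset is past the end of input");
                }
                out.append(buf, 0, n);
                remaining -= n;
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><hello>hi</hello></a>";
        // Offsets are computed by hand here; in practice they would come from the first pass.
        int start = xml.indexOf("<hello>");
        int end = xml.indexOf("</hello>") + "</hello>".length();
        System.out.println(readSlice(new StringReader(xml), start, end));
    }
}
```

If the source is a byte stream, the Reader used here has to decode it with the same character encoding the parser used in the first pass, otherwise the offsets will not line up.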

Re: How to grab all content as-is within an element?

On Mon, May 11, 2009 at 12:20 PM, Jon Miller <jhmiller001@...> wrote:
> Thank you for the quick response Tatu.

You are welcome!

> To answer your question, I would like to process the contents between those tags as text, as though nothing were parsed at all.

Sounds a bit like how the "<script>" tag is handled in HTML.

> To get up to that tag however, something like woodstox is perfect.

:-)

> I had looked into the getCharacterOffset() approach. I was guessing that I would then have to process the stream of data twice however, once to get the offsets and the second time to actually pick out the section I was interested in using those offsets? Is that correct?

Yes, that is correct. It gets a bit tricky, mostly because although in theory it should be possible to buffer only the parts you need (that is, enable additional buffering on START_ELEMENT and disable it on END_ELEMENT), it is hard to figure out the exact location and reconcile the offsets.

If you have an underlying Reader, you could keep track of the input buffer using an intermediate Reader. It is quite easy to track the number of characters that preceding buffers contained (add the number of characters in the current buffer to the total before reading more). When START_ELEMENT is returned, the start of the child content is known to be somewhere within the current buffer (or right after it), so you could enable collection at that point. After this you could skip all events up to the matching END_ELEMENT; at that point you have all content up to and including the end tag, and just need to remove the end tag as per its offset.

It is not very simple, but doable. The content would be fully parsed, but if it is well-formed xml that wouldn't be a problem.

> I just took a peek at VTD-XML. It appears that they are licensed as GPL which may cause some problems for us. Do you have any pointers to other stream based parsers that might do what I am looking for?

(plus, VTD-XML is not a streaming parser either)

I think XmlPull implementations (xpp3, kxml) actually provide some support for getting the underlying textual representation as is. I remember there being a property in the XmlPull API, so implementations that support it (support is optional) would work. So perhaps check out xpp3?

> While I could do something simple with strings and regular expressions I was hoping to use a library that is namespace aware.

Yes, makes sense.

-+ Tatu +-
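The intermediate Reader described in this reply might be sketched as follows. This is a simplification under stated assumptions, not the exact scheme from the message: instead of switching collection on at START_ELEMENT, it retains everything the consumer has read and lets the caller drop text it no longer needs, so a start tag that falls in an earlier buffer is still available. The class name RecordingReader and its methods charsRead, discardBefore and slice are hypothetical. The offsets passed to slice are assumed to be absolute character offsets of the kind a parser's Location reports.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical pass-through Reader that remembers what it has handed out, by absolute offset. */
class RecordingReader extends FilterReader {
    private final StringBuilder window = new StringBuilder(); // retained characters
    private long windowStart = 0; // absolute offset of window.charAt(0)

    RecordingReader(Reader in) {
        super(in);
    }

    /** Total number of characters handed to the consumer so far. */
    long charsRead() {
        return windowStart + window.length();
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c >= 0) {
            window.append((char) c);
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            window.append(buf, off, n);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        if (n < 0) {
            throw new IllegalArgumentException("skip value is negative");
        }
        // Route skips through read() so skipped characters are still counted and retained.
        char[] tmp = new char[(int) Math.min(n, 1024)];
        long done = 0;
        while (done < n) {
            int r = read(tmp, 0, (int) Math.min(tmp.length, n - done));
            if (r < 0) {
                break;
            }
            done += r;
        }
        return done;
    }

    @Override
    public boolean markSupported() {
        return false; // rewinding via mark/reset would invalidate the offsets
    }

    /** Drops retained characters before the given absolute offset, to bound memory use. */
    void discardBefore(long offset) {
        long n = Math.min(offset, charsRead()) - windowStart;
        if (n > 0) {
            window.delete(0, (int) n);
            windowStart += n;
        }
    }

    /** Returns the retained characters in [start, end), using absolute offsets. */
    String slice(long start, long end) {
        if (start < windowStart || end > charsRead() || start > end) {
            throw new IllegalArgumentException("range is not inside the retained window");
        }
        return window.substring((int) (start - windowStart), (int) (end - windowStart));
    }

    public static void main(String[] args) throws IOException {
        String xml = "<a><hello>hi</hello></a>";
        RecordingReader rec = new RecordingReader(new StringReader(xml));
        char[] buf = new char[5]; // small buffer, standing in for the parser's own reads
        while (rec.read(buf, 0, buf.length) >= 0) {
            // a parser would consume buf here
        }
        int start = xml.indexOf("<hello>");
        int end = xml.indexOf("</hello>") + "</hello>".length();
        System.out.println(rec.slice(start, end));
    }
}
```

In real use the parser would be constructed on top of the RecordingReader, discardBefore would be called with the current location while outside the element of interest, and slice would be called once the matching END_ELEMENT has been reached. Because parsers read ahead in blocks, the window usually extends past the current event, which is why the slice is cut by offset and not by "everything read so far".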