"almost well-formed documents"

The Woodstox homepage lists the following benefit:

> There are even many things one can do to support "almost well-formed"
> documents (like legacy (X)HTML content), or to do alternate non-compliant
> processing.

I haven’t been able to find any more information on this particular aspect of Woodstox, and any documentation on this would be very much appreciated. In particular I’m interested in allowing mixed case between start and end tags, as well as unclosed elements such as “<br>”.

Does anyone have any experience with this or know of any documentation that might cover this?

Thanks!

James
Re: "almost well-formed documents"

On Fri, Apr 17, 2009 at 10:02 AM, Kirschner, James <jkirsch@...> wrote:
> The Woodstox homepage lists the following benefit:
>
>> There are even many things one can do to support "almost well-formed"
>> documents (like legacy (X)HTML content), or to do alternate non-compliant
>> processing.
>
> I haven’t been able to find any more information on this particular aspect
> of Woodstox and any documentation on this would be very much appreciated. In
> particular I’m interested in allowing mixed case between start and end tags,
> as well as unclosed elements such as “<br>”.
>
> Does anyone have any experience with this or know of any documentation that
> might cover this?

Hi James! (long time no see)

I probably should remove that wording, since there isn't all that much support really. One thing I do know of is that it is possible to process content that has multiple root elements, or xml declarations ("fragment" and "multi-doc" modes).

There is no support for missing end tags. It might be easy to add support for explicit skipping of end tags, that is, when encountering "<br>", application indicating it does not expect to see matching "</br>". But I don't know how useful such feature would be since there's no end to variations in which xml can be non-well-formed. :)

For real handling of non-xml html, html-to-xml parsers like TagSoup and JTidy are probably better choices, since they can deal with problems like this, and expose content as if it was xml to begin with (adding 'virtual' end tags as need be, recovering from parsing errors etc).

-+ Tatu +-

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email
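To make the "multiple root elements" case above concrete, here is a rough sketch of what such input looks like and one way to read it. This is not Woodstox's fragment mode: it is a plain-JDK workaround that wraps the content in a synthetic root element before parsing. The names `FragmentWrapper`, `topLevelElements`, and the `wrapper` element are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; a stand-in technique, not how Woodstox implements fragment mode.
class FragmentWrapper {

    // Synthetic root name, chosen arbitrarily for this sketch.
    static final String WRAPPER = "wrapper";

    /** Lists the names of the top-level elements of a multi-root fragment. */
    static List<String> topLevelElements(String fragment) {
        // Give the parser the single root element that XML requires.
        String wrapped = "<" + WRAPPER + ">" + fragment + "</" + WRAPPER + ">";
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(wrapped));
            int depth = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    // Depth 1 is the synthetic wrapper; depth 2 is an original root.
                    if (depth == 2) {
                        names.add(reader.getLocalName());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("fragment is not well-formed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(topLevelElements("<a>1</a><b><c/></b>"));
    }
}
```

The wrapping trick only covers multiple roots. It breaks if the content carries its own XML declarations (the "multi-doc" case), since a declaration is not allowed inside an element, and it does nothing for missing end tags.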
RE: "almost well-formed documents"

Thanks Tatu! As always, you've been very helpful :) I'll take a look at TagSoup to see what it can bring to the table.

-James

-----Original Message-----
From: Tatu Saloranta [mailto:tsaloranta@...]
Sent: Friday, April 17, 2009 2:37 PM
To: user@...
Subject: Re: [woodstox-user] "almost well-formed documents"