"almost well-formed documents"

"almost well-formed documents"

by Kirschner, James

The Woodstox homepage lists the following benefit:

> There are even many things one can do to support "almost well-formed" documents (like legacy (X)HTML content), or to do alternate non-compliant processing.

I haven’t been able to find any more information on this particular aspect of Woodstox and any documentation on this would be very much appreciated. In particular I’m interested in allowing mixed case between start and end tags, as well as unclosed elements such as “<br>”.

Does anyone have any experience with this or know of any documentation that might cover this?

Thanks!

James


Re: "almost well-formed documents"

by Cowtowncoder

On Fri, Apr 17, 2009 at 10:02 AM, Kirschner, James <jkirsch@...> wrote:

> The Woodstox homepage lists the following benefit:
>
>> There are even many things one can do to support "almost well-formed"
>> documents (like legacy (X)HTML content), or to do alternate non-compliant
>> processing.
>
> I haven’t been able to find any more information on this particular aspect
> of Woodstox and any documentation on this would be very much appreciated. In
> particular I’m interested in allowing mixed case between start and end tags,
> as well as unclosed elements such as “<br>”.
>
> Does anyone have any experience with this or know of any documentation that
> might cover this?

Hi James! (long time no see)

I probably should remove that wording, since there isn't really all that
much support. One thing I do know of is that it is possible to
process content that has multiple root elements, or multiple XML
declarations ("fragment" and "multi-doc" modes, respectively).
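
For readers without access to those modes, the multiple-root case can be approximated with any StAX parser by wrapping the content in a synthetic root element before parsing. The sketch below does that with the JDK's javax.xml.stream API; it is a workaround, not Woodstox's fragment mode itself (which, as far as I know, is selected through the WstxInputProperties.P_INPUT_PARSING_MODE factory property). The class name, method name, and wrapper element name are made up for illustration, and the trick does not work if the content carries its own XML declaration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class FragmentSketch {
    // Hypothetical wrapper name; pick one that cannot clash with real content.
    static final String WRAPPER = "synthetic-root";

    /** Returns the local names of the top-level elements of a multi-root fragment. */
    static List<String> topLevelElements(String fragment) {
        // Wrapping turns "<a/><b/>" into a single-root document any parser accepts.
        String wrapped = "<" + WRAPPER + ">" + fragment + "</" + WRAPPER + ">";
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(wrapped));
            int depth = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    // Depth 1 is the synthetic wrapper; depth 2 is an original root.
                    if (depth == 2) {
                        names.add(reader.getLocalName());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("fragment is not well-formed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(topLevelElements("<a>1</a><b/><c><d/></c>")); // prints [a, b, c]
    }
}
```

The application has to remember to ignore the wrapper's own start and end events, which is what the depth check does here.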

There is no support for missing end tags. It might be easy to add
support for explicit skipping of end tags: that is, when encountering
"<br>", the application would indicate that it does not expect to see a
matching "</br>".
But I don't know how useful such a feature would be, since there's no end
to the ways in which XML can be non-well-formed. :)
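
To make the "application says which elements have no end tag" idea concrete, here is a rough sketch done as a text pre-pass in front of a standard StAX parser. This is not a Woodstox feature; the class name, method names, and the list of element names are assumptions made for illustration. The regular expressions are deliberately crude: they do not understand comments, CDATA sections, or ">" inside attribute values, and they do nothing about mixed-case start and end tags.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class VoidElementSketch {
    // Elements the application declares as never having an end tag (illustrative list).
    private static final String VOID_NAMES = "br|hr|img";

    // Start tag of a declared element, with optional attributes and optional "/".
    private static final Pattern VOID_START = Pattern.compile(
            "(?i)<(" + VOID_NAMES + ")((?:\\s[^<>]*?)?)\\s*/?>");

    // Stray end tag of a declared element, e.g. "</br>".
    private static final Pattern VOID_END = Pattern.compile(
            "(?i)</(?:" + VOID_NAMES + ")\\s*>");

    /** Rewrites "<br>" (and friends) to "<br/>" and drops any "</br>". */
    static String selfCloseVoidElements(String markup) {
        String withoutEndTags = VOID_END.matcher(markup).replaceAll("");
        return VOID_START.matcher(withoutEndTags).replaceAll("<$1$2/>");
    }

    /** Parses the repaired markup and counts start-element events. */
    static int countStartElements(String markup) {
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(selfCloseVoidElements(markup)));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("still not well-formed after repair", e);
        }
        return count;
    }

    public static void main(String[] args) {
        String legacy = "<p>a<br>b<BR clear=\"all\"></p>";
        System.out.println(selfCloseVoidElements(legacy)); // prints <p>a<br/>b<BR clear="all"/></p>
        System.out.println(countStartElements(legacy));    // prints 3
    }
}
```

Anything beyond this one narrow case (unclosed "<p>", mismatched case, unquoted attributes) quickly outgrows a pre-pass like this.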

For real handling of non-xml html, html-to-xml parsers like TagSoup
and JTidy are probably better choices, since they can deal with
problems like this, and expose content as if it was xml to begin with
(adding 'virtual' end tags as need be, recovering from parsing errors
etc).

-+ Tatu +-

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email



RE: "almost well-formed documents"

by Kirschner, James

Thanks Tatu! As always, you've been very helpful :) I'll take a look at TagSoup to see what it can bring to the table.

-James

-----Original Message-----
From: Tatu Saloranta [mailto:tsaloranta@...]
Sent: Friday, April 17, 2009 2:37 PM
To: user@...
Subject: Re: [woodstox-user] "almost well-formed documents"
