html parsing incomplete - bug?

by Lydia Patrovic

Hello,

I have tried parsing a webpage, but unfortunately the node /html/body is not found.
I used lxml in Python, which is based on libxml2.

Firefox parses the page correctly, and if the page is then saved to disk from Firefox, lxml parses the saved copy correctly.
If the page is fetched via urllib instead of Firefox, parsing fails.
The HTML source is attached as a zipped text file.

Thank you for taking the time, any help is appreciated.

Lydia Patrovic

N.B.:
This is an answer from the lxml mailing list with a diagnosis:

I get the same result with "xmllint --html", so it's definitely a libxml2
problem. It seems to read all  tags and then just stops parsing
without further notice. The next tag would be the <script> tag, and I
actually suspect this to be a problem:



Note the "main&20090924_2" attribute value, which can be interpreted as an
unterminated entity.

Please report this on the libxml2 mailing list.

Stefan




_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@...
http://mail.gnome.org/mailman/listinfo/xml

Attachment: sccmain.zip (5K)

Re: html parsing incomplete - bug?

by Stefan Behnel-3


Lydia Patrovic wrote:
> Note the "main&20090924_2" attribute value, which can be interpreted as an
> unterminated entity.

:) Nice little Freudian copy&paste quoting error. Here's the line from the
real 'HTML' file:

<script type="text/javascript" src="merge.php?f=main&20090924_2"></script>

Note the unescaped '&' character in the URL.
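
For illustration only, here is a minimal Python sketch of what an escaped version of that attribute value would look like. It uses the standard library's html.escape; the variable names are made up for this example and do not come from the page or from lxml.

```python
from html import escape

# The raw URL as it appears in the page source.
raw_url = "merge.php?f=main&20090924_2"

# In an HTML attribute value, a literal '&' is written as '&amp;'.
escaped_url = escape(raw_url, quote=True)

# Build the tag with the escaped attribute value.
script_tag = '<script type="text/javascript" src="%s"></script>' % escaped_url
print(script_tag)
```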

Stefan

Re: html parsing incomplete - bug?

by Martin (gzlist)

On 13/10/2009, Stefan Behnel <stefan_ml@...> wrote:

>
> Lydia Patrovic wrote:
>> Note the "main&20090924_2" attribute value, which can be interpreted
>> as an
>> unterminated entity.
>
> :) Nice little Freudian copy&paste quoting error. Here's the line from the
> real 'HTML' file:
>
> <script type="text/javascript" src="merge.php?f=main&20090924_2"></script>
>
> Note the unescaped '&' character in the URL.

I'd have thought the embedded null at byte 532 would be the cause. Try
bytes.replace("\x00", "") before treating it as a C string. With that,
the document parses pretty much as expected for me.
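
A minimal sketch of that workaround in Python 3, where the page content is a bytes object. The helper names (find_nul_offsets, strip_nuls, parse_html) and the sample document are hypothetical; only lxml.html.fromstring is an actual lxml call, and it is imported lazily so the helpers can be tried without lxml installed.

```python
def find_nul_offsets(data: bytes) -> list:
    """Return the byte offset of every NUL byte in data."""
    return [i for i, b in enumerate(data) if b == 0]


def strip_nuls(data: bytes) -> bytes:
    """Drop all NUL bytes so a C-level parser does not stop at the first one."""
    return data.replace(b"\x00", b"")


def parse_html(data: bytes):
    """Parse data with lxml after removing NUL bytes (requires lxml)."""
    import lxml.html  # imported here so the helpers above work without lxml
    return lxml.html.fromstring(strip_nuls(data))


# A made-up document with one embedded NUL byte, standing in for the real page.
sample = b"<html><head><title>t</title>\x00</head><body><p>hi</p></body></html>"

print(find_nul_offsets(sample))
cleaned = strip_nuls(sample)
```

Checking the offsets first shows whether a page is affected at all before the bytes are modified.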

Martin

Re: html parsing incomplete - bug?

by Stefan Behnel-3


Martin (gzlist) wrote:

> On 13/10/2009, Stefan Behnel <stefan_ml@...> wrote:
>> Lydia Patrovic wrote:
>>> Note the "main&20090924_2" attribute value, which can be interpreted
>>> as an
>>> unterminated entity.
>> :) Nice little Freudian copy&paste quoting error. Here's the line from the
>> real 'HTML' file:
>>
>> <script type="text/javascript" src="merge.php?f=main&20090924_2"></script>
>>
>> Note the unescaped '&' character in the URL.
>
> I'd have thought the embedded null at byte 532 would be the cause. Try
> bytes.replace("\x00", "") before treating it as a c string. Seems to
> get the document parsed pretty much as expected for me.

Interesting. Sounds totally like the right solution.

I wonder why the parser stops parsing here, though. Is '\0' explicitly
considered an invalid character in (broken) HTML, or is it really just the
usual C EOS slip?

Stefan

Re: html parsing incomplete - bug?

by Martin (gzlist)

On 13/10/2009, Stefan Behnel <stefan_ml@...> wrote:
>
> I wonder why the parser stops parsing here, though. Is '\0' explicitly
> considered an invalid character in (broken) HTML, or is it really just the
> usual C EOS slip?

It's certainly invalid, though could be recoverable.

In the various html versions: HTML 4 defers to the SGML spec which I'm
not rich enough to consult, XHTML 1 defers to XML which we all know
says nulls are verboten, and the current HTML 5 draft is pretty clear:

<http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>

"All U+0000 NULL characters in the input must be replaced by U+FFFD
REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
error."

(this is all in the context of a decoded-to-Unicode stream, not raw
UTF-16 etc.)
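
As a rough sketch of the quoted rule, applied to an already decoded string in Python: the function name preprocess_input_stream and the sample input are assumptions for illustration, and this is not what libxml2 does.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def preprocess_input_stream(text: str):
    """Replace every U+0000 with U+FFFD.

    Returns the cleaned text and the positions of the replaced characters,
    each of which the draft counts as a parse error.
    """
    positions = [i for i, ch in enumerate(text) if ch == "\x00"]
    return text.replace("\x00", REPLACEMENT), positions


# Decode first: the rule is about Unicode characters, not raw bytes.
raw = b"<p>a\x00b</p>"
decoded = raw.decode("utf-8")

cleaned, errors = preprocess_input_stream(decoded)
print(repr(cleaned), errors)
```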

Martin

Re: html parsing incomplete - bug?

by Daniel Veillard

On Tue, Oct 13, 2009 at 01:22:12PM +0100, Martin (gzlist) wrote:

> On 13/10/2009, Stefan Behnel <stefan_ml@...> wrote:
> >
> > I wonder why the parser stops parsing here, though. Is '\0' explicitly
> > considered an invalid character in (broken) HTML, or is it really just the
> > usual C EOS slip?
>
> It's certainly invalid, though could be recoverable.
>
> In the various html versions: HTML 4 defers to the SGML spec which I'm
> not rich enough to consult, XHTML 1 defers to XML which we all know
> says nulls are verboten, and the current HTML 5 draft is pretty clear:
>
> <http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>
>
> "All U+0000 NULL characters in the input must be replaced by U+FFFD
> REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
> error."
>
> (this is all in the context of an decoded-to-unicode stream, not raw
> UTF-16 etc.)

  Once HTML5 becomes a Last Call draft or similar, I think it will make
sense to update the parser to use the same recovery tricks.
  Note that the 0 in the content may have cut the input at the Python->C
interface layer. But libxml2 internals certainly don't like 0 in content.
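
To illustrate the general mechanism only (this says nothing about how lxml itself hands data to libxml2), here is a small ctypes sketch of how a NUL-terminated C string view loses everything after the first 0 byte, while a buffer read with an explicit length keeps it. The sample data is made up.

```python
import ctypes

# Made-up input with an embedded NUL byte after the opening tag.
data = b"<html>\x00<body></body></html>"

# Read back as a NUL-terminated C string: everything after the 0 is lost.
as_c_string = ctypes.c_char_p(data).value

# Copy into a buffer and read it back with an explicit length: nothing is lost.
buf = ctypes.create_string_buffer(data)
full_copy = buf.raw[:len(data)]

print(len(data), len(as_c_string), len(full_copy))
```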

Daniel

--
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel@...  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

Re: html parsing incomplete - bug?

by Stefan Behnel-3


Daniel Veillard wrote:

> On Tue, Oct 13, 2009 at 01:22:12PM +0100, Martin (gzlist) wrote:
>> On 13/10/2009, Stefan Behnel wrote:
>>> I wonder why the parser stops parsing here, though. Is '\0' explicitly
>>> considered an invalid character in (broken) HTML, or is it really just the
>>> usual C EOS slip?
>> It's certainly invalid, though could be recoverable.
>>
>> In the various html versions: HTML 4 defers to the SGML spec which I'm
>> not rich enough to consult, XHTML 1 defers to XML which we all know
>> says nulls are verboten, and the current HTML 5 draft is pretty clear:
>>
>> <http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>
>>
>> "All U+0000 NULL characters in the input must be replaced by U+FFFD
>> REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
>> error."
>>
>> (this is all in the context of an decoded-to-unicode stream, not raw
>> UTF-16 etc.)
>
>   When HTML5 will become a Last Call draft or something then I think it
> will make sense to try to update the parser to use the same recovery
> tricks.

In any case, the parser should either apply the above replacement rule or
report an error when encountering a '\0' byte in the input stream.
Currently, it just silently terminates.


> Note that the 0 in content may have cut the input at the Python->C
> interface layer. But sure libxml2 internals don't like 0 in content.

We also pass UCS4 encoded data through the same code, so, no, that's not an
issue here.

Stefan