Re: html parsing incomplete - bug?

On 13/10/2009, Stefan Behnel <stefan_ml@...> wrote:
>
> Lydia Patrovic wrote:
>> Note the "main&20090924_2" attribute value, which can be interpreted
>> as an unterminated entity.
>
> :) Nice little Freudian copy&paste quoting error. Here's the line from the
> real 'HTML' file:
>
> <script type="text/javascript" src="merge.php?f=main&20090924_2"></script>
>
> Note the unescaped '&' character in the URL.

I'd have thought the embedded null at byte 532 would be the cause. Try
bytes.replace("\x00", "") before treating it as a c string. Seems to
get the document parsed pretty much as expected for me.

Martin
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml@...
http://mail.gnome.org/mailman/listinfo/xml
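The workaround suggested above could be sketched in Python 3 roughly as follows. This is only an illustration under stated assumptions: the helper names `first_nul_offset` and `strip_nul_bytes` and the sample markup are made up for the example and are not the code or the document from the thread. It locates the first NUL byte and removes all of them before the data would be handed to a parser.

```python
def first_nul_offset(data: bytes) -> int:
    """Return the index of the first NUL byte, or -1 if there is none."""
    return data.find(b"\x00")


def strip_nul_bytes(data: bytes) -> bytes:
    """Return a copy of data with every NUL byte removed."""
    return data.replace(b"\x00", b"")


# Hypothetical sample input, not the file discussed in the thread.
raw = b"<p>before\x00after</p>"
offset = first_nul_offset(raw)
cleaned = strip_nul_bytes(raw)
print(offset, cleaned)
```

Dropping zero bytes blindly is only reasonable for encodings in which a zero byte is never part of another character, such as UTF-8 or Latin-1. In UTF-16 or UCS4 data, zero bytes are a normal part of most characters.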

Re: html parsing incomplete - bug?

Martin (gzlist) wrote:
> On 13/10/2009, Stefan Behnel <stefan_ml@...> wrote:
>> Lydia Patrovic wrote:
>>> Note the "main&20090924_2" attribute value, which can be interpreted
>>> as an unterminated entity.
>> :) Nice little Freudian copy&paste quoting error. Here's the line from the
>> real 'HTML' file:
>>
>> <script type="text/javascript" src="merge.php?f=main&20090924_2"></script>
>>
>> Note the unescaped '&' character in the URL.
>
> I'd have thought the embedded null at byte 532 would be the cause. Try
> bytes.replace("\x00", "") before treating it as a c string. Seems to
> get the document parsed pretty much as expected for me.

Interesting. Sounds totally like the right solution.

I wonder why the parser stops parsing here, though. Is '\0' explicitly
considered an invalid character in (broken) HTML, or is it really just the
usual C EOS slip?

Stefan
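For readers unfamiliar with the "C EOS slip" mentioned above, here is a small standard-library illustration of the generic C convention that a string ends at its first zero byte. It says nothing about what libxml2 does internally, which is exactly the open question in this message; the sample input is hypothetical.

```python
import ctypes

# Hypothetical sample input with one embedded NUL byte.
data = b"<p>before\x00after</p>"

# Copy the bytes into a C char array (ctypes appends a trailing NUL).
buf = ctypes.create_string_buffer(data)

# .value reads the buffer as a NUL-terminated C string: it stops at the first NUL.
as_c_string = buf.value

# .raw exposes the whole memory block, so nothing after the NUL is lost.
full_bytes = buf.raw[:len(data)]

print(len(as_c_string), len(full_bytes))
```

Any code path that relies on the terminator rather than on an explicit length sees only the part before the embedded NUL.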

Re: html parsing incomplete - bug?

On 13/10/2009, Stefan Behnel <stefan_ml@...> wrote:
>
> I wonder why the parser stops parsing here, though. Is '\0' explicitly
> considered an invalid character in (broken) HTML, or is it really just the
> usual C EOS slip?

It's certainly invalid, though could be recoverable.

In the various html versions: HTML 4 defers to the SGML spec which I'm
not rich enough to consult, XHTML 1 defers to XML which we all know
says nulls are verboten, and the current HTML 5 draft is pretty clear:

<http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>

"All U+0000 NULL characters in the input must be replaced by U+FFFD
REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
error."

(this is all in the context of a decoded-to-unicode stream, not raw
UTF-16 etc.)

Martin
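The rule quoted from the HTML 5 draft could be sketched like this on an already decoded string. The function name `replace_nulls` and the sample text are assumptions made for the example; counting the replacements stands in for "each occurrence is a parse error" and is not how any particular parser reports errors.

```python
from typing import Tuple

REPLACEMENT_CHARACTER = "\ufffd"  # U+FFFD


def replace_nulls(text: str) -> Tuple[str, int]:
    """Replace each U+0000 with U+FFFD and return (new_text, error_count)."""
    error_count = text.count("\x00")
    return text.replace("\x00", REPLACEMENT_CHARACTER), error_count


# Hypothetical input: decode first, then apply the rule to the unicode text.
decoded = b"<p>before\x00after</p>".decode("utf-8")
fixed, errors = replace_nulls(decoded)
print(errors)
```

Working on the decoded text matches the remark above that the rule applies to a decoded-to-unicode stream rather than to raw encoded bytes.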

Re: html parsing incomplete - bug?

On Tue, Oct 13, 2009 at 01:22:12PM +0100, Martin (gzlist) wrote:
> On 13/10/2009, Stefan Behnel <stefan_ml@...> wrote:
> >
> > I wonder why the parser stops parsing here, though. Is '\0' explicitly
> > considered an invalid character in (broken) HTML, or is it really just the
> > usual C EOS slip?
>
> It's certainly invalid, though could be recoverable.
>
> In the various html versions: HTML 4 defers to the SGML spec which I'm
> not rich enough to consult, XHTML 1 defers to XML which we all know
> says nulls are verboten, and the current HTML 5 draft is pretty clear:
>
> <http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>
>
> "All U+0000 NULL characters in the input must be replaced by U+FFFD
> REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
> error."
>
> (this is all in the context of a decoded-to-unicode stream, not raw
> UTF-16 etc.)

When HTML5 becomes a Last Call draft or something, then I think it
will make sense to try to update the parser to use the same recovery
tricks.

Note that the 0 in content may have cut the input at the Python->C
interface layer. But for sure libxml2 internals don't like 0 in content.

Daniel

--
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel@...           | Rpmfind RPM search engine  http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

Re: html parsing incomplete - bug?

Daniel Veillard wrote:
> On Tue, Oct 13, 2009 at 01:22:12PM +0100, Martin (gzlist) wrote:
>> On 13/10/2009, Stefan Behnel wrote:
>>> I wonder why the parser stops parsing here, though. Is '\0' explicitly
>>> considered an invalid character in (broken) HTML, or is it really just the
>>> usual C EOS slip?
>> It's certainly invalid, though could be recoverable.
>>
>> In the various html versions: HTML 4 defers to the SGML spec which I'm
>> not rich enough to consult, XHTML 1 defers to XML which we all know
>> says nulls are verboten, and the current HTML 5 draft is pretty clear:
>>
>> <http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>
>>
>> "All U+0000 NULL characters in the input must be replaced by U+FFFD
>> REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
>> error."
>>
>> (this is all in the context of a decoded-to-unicode stream, not raw
>> UTF-16 etc.)
>
> When HTML5 will become a Last Call draft or something then I think it
> will make sense to try to update the parser to use the same recovery
> tricks.

In any case, the parser should either apply the above replacement rule or
report an error when encountering a '\0' byte in the input stream.
Currently, it just silently terminates.

> Note that the 0 in content may have cut the input at the Python->C
> interface layer. But sure libxml2 internals don't like 0 in content.

We also pass UCS4 encoded data through the same code, so, no, that's not an
issue here.

Stefan
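The "report an error" alternative could look roughly like the following caller-side check. It is a sketch only: `EmbeddedNulError` and `check_no_nul` are hypothetical names, not part of libxml2 or any binding, and the check assumes 8-bit input such as UTF-8 (in UCS4 data, zero bytes are ordinary and this test would not apply).

```python
class EmbeddedNulError(ValueError):
    """Raised when 8-bit input contains a NUL byte (hypothetical error type)."""


def check_no_nul(data: bytes) -> bytes:
    """Return data unchanged, or raise with the offset of the first NUL byte."""
    offset = data.find(b"\x00")
    if offset != -1:
        raise EmbeddedNulError(f"NUL byte at offset {offset}")
    return data


# Hypothetical usage: fail loudly instead of getting a silently truncated tree.
try:
    check_no_nul(b"<p>a\x00b</p>")
except EmbeddedNulError as exc:
    print("refusing to parse:", exc)
```

Failing with an offset gives the caller the same information as the "embedded null at byte 532" observation earlier in the thread, without having to inspect the file by hand.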