pdftohtml produces invalid XML
Hi!
I started using pdftohtml from Debian's poppler-utils package for document analysis and ran across a problem: `pdftohtml -xml' can produce invalid XML on output (at least invalid for Python's XML tools).

Test case:

    # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
      pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
      python -c 'from xml.parsers.expat import ParserCreate;
      ParserCreate().ParseFile(open("x.xml"))'
    Page-1
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45, column 63

The problematic character is \x11.

I'm running version 0.12 of pdftohtml, installed from the Debian poppler-utils_0.12.0-2_i386 package.

    pdftohtml -v
    pdftohtml version 0.12.0
    Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
    Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
    Copyright 1996-2004 Glyph & Cog, LLC

How can I work around this problem?

best regards,
Piotr Findeisen
_______________________________________________
poppler mailing list
poppler@...
http://lists.freedesktop.org/mailman/listinfo/poppler
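One possible workaround, offered as a sketch under assumptions rather than something proposed in this thread: remove the characters that XML 1.0 forbids before handing the output to a parser. The helper name `strip_invalid_xml_chars` is hypothetical, the sample string only imitates the kind of <text> element pdftohtml emits, and dropping the characters also discards whatever glyphs they stood for.

```python
import re
from xml.parsers.expat import ParserCreate

# XML 1.0 allows only tab, LF and CR below 0x20, and forbids U+FFFE and
# U+FFFF. Surrogate code points are also forbidden but are not handled here.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_invalid_xml_chars(text):
    """Return text with the control characters XML 1.0 rejects removed."""
    return _INVALID_XML_CHARS.sub("", text)


# Made-up sample imitating the failing output: a <text> element with \x11.
sample = '<text top="632" left="152" font="7">a\x11b</text>'
cleaned = strip_invalid_xml_chars(sample)

# expat rejects the raw sample but accepts the cleaned string.
ParserCreate().Parse(cleaned, True)
```

For a real file, the same function could be applied to the decoded contents of x.xml before parsing. This only makes the document well-formed; it does not recover the formula.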
Re: pdftohtml produces invalid XML
On Tuesday, 3 November 2009, Piotr Findeisen wrote:
> I started using pdftohtml from Debian's poppler-utils package for
> document analysis and ran across a problem that `pdftohtml -xml' can
> produce invalid XML on output (at least invalid for python xml tools).

Can you please post a bug at bugs.freedesktop.org?

> how can i workaround this problem?

You can code a patch or wait until someone fixes it.

Albert
Re: pdftohtml produces invalid XML
2009/11/3 Piotr Findeisen <piotr.findeisen@...>:
> Test case:
>
> # wget -q
> http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
> pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
> python -c 'from xml.parsers.expat import ParserCreate;
> ParserCreate().ParseFile(open("x.xml"))'

I'm not sure what the fix is, but the line with the error is:

    <text top="632" left="152" width="58" height="0" font="7">¥§¦©¨¥§¦¨ ¦</text>

and firefox gives:

    <text top="632" left="152" width="58" height="0" font="7">¥§¦©¨¥§¦¨ ¦</text>
    ---------------------------------------------------------------^

(that is -- it is choking on the [00|11] character; there are also other characters in the latin-1 control character range (c < 0x20)).

This will cause any xml parser to choke, as the characters are invalid. What I don't know is why/how these are appearing in pdftohtml.

Looking at the PDF in okular (which appears to render it correctly) shows a mathematical equation for the faulty lines. Specifically:

    <text top="606" left="101" width="173" height="10" font="6">Digital signal processing basic formula:</text>
    <text top="632" left="101" width="25" height="10" font="6">y(t) =</text>
    <text top="626" left="133" width="0" height="0" font="7"> </text>
    <text top="631" left="133" width="0" height="0" font="7">¡</text>
    <text top="647" left="128" width="0" height="0" font="7">¢</text>
    <text top="646" left="134" width="11" height="0" font="7"> ¤£</text>
    <text top="632" left="152" width="58" height="0" font="7">¥§¦©¨¥§¦¨ ¦</text>

should be (in the proper math layout for this formula):

    y(t) = integral [above: inf, below: -inf] h(u)x(t - u)du

where the h(u)x(t - u)du bit is in the stylised script used in maths.
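Written out in LaTeX, the formula described above (with the limits as given: inf above the integral sign, -inf below) would read as follows. This is a transcription of that description, not text recovered from the PDF:

```latex
y(t) = \int_{-\infty}^{\infty} h(u)\, x(t - u)\, du
```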

My initial thought is that the characters are referencing the Unicode codepoints (e.g. in the U+2100 range). However, these all appear to be in the ascii range (i.e. not multi-byte UTF-8 as the encoding suggests, but I may be wrong as there look to be more characters than what is displayed).

Instead, they look like they are codepoints into a special mathematical font (e.g. Symbol(?) in Windows (I don't have a Windows box to hand at the moment, so can't verify the font name)). This would make sense given the font="7" attribute and the seemingly random characters. And given the greater number of characters, this looks to be using a non-UTF-8 multi-byte encoding.

Someone will need to dig around in the pdftohtml code and the rendering of non-ascii characters.

HTH,
- Reece
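To check whether the odd characters are single control bytes or multi-byte sequences, one could list the code points of a <text> element's content. A minimal sketch; `is_xml10_char` and `describe` are hypothetical helper names, and the sample string is made up rather than copied from x.xml:

```python
def is_xml10_char(ch):
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def describe(text):
    """Return (code point, UTF-8 byte length, XML-valid?) per character."""
    return [
        ("U+%04X" % ord(ch), len(ch.encode("utf-8")), is_xml10_char(ch))
        for ch in text
    ]


# Made-up sample: two Latin-1 symbols like those in the output, plus \x11.
for row in describe("\u00a5\u00a7\x11"):
    print(row)
# ('U+00A5', 2, True)
# ('U+00A7', 2, True)
# ('U+0011', 1, False)
```

In this sample the visible symbols are valid two-byte UTF-8 sequences, while the control character is a single byte and is the only one XML 1.0 rejects.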
Re: pdftohtml produces invalid XML
2009/11/3 Reece Dunn <msclrhd@...>:
> Someone will need to dig around in the pdftohtml code and the
> rendering of non-ascii characters.

As a follow-up...

Not using the -xml option of pdftohtml causes it to write an HTML file that is similarly mangled w.r.t. the characters in the formula (from the integral to the du differential component).

In addition to this, the layout does not match the formula for the integral (not sure whether the ¢ is meant to be the integral sign or not; if it is supposed to be the infinity sign, it should be above the integral, not below it) and the font size is not consistent with the "y(t) = " part. These rendering issues are obviously orthogonal to the encoding issue.

- Reece
Re: pdftohtml produces invalid XML
Hello!
On 03.11.2009 23:38, Reece Dunn wrote:
> My initial thought is that the characters are referencing the Unicode
> codepoints (e.g. in the U+2100 range). However, these all appear to be
> in the ascii range (i.e. not multi-byte UTF-8 as the encoding
> suggests, but I may be wrong as there look to be more characters than
> what is displayed).

Right. 0x11 is the first one to cause a problem with the Python XML parser. These problematic characters are all ASCII control characters.

> Instead, they look like they are codepoints into a special
> mathematical font (e.g. Symbol(?) in Windows (I don't have a Windows
> box to hand at the moment, so can't verify the font name)). This would
> make sense given the font="7" attribute and the seemingly random
> characters. And given the greater number of characters, this looks to
> be using a non-UTF-8 multi-byte encoding.

The font="7" attribute is generated by "pdftohtml -xml" and it's a reference to the <font id="7" ...... /> element near the top of the produced XML document.

And yes, there is some font mapping involved. I tried writing the equation in a new .tex document, but the produced PDF contained only characters I know & read, no matter how I produced the PDF (pdflatex, latex & dvipdf, etc.).

> Someone will need to dig around in the pdftohtml code and the
> rendering of non-ascii characters.

I agree this is where the problem begins, though I've never seen pdftohtml's source...

best regards,
Piotr
Re: pdftohtml produces invalid XML
On 03.11.2009 22:43, Albert Astals Cid wrote:
> Can you please post a bug at bugs.freedesktop.org?

Here it is:
https://bugs.freedesktop.org/show_bug.cgi?id=24890

thanks,
Piotr