pdftohtml produces invalid XML


pdftohtml produces invalid XML

by Piotr Findeisen

Hi!

I started using pdftohtml from Debian's poppler-utils package for document analysis and ran across a problem: `pdftohtml -xml' can produce invalid XML on output (at least invalid for Python's XML tools).

Test case:
# wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
    pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
    python -c 'from xml.parsers.expat import ParserCreate; ParserCreate().ParseFile(open("x.xml"))'
Page-1
Traceback (most recent call last):
  File "<string>", line 2, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45, column 63
The problematic character is \x11.

I'm running version 0.12 of pdftohtml, installed from Debian poppler-utils_0.12.0-2_i386 package.
pdftohtml -v
pdftohtml version 0.12.0
Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2004 Glyph & Cog, LLC
  

How can I work around this problem?
best regards,
Piotr Findeisen
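
One possible stopgap for the question above, sketched under assumptions rather than taken from the thread: post-process the pdftohtml output and drop every character that XML 1.0 does not allow (such as \x11) before handing the text to the parser. The helper names `sanitize_xml_text` and `parse_sanitized` are hypothetical, and dropping characters loses whatever the PDF font meant them to represent, so this only makes the file parseable; it does not fix the conversion.

```python
import re
from xml.parsers.expat import ParserCreate

# Everything outside the XML 1.0 "Char" production: control characters
# other than tab/LF/CR, surrogates, U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_xml_text(text: str, replacement: str = "") -> str:
    """Return text with XML-invalid characters replaced (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def parse_sanitized(path: str, encoding: str = "utf-8") -> None:
    """Read a pdftohtml -xml output file, clean it, and check that expat accepts it."""
    with open(path, encoding=encoding, errors="replace") as handle:
        cleaned = sanitize_xml_text(handle.read())
    # Raises xml.parsers.expat.ExpatError if the cleaned text is still not well-formed.
    ParserCreate().Parse(cleaned, True)
```

With the file from the test case, this would be called as `parse_sanitized("x.xml")`. Passing a visible `replacement` such as "?" instead of the empty string keeps a marker where characters were removed.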


_______________________________________________
poppler mailing list
poppler@...
http://lists.freedesktop.org/mailman/listinfo/poppler


Re: pdftohtml produces invalid XML

by Albert Astals Cid-2

On Tuesday, 3 November 2009, Piotr Findeisen wrote:

> Hi!
>
> I started using pdftohtml from Debian's poppler-utils package for
> document analysis and ran across a problem that `pdftohtml -xml' can
> produce invalid XML on output (at least invalid for python xml tools).
>
> Test case:
>
>     # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
>         pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
>         python -c 'from xml.parsers.expat import ParserCreate; ParserCreate().ParseFile(open("x.xml"))'
>     Page-1
>     Traceback (most recent call last):
>       File "<string>", line 2, in <module>
>     xml.parsers.expat.ExpatError: not well-formed (invalid token): line 45, column 63
>
> the problematic character is \x11
>
> I'm running version 0.12 of pdftohtml, installed from Debian
> poppler-utils_0.12.0-2_i386 package.
>
>     pdftohtml -v
>     pdftohtml version 0.12.0
>     Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
>     Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
>     Copyright 1996-2004 Glyph & Cog, LLC

Can you please post a bug at bugs.freedesktop.org?

> How can I work around this problem?

You can code a patch or wait until someone fixes it.

Albert

> best regards,
> Piotr Findeisen
>


Re: pdftohtml produces invalid XML

by Reece Dunn-2

2009/11/3 Piotr Findeisen <piotr.findeisen@...>:

> Hi!
>
> I started using pdftohtml from Debian's poppler-utils package for document
> analysis and ran across a problem that `pdftohtml -xml' can produce invalid
> XML on output (at least invalid for python xml tools).
>
> Test case:
>
> # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
>     pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
>     python -c 'from xml.parsers.expat import ParserCreate; ParserCreate().ParseFile(open("x.xml"))'

I'm not sure what the fix is, but the line with the error is:
    <text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨¥§¦¨ ¦</text>
and firefox gives:
    <text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨¥§¦¨ ¦</text>
    ---------------------------------------------------------------^
(that is -- it is choking on the [00|11] character; there are also
other characters in the latin-1 control character range (c < 0x20)).

This will cause any xml parser to choke, as the characters are
invalid. What I don't know is why/how these are appearing in
pdftohtml.
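
To illustrate the point about invalid characters: XML 1.0 allows only tab (0x09), line feed (0x0A) and carriage return (0x0D) below 0x20, so any other control character makes the document not well-formed. A minimal sketch, assuming the output has already been read into a string, that lists every such character with its position (the function name `find_invalid_xml_chars` is made up for this example):

```python
import re

# Characters that fall outside the XML 1.0 "Char" production.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text: str):
    """Return (line, column, code point) for each character XML 1.0 forbids.

    Lines are 1-based and columns 0-based, counted in characters.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in _INVALID_XML_CHARS.finditer(line):
            hits.append((lineno, match.start(), ord(match.group())))
    return hits
```

Running this over the contents of x.xml should list the 0x11 character along with the other control characters in the formula lines, without depending on where a particular parser stops.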

Looking at the PDF in okular (which appears to render it correctly)
shows a mathematical equation where the faulty lines are.
Specifically:

<text top="606" left="101" width="173" height="10" font="6">Digital
signal processing basic formula:</text>
<text top="632" left="101" width="25" height="10" font="6">y(t) =</text>
<text top="626" left="133" width="0" height="0" font="7"> </text>
<text top="631" left="133" width="0" height="0" font="7">¡</text>
<text top="647" left="128" width="0" height="0" font="7">¢</text>
<text top="646" left="134" width="11" height="0" font="7"> ¤£</text>
<text top="632" left="152" width="58" height="0"
font="7">¥§¦©¨¥§¦¨ ¦</text>

should be (in the proper math layout for this formula):

    y(t) = integral [above: inf, below: -inf] h(u)x(t - u)du

where the h(u)x(t - u)du bit is in the stylised script used in maths.

My initial thought is that the characters are referencing the Unicode
codepoints (e.g. in the U+2100 range). However, these all appear to be
in the ASCII range (i.e. not multi-byte UTF-8 as the encoding
suggests, but I may be wrong as there look to be more characters than
what is displayed).

Instead, they look like they are codepoints into a special
mathematical font (e.g. Symbol(?) in Windows (I don't have a Windows
box to hand at the moment, so can't verify the font name)). This would
make sense given the font="7" attribute and the seemingly random
characters. And given the greater number of characters, this looks to
be using a non-UTF-8 multi-byte encoding.
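
One way to check the guesses above about single-byte versus multi-byte characters is to decode the raw bytes of the suspect line and print each character's code point next to its encoded length. This is only a sketch: `describe_chars` is a hypothetical helper, and it assumes the file really is UTF-8 as its XML declaration claims.

```python
def describe_chars(raw: bytes, encoding: str = "utf-8"):
    """Return ("U+XXXX", encoded byte length) for each character in raw."""
    return [
        (f"U+{ord(ch):04X}", len(ch.encode(encoding)))
        for ch in raw.decode(encoding)
    ]


# Made-up sample: a Latin-1 symbol, a control character, and a U+21xx letter.
for codepoint, size in describe_chars("\u00a5\x11\u210e".encode("utf-8")):
    print(codepoint, size)
```

Characters such as ¥ or § sit in the U+00A0 to U+00FF range and take two bytes each in UTF-8, while control characters below 0x20 take one byte, so the listing makes it easy to tell the two apart.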

Someone will need to dig around in the pdftohtml code and the
rendering of non-ASCII characters.

HTH,
- Reece

Re: pdftohtml produces invalid XML

by Reece Dunn-2

As a follow-up...

Not using the -xml option of pdftohtml causes it to write an HTML file
that is similarly mangled w.r.t. the characters in the formula (from
the integral to the du differential component).

In addition to this, the layout does not match the formula for the
integral (not sure whether the ¢ is meant to be the integral sign or
not; if it is supposed to be the infinity sign, it should be above the
integral, not below it) and the font size is not consistent with the
"y(t) = " part. These rendering issues are obviously orthogonal to the
encoding issue.

- Reece

Re: pdftohtml produces invalid XML

by Piotr Findeisen

Hello!   

On 03.11.2009 23:38, Reece Dunn wrote:

> # wget -q http://www.tml.tkk.fi/Studies/T-110.557/2002/papers/burlacu_mihai.pdf && \
>     pdftohtml -xml -i -c -f 1 -l 1 -noframes burlacu_mihai.pdf x && \
>     python -c 'from xml.parsers.expat import ParserCreate; ParserCreate().ParseFile(open("x.xml"))'
>
> I'm not sure what the fix is, but the line with the error is:
>     <text top="632" left="152" width="58" height="0"
> font="7">¥§¦©¨¥§¦¨ ¦</text>
> and firefox gives:
>     <text top="632" left="152" width="58" height="0"
> font="7">¥§¦©¨¥§¦¨ ¦</text>
>     ---------------------------------------------------------------^
> (that is -- it is choking on the [00|11] character; there are also
> other characters in the latin-1 control character range (c < 0x20)).

Right. 0x11 is the first one to cause a problem with the Python XML parser.

> My initial thought is that the characters are referencing the Unicode
> codepoints (e.g. in the U+2100 range). However, these all appear to be
> in the ASCII range (i.e. not multi-byte UTF-8 as the encoding
> suggests, but I may be wrong as there look to be more characters than
> what is displayed).

These problematic characters are all ASCII control characters.

> Instead, they look like they are codepoints into a special
> mathematical font (e.g. Symbol(?) in Windows (I don't have a Windows
> box to hand at the moment, so can't verify the font name)). This would
> make sense given the font="7" attribute and the seemingly random
> characters. And given the greater number of characters, this looks to
> be using a non-UTF-8 multi-byte encoding.

The font="7" attribute is generated by "pdftohtml -xml"; it is a reference to
the <font id="7" ...... /> element near the top of the produced XML document.

And yes, there is some font mapping involved. I tried writing the equation in a new .tex document, but the resulting PDF contained only characters I recognize and can read, no matter how I produced the PDF (pdflatex, latex and dvipdf, etc.).

> Someone will need to dig around in the pdftohtml code and the
> rendering of non-ASCII characters.

I agree this is where the problem begins, though I've never seen pdftohtml's source...

best regards,
Piotr





Re: pdftohtml produces invalid XML

by Piotr Findeisen



On 03.11.2009 22:43, Albert Astals Cid wrote:
> Can you please post a bug at bugs.freedesktop.org?

Here it is: https://bugs.freedesktop.org/show_bug.cgi?id=24890

thanks,
Piotr




