pdftohtml outputs hidden text

by Piotr Findeisen

Hi!

I ran across a problem: pdftohtml and pdftotext sometimes output
hidden text, even when the -hidden switch (in pdftohtml) is not used.
Example:

wget -c http://www2.ing.unipi.it/ew2002/proceedings/114.pdf && \
    pdftotext 114.pdf - | grep 'Picture to be added here'

When you view http://www2.ing.unipi.it/ew2002/proceedings/114.pdf in
Kpdf or Acrobat, you can search for 'Picture to be added here' and the
viewer finds it on the first page, right under the "Typical BWA network
layout." image. However, the text is not actually displayed there.

"pdftohtml -xml -i -c -f 1 -l 1 -noframes 114.pdf" lists this text with
<fontspec id="16" size="13" family="Times" color="#0000ff"/>
but it gives no indication that the text is not drawn on the screen.

Is this some special feature of PDF that causes some text not to be
displayed, or to be displayed with 0% opacity?
Is it possible to capture this metadata with pdftohtml, or more generally
with the poppler suite?

best regards,
Piotr



_______________________________________________
poppler mailing list
poppler@...
http://lists.freedesktop.org/mailman/listinfo/poppler

Re: pdftohtml outputs hidden text

by Albert Astals Cid-2

On Wednesday, 4 November 2009, Piotr Findeisen wrote:

> Is this some special feature of PDF that causes some text to be not
> displayed or displayed with 0% opacity?

From a quick look at the code, it seems a clip path is being created
outside the area where the text is rendered, so effectively nothing is drawn.

> Is it possible to capture this meta data with pdftohtml or generally
> with poppler suite?

It is, but you would have to make the text tools take the clip areas into
account, which is not an easy task.
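
To illustrate what "taking the clip areas into account" could mean, here is a
minimal Python sketch. It is not poppler code and not how poppler represents
clips: it assumes both the text bounding box and the clip area are simple
axis-aligned rectangles, whereas a real PDF clip path can be an arbitrary
shape. The names Rect, visible_fraction and is_hidden_by_clip are
hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle with x0 <= x1 and y0 <= y1 (an assumption)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(self.x1 - self.x0, 0.0) * max(self.y1 - self.y0, 0.0)

    def intersection_area(self, other: "Rect") -> float:
        # Overlap along each axis; a negative value means no overlap.
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(w, 0.0) * max(h, 0.0)


def visible_fraction(text_box: Rect, clip_box: Rect) -> float:
    """Fraction of the text box that lies inside the clip rectangle."""
    area = text_box.area()
    if area <= 0.0:
        return 0.0
    return text_box.intersection_area(clip_box) / area


def is_hidden_by_clip(text_box: Rect, clip_box: Rect) -> bool:
    """True when no part of the text box falls inside the clip rectangle."""
    return visible_fraction(text_box, clip_box) == 0.0


if __name__ == "__main__":
    clip = Rect(0, 0, 100, 100)             # hypothetical clip area
    hidden_text = Rect(200, 200, 300, 220)  # drawn entirely outside the clip
    shown_text = Rect(10, 10, 60, 20)       # drawn entirely inside the clip
    print(is_hidden_by_clip(hidden_text, clip))  # True
    print(is_hidden_by_clip(shown_text, clip))   # False
```

A text tool that tracked the current clip in this way could skip, or flag,
any string whose visible fraction is zero. Handling non-rectangular clip
paths and partially clipped strings is where the real difficulty lies.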

Albert

