pdftohtml outputs hidden text

Hi!

I ran across a problem: pdftohtml and pdftotext sometimes output hidden text, even when the -hidden switch is not used (in pdftohtml). Example:

    wget -c http://www2.ing.unipi.it/ew2002/proceedings/114.pdf && pdftotext 114.pdf - | grep 'Picture to be added here'

When you view http://www2.ing.unipi.it/ew2002/proceedings/114.pdf in Kpdf or Acrobat, you can search for 'Picture to be added here'; the match is on the first page, right under the "Typical BWA network layout." image. However, the text is not actually displayed there.

"pdftohtml -xml -i -c -f 1 -l 1 -noframes 114.pdf" lists this text as

    <fontspec id="16" size="13" family="Times" color="#0000ff"/>

but it gives no clue that the text is not painted on the screen.

Is this some special feature of PDF that causes some text not to be displayed, or to be displayed with 0% opacity? Is it possible to capture this metadata with pdftohtml, or with the poppler suite in general?

best regards,
Piotr

_______________________________________________
poppler mailing list
poppler@...
http://lists.freedesktop.org/mailman/listinfo/poppler
Re: pdftohtml outputs hidden text

On Wednesday, 4 November 2009, Piotr Findeisen wrote:
> Hi!
>
> I run across a problem that pdftohtml and pdftotext sometimes outputs
> hidden text, even when not using -hidden switch (in pdftohtml).
> Example:
>
> wget -c http://www2.ing.unipi.it/ew2002/proceedings/114.pdf && pdftotext
> 114.pdf - | grep 'Picture to be added here'
>
> When you view http://www2.ing.unipi.it/ew2002/proceedings/114.pdf in
> Kpdf or Acrobat, you can search for 'Picture to be added here' - it's on
> the first page, right under the " Typical BWA network layout." image.
> But well, it's not really displayed there.
>
> "pdftohtml -xml -i -c -f 1 -l 1 -noframes 114.pdf" lists this text as
> <fontspec id="16" size="13" family="Times" color="#0000ff"/>
> but it gives no clue that the text is not printed on the screen.
>
> Is this some special feature of PDF that causes some text to be not
> displayed or displayed with 0% opacity?

From a quick look at the code it seems the code is creating a clip path outside where the text is rendered, effectively rendering nothing.

> Is it possible to capture this meta data with pdftohtml or generally
> with poppler suite?

It is, you'll have to make the text tools take the clip areas into account, not an easy task.

Albert

>
> best regards,
> Piotr
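
To illustrate what "taking the clip areas into account" could mean, here is a minimal Python sketch. It is not poppler code and uses no poppler API: Rect, TextRun, visible_runs and sample_runs are hypothetical names, the coordinates are made up, and it assumes the extractor has somehow recorded, for each text run, the bounding box of the clip that was active when the run was drawn. Real clip paths are arbitrary shapes, so a rectangle is only an approximation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle (hypothetical helper, not a poppler type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection_area(self, other: "Rect") -> float:
        # Overlap width and height; non-positive means no overlap.
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return w * h if w > 0 and h > 0 else 0.0


@dataclass(frozen=True)
class TextRun:
    """A piece of extracted text plus the clip assumed active when drawn."""
    text: str
    bbox: Rect
    clip: Rect


def visible_runs(runs: List[TextRun], min_visible_fraction: float = 0.0) -> List[TextRun]:
    """Keep runs whose bounding box overlaps their clip by more than the threshold."""
    kept = []
    for run in runs:
        area = run.bbox.area()
        if area <= 0:
            continue  # degenerate box, nothing could be painted
        fraction = run.bbox.intersection_area(run.clip) / area
        if fraction > min_visible_fraction:
            kept.append(run)
    return kept


# Made-up data: the second run is drawn while the clip lies elsewhere on the page.
sample_runs = [
    TextRun("Typical BWA network layout.", Rect(100, 400, 300, 415), Rect(0, 0, 612, 792)),
    TextRun("Picture to be added here", Rect(100, 300, 300, 315), Rect(400, 500, 500, 600)),
]

for run in visible_runs(sample_runs):
    print(run.text)
```

With this made-up data only the caption survives the filter, because the second run's box does not overlap its clip at all. The hard part Albert mentions is upstream of this check: tracking the clip state through the page's drawing operations so that each run knows which clip applied to it.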