unable to parse PDF :(



by tarunsapra


Hi people,

I have searched a lot, but I cannot find an answer to this problem: "Unable to successfully parse content". What confuses me most is that with standalone Lucene code I can extract text from the PDFs, but when I use Nutch it gives a parsing error.

Thanks

Re: unable to parse PDF :(

by Kirby Bohling-2


On Fri, Nov 6, 2009 at 11:42 AM, tarunsapra <t.sapra97@...> wrote:

>
> Hi People,,
>
> I have searched a lot but i am not able to find the answer to this problem
> "Unable to successfully parse content" ...and the thing that most confuses
> me is ..using a standalone Lucene code i am able to extract text from the
> PDFs but when i use Nutch , then it's giving parsing error.
>
> Thanks
> --
> View this message in context: http://old.nabble.com/unable-to-parse-PDF-%3A%28-tp26230843p26230843.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>

Have you verified that it isn't getting cut off due to the HTTP file size
download limit?  Out of the box, I believe that Nutch only downloads
64K, and I've seen lots of PDFs get cut off and become unparseable in our
crawls because of that.
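
One rough way to check for this is sketched below in Python. It assumes
you have a copy of the bytes that were actually fetched saved to a local
file (how you dump them is up to you), and that the cap is 65536 bytes,
which I believe is the default for the http.content.limit property.
The function name and the constant are made up for illustration. A
complete PDF normally carries a %%EOF marker near its end, so a file
that is exactly the cap size and lacks that marker was most likely
truncated.

```python
import sys

# Assumed download cap in bytes (64K); adjust if your configuration differs.
CONTENT_LIMIT = 65536


def looks_truncated(data, limit=CONTENT_LIMIT):
    """Heuristic: the content is exactly `limit` bytes long and the
    PDF end-of-file marker is absent from the last 1024 bytes."""
    return len(data) == limit and b"%%EOF" not in data[-1024:]


if __name__ == "__main__":
    # Usage: python check_pdf.py fetched1.pdf fetched2.pdf ...
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            content = f.read()
        status = "probably truncated" if looks_truncated(content) else "looks complete"
        print(f"{path}: {len(content)} bytes, {status}")
```

If the check fires, raising the limit in your site configuration and
re-fetching the PDFs would be the next thing to try.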

Thanks,
   Kirby