On Fri, Nov 6, 2009 at 11:42 AM, tarunsapra <t.sapra97@...> wrote:
>
> Hi people,
>
> I have searched a lot but I am not able to find the answer to this problem:
> "Unable to successfully parse content". What confuses me most is that using
> standalone Lucene code I am able to extract text from the PDFs, but when I
> use Nutch, it gives a parsing error.
>
> Thanks
> --
> View this message in context:
> http://old.nabble.com/unable-to-parse-PDF-%3A%28-tp26230843p26230843.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
Have you verified that the files aren't getting cut off by the HTTP
download size limit? Out of the box, I believe that Nutch only downloads
the first 64K of each file, and I've seen lots of PDFs get cut off and
become unparseable in our crawls because of that.
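
One rough way to check, as a sketch rather than anything Nutch-specific: a complete PDF normally ends with a %%EOF marker, so fetched content that is at least as large as the limit and has no such marker near the end was probably cut off. The class name, the 1 KB tail window, and the 65536-byte limit below are assumptions for illustration. As far as I know, the limit in Nutch is the http.content.limit property (65536 by default, overridable in conf/nutch-site.xml).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Rough heuristic for spotting PDFs that were truncated by a fetch size limit.
// Hypothetical helper for illustration; not part of Nutch.
final class PdfTruncationCheck {

    // Assumed fetch limit of 64K; adjust to match your own configuration.
    static final int ASSUMED_LIMIT_BYTES = 65536;

    // A complete PDF normally ends with "%%EOF"; look for it in the last 1 KB.
    static boolean hasEofMarker(byte[] content) {
        int start = Math.max(0, content.length - 1024);
        String tail = new String(content, start, content.length - start,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    // Content that reached the limit and has no trailing marker was likely cut off.
    static boolean looksTruncated(byte[] content, int limitBytes) {
        return content.length >= limitBytes && !hasEofMarker(content);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java PdfTruncationCheck <file.pdf>");
            return;
        }
        byte[] content = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("bytes read: " + content.length);
        System.out.println("has %%EOF marker: " + hasEofMarker(content));
        System.out.println("looks truncated: "
                + looksTruncated(content, ASSUMED_LIMIT_BYTES));
    }
}
```

Run it against a saved copy of the bytes the crawler actually fetched, not the original file on the server. If it reports a truncated file, raising the limit and re-fetching is the next thing to try.
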
Thanks,
Kirby