Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

by Xiao Yang

I often get this error message while crawling the intranet.
Is it a network problem? What can I do about it?

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 4

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 4
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20090705212324
Generator: filtering: true
Generator: topN: 4
Generator: Partitioning selected urls by host, for politeness.
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.crawl.Generator.generate(Generator.java:524)
    at org.apache.nutch.crawl.Generator.generate(Generator.java:409)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:116)

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

by beyiwork

Yes, I am also running into this problem. Can anyone help?

On Sun, Jul 5, 2009 at 11:33 PM, xiao yang <yangxiao9901@...> wrote:

> I often get this error message while crawling the intranet.
> Is it a network problem? What can I do about it?

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

by beyiwork

Can anyone help? I'm so disappointed.

On Fri, Jul 10, 2009 at 4:29 PM, lei wang <nutchmaillist@...> wrote:

> Yes, I am also running into this problem. Can anyone help?

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

by Andrzej Bialecki

lei wang wrote:

> Can anyone help? I'm so disappointed.

If you are running a large crawl on a single machine, you could be
running out of file descriptors. Please check "ulimit -n"; the value
should be much larger than 1024.
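
The file descriptor check can also be scripted. The following is only a
minimal sketch, assuming a Unix-like system where Python's standard
"resource" module is available; the helper name fd_limit_ok and the
constant MIN_RECOMMENDED are made up for illustration, and 1024 is
simply the figure mentioned above, not an official threshold.

```python
import resource

# Figure taken from the advice above; an illustrative floor, not an official value.
MIN_RECOMMENDED = 1024


def fd_limit_ok(soft_limit, minimum=MIN_RECOMMENDED):
    """Return True if the soft open-file limit is above the given minimum."""
    # RLIM_INFINITY means the limit is unlimited, which is always sufficient.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit > minimum


# Query the limits for the current process (the same soft value "ulimit -n" reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
if not fd_limit_ok(soft):
    print("Open-file limit looks low; consider raising it before crawling.")
```

Run it from the same shell and user account that starts the crawl,
since the limit is inherited per process.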

Also, please check hadoop.log for clues as to why shuffle fetching
failed. It could be something as trivial as a blocked port, a routing
problem, a DNS resolution problem, or the problem I mentioned above.
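
One way to look through hadoop.log for such clues is sketched below.
This is an illustration under assumptions, not a definitive tool: the
log path logs/hadoop.log, the helper name find_clues, and the list of
substrings in CLUES are all assumed for the example and may need to be
adjusted for your installation.

```python
from pathlib import Path

# Substrings chosen for illustration only; this is an assumed, non-exhaustive list.
CLUES = (
    "Shuffle Error",
    "Too many open files",
    "Connection refused",
    "UnknownHostException",
)


def find_clues(lines, clues=CLUES):
    """Return (line_number, text) pairs for lines containing any clue substring."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(clue in line for clue in clues):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    log_path = Path("logs/hadoop.log")  # assumed location; adjust as needed
    if log_path.exists():
        with log_path.open(errors="replace") as handle:
            for number, text in find_clues(handle):
                print(f"{number}: {text}")
    else:
        print(f"{log_path} not found; adjust the path for your installation.")
```

The lines just before the first match are usually worth reading too,
since the underlying cause is often logged ahead of the shuffle error.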

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com