Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
I often get this error message while crawling the intranet.
Is it a network problem? What can I do about it?

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 4

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 4
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20090705212324
Generator: filtering: true
Generator: topN: 4
Generator: Partitioning selected urls by host, for politeness.
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.crawl.Generator.generate(Generator.java:524)
    at org.apache.nutch.crawl.Generator.generate(Generator.java:409)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:116)

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Yes, I am also running into this problem. Can anyone help?

On Sun, Jul 5, 2009 at 11:33 PM, xiao yang <yangxiao9901@...> wrote:
> I often get this error message while crawling the intranet
> Is it the network problem? What can I do for it?
> [...]

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Can anyone help? I am so disappointed.

On Fri, Jul 10, 2009 at 4:29 PM, lei wang <nutchmaillist@...> wrote:
> Yes, I am also occuring to this problem. Can anyone help?
> [...]

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
lei wang wrote:
> anyone help? so disappointed.
> [...]

If you are running a large crawl on a single machine, you could be running out of file descriptors. Please check "ulimit -n"; the value should be much, much larger than 1024.

Also, please check hadoop.log for clues as to why shuffle fetching failed. This could be something as trivial as a blocked port, a routing problem, a DNS resolution problem, or the file descriptor problem I mentioned above.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
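
The two checks suggested above (the open-file limit and the contents of hadoop.log) could be scripted roughly as follows. This is only a sketch under stated assumptions: the log path `logs/hadoop.log`, the keyword list, and the helper names `fd_limit_ok` and `shuffle_clues` are illustrative choices and do not come from Nutch or Hadoop. The 1024 threshold is the one mentioned in the reply.

```python
import resource
from pathlib import Path

# Threshold from the reply above: the limit should be well above 1024.
MIN_FD_LIMIT = 1024

# Hypothetical keyword list; adjust it to whatever your hadoop.log contains.
CLUE_KEYWORDS = (
    "Shuffle Error",
    "Exception",
    "Connection refused",
    "UnknownHostException",
)


def fd_limit_ok(soft_limit, minimum=MIN_FD_LIMIT):
    """Return True if the soft open-file limit is unlimited or above `minimum`."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit > minimum


def shuffle_clues(log_text, keywords=CLUE_KEYWORDS):
    """Return the log lines that contain any of the given keywords."""
    return [
        line
        for line in log_text.splitlines()
        if any(keyword in line for keyword in keywords)
    ]


if __name__ == "__main__":
    # Same value that "ulimit -n" reports for the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open-file limit: soft={soft} hard={hard}")
    if not fd_limit_ok(soft):
        print("soft limit is at or below 1024; consider raising it")

    log_path = Path("logs/hadoop.log")  # assumed location, relative to the Nutch directory
    if log_path.exists():
        for line in shuffle_clues(log_path.read_text(errors="replace")):
            print(line)
    else:
        print(f"{log_path} not found; point log_path at your hadoop.log")
```

Run it from the directory where the crawl was started, as the same user that runs the crawl, so the reported limit matches the one the crawl process sees.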