error nutch recrawl

Hi,

I am trying to recrawl a website (it links to other pages: .html, .doc, .pdf, ...) with a shell script. I already have a webDB, but this error occurred:

...
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20090703140431
Generator: filtering: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20090703111416
Exception in thread "main" java.io.IOException: Segment already fetched!
        at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:50)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1003)
...

Does anyone have an idea how to resolve this problem?

Thanks to all,

Maurizio [croci.maurizio@...]

Re: error nutch recrawl

You can use bin/hadoop fs -rmr crawl to delete the whole directory and then recrawl.

On Tue, Jul 7, 2009 at 1:47 AM, Maurizio Croci <croci.maurizio@...> wrote:
> ...
> Generator: 0 records selected for fetching, exiting ...
> ...
> Fetcher: segment: crawl/segments/20090703111416
> Exception in thread "main" java.io.IOException: Segment already fetched!
> ...
>
> someone have any idea to resolve this problem?
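
Deleting the crawl directory removes everything stored under it, so the next run starts from scratch. The log above also hints at a narrower cause: the Generator selected 0 records and exited, and the Fetcher was then started on an older segment (20090703111416) that had already been fetched. A minimal sketch of a guard for the recrawl script follows, assuming the segments live on the local filesystem (the jobtracker is 'local'), that segment directory names start with a timestamp, and that a fetched segment contains a crawl_fetch subdirectory. The function names and the crawl/crawldb path are hypothetical and not taken from the original script.

```shell
#!/bin/sh

# Hypothetical helper: print the newest segment directory under $1.
# Prints nothing if the directory is missing or has no segments.
# Assumption: segment names are timestamps, so a plain sort orders them by age.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | sort | tail -n 1
}

# Hypothetical helper: succeed only if segment $1 exists and is not yet fetched.
# Assumption: the fetcher writes its output into <segment>/crawl_fetch.
segment_is_unfetched() {
    [ -n "$1" ] && [ -d "$1" ] && [ ! -d "$1/crawl_fetch" ]
}

# Hypothetical recrawl step: generate, then fetch only a newly created segment.
# $1 = crawldb path (assumed crawl/crawldb), $2 = segments directory.
generate_and_fetch() {
    before=$(latest_segment "$2")
    bin/nutch generate "$1" "$2" -topN 10
    segment=$(latest_segment "$2")
    if [ "$segment" != "$before" ] && segment_is_unfetched "$segment"; then
        bin/nutch fetch "$segment"
    else
        echo "No new segment generated; skipping fetch."
    fi
}
```

In a script run from the Nutch home directory this would be called as generate_and_fetch crawl/crawldb crawl/segments. When the Generator selects 0 records, no new segment appears and the fetch step is skipped instead of failing on the old segment.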