error nutch recrawl

error nutch recrawl

by Maurizio Croci

Hi, I am trying to recrawl a website with a shell script (I already have a
webDB). The site has links to other pages (.html, .doc, .pdf, ...), but this
error occurred:

...
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20090703140431
Generator: filtering: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20090703111416
Exception in thread "main" java.io.IOException: Segment already fetched!
        at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:50)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1003)
...

Does anyone have an idea how to resolve this problem?

Thanks to all

Maurizio [croci.maurizio@...]

Re: error nutch recrawl

by Xiao Yang

You can use bin/hadoop fs -rmr crawl to delete the whole crawl directory and
then recrawl.
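
Note that deleting crawl also discards the existing webDB if it lives under
that directory. A less drastic option may be possible. In the log, the
Generator selected 0 records and exited, yet the Fetcher then ran on an older
segment (crawl/segments/20090703111416), which suggests the script simply
picked the newest existing segment without checking that a new one was
created. The following is only a sketch under that assumption, not the
original script: the function names, the NUTCH variable, and the
crawl/crawldb layout are hypothetical, and it assumes the local filesystem
(the log reports jobtracker 'local').

```shell
#!/bin/sh
# Print the newest segment directory under $1, or nothing if none exist.
# Segment names are timestamps, so a plain sort orders them by age.
newest_segment() {
    ls -d "$1"/2* 2>/dev/null | sort | tail -n 1
}

# Run one generate/fetch/updatedb round, fetching only if generate
# actually produced a new segment.
#   $1 = crawl directory (assumed to contain crawldb/ and segments/)
#   $2 = topN value passed to the generator
# NUTCH (hypothetical) overrides the path to the nutch launcher script.
recrawl_round() {
    crawl_dir=$1
    topn=$2
    nutch=${NUTCH:-bin/nutch}

    before=$(newest_segment "$crawl_dir/segments")
    "$nutch" generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$topn"
    after=$(newest_segment "$crawl_dir/segments")

    # No new directory means nothing was due for fetch; do not refetch
    # the previous segment.
    if [ -z "$after" ] || [ "$after" = "$before" ]; then
        echo "No new segment generated; skipping fetch."
        return 1
    fi

    "$nutch" fetch "$after" && "$nutch" updatedb "$crawl_dir/crawldb" "$after"
}
```

With these definitions loaded, a round would be started with
recrawl_round crawl 10. If it reports that no segment was generated, the
pages in the webDB are probably not yet due for refetching.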

On Tue, Jul 7, 2009 at 1:47 AM, Maurizio Croci <croci.maurizio@...> wrote:

> [...]