What are the configuration parameters to fine tune Nutch performance

by saravan.krish

I am new to Nutch and have a few questions:
1) Which configuration parameters can be used to improve and fine-tune Nutch performance?

2) Is there any way to resume a crawl after it has failed?

Re: What are the configuration parameters to fine tune Nutch performance

by John Whelan

The default tuning parameters are specified in nutch/conf/nutch-default.xml and can be overridden in nutch/conf/nutch-site.xml. (They can also be set on the crawl command line, but I believe the 'best practice' is to configure settings in nutch-site.xml.)

My personal belief is that the two most valuable parameters for tuning the crawler are 'fetcher.threads.fetch' and 'fetcher.threads.per.host'. However, there are many other tuning parameters, and you might find more value in some of the timeout parameters. (You might also want to look at tuning your JVM heap space, but I've never seen a real need to tweak it.)
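For illustration, an override of those two fetcher settings in nutch/conf/nutch-site.xml might look roughly like the sketch below. It assumes the standard property layout used by nutch-default.xml; the values 20 and 2 are arbitrary placeholders, not recommendations, and property names can differ between Nutch versions, so check them against the nutch-default.xml shipped with your install.

```xml
<?xml version="1.0"?>
<!-- nutch/conf/nutch-site.xml: entries here override nutch-default.xml -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <!-- Placeholder value (assumption): total number of fetcher threads -->
    <value>20</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <!-- Placeholder value (assumption): threads allowed to hit one host at once -->
    <value>2</value>
  </property>
</configuration>
```

Keep in mind that raising the per-host thread count puts more simultaneous load on each site being crawled, so it is usually worth increasing it cautiously.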

As for resuming a failed crawl, I don't know of any way to do so. I always discard the crawl and restart.