All,
I am a master's student and want to crawl the whole web for my master's project.
While trying to generate, fetch, and crawl the whole web using Nutch (I am following the steps from
http://lucene.apache.org/nutch/tutorial8.html), I got confused by various Nutch terms and their usage:
1) What is the purpose of crawl_fetch and crawldb, and what is the difference between them? If Nutch stores all the info regarding URLs in crawldb, then what is the need for crawl_fetch?
2) Moreover, what do fetch and generate do? Can anyone describe them in detail? Is there any documentation for Nutch commands like generate, fetch, etc.?
Thanks & Regards,
Gaurang Patel