We have crawled a million URLs and want to recrawl these sites continuously for updates.
The DFS cluster has 4 machines, with 1 master and 4 slaves. Crawling the
1 million sites took around 10 days.
What recrawl strategy would let us pick up updates quickly? How can we optimize
the Nutch recrawl script so that frequently changing sites are recrawled sooner and the index is rebuilt?
Is there some way to build the index incrementally from the crawl db?
Please suggest.