Recrawl Strategy with Nutch!


by tittutomen

We have crawled a million URLs and want to recrawl these sites continuously for updates. The DFS cluster has 4 machines, with 1 master and 4 slaves. Crawling the 1 million sites took around 10 days.
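For scale, here is a rough back-of-the-envelope calculation of the throughput those figures imply. It is only a sketch: it assumes the 10 days were spent crawling continuously, and the function name `crawl_throughput` is made up for illustration and is not part of Nutch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_throughput(total_urls: int, days: float) -> tuple[float, float]:
    """Return (URLs per day, URLs per second) for a crawl of total_urls over days."""
    urls_per_day = total_urls / days
    urls_per_second = urls_per_day / SECONDS_PER_DAY
    return urls_per_day, urls_per_second


# Figures from this post: about 1 million URLs in about 10 days.
per_day, per_second = crawl_throughput(1_000_000, 10)
print(f"{per_day:,.0f} URLs/day, {per_second:.2f} URLs/s")
# prints "100,000 URLs/day, 1.16 URLs/s"
```

At that rate a full pass over all the URLs takes another 10 days or so, which is why we are looking for a way to recrawl only the frequently changing sites.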

 

What recrawl strategy would let us pick up updates quickly? How can we optimize the Nutch recrawl script so that frequently changing sites are recrawled sooner and the index is rebuilt?

Could we build the index incrementally from the crawl db in some way?

 

Please suggest.