Re: Per-host fetch-interval

by Sandeep Tata


Thanks Andrzej.
I'm planning to modify the update tool so that it resets the fetchInterval in the crawldb for hosts listed in a separate file.
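
For illustration only, the host lookup such a change needs could look roughly like the sketch below. It is plain Java with no Nutch dependency. The class name `HostIntervals`, the "host interval-in-seconds" file format, and the example hosts are all assumptions made for this sketch, not anything that exists in Nutch.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): maps hosts to fetch intervals. */
final class HostIntervals {
    private final Map<String, Integer> secondsByHost = new HashMap<>();

    /**
     * Parses lines of the assumed form "host intervalSeconds".
     * Blank lines and lines starting with '#' are skipped.
     */
    static HostIntervals parse(List<String> lines) {
        HostIntervals result = new HostIntervals();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad line: " + raw);
            }
            result.secondsByHost.put(
                parts[0].toLowerCase(Locale.ROOT), Integer.parseInt(parts[1]));
        }
        return result;
    }

    /**
     * Returns the configured interval for the URL's host, or
     * currentInterval unchanged when the host has no override.
     */
    int intervalFor(String url, int currentInterval) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return currentInterval;
        }
        return secondsByHost.getOrDefault(
            host.toLowerCase(Locale.ROOT), currentInterval);
    }
}
```

The modified update step would then call something like `intervalFor(url, existingInterval)` for each entry and write the result back. How that hooks into the actual crawldb update job is left open here.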



On Wed, Jun 24, 2009 at 1:39 AM, Andrzej Bialecki <ab@...> wrote:
Sandeep Tata wrote:
Hi,

I was wondering what would be the best way to configure per-host
re-crawl intervals. The default db.fetch.interval applies to all URLs,
but I'd like for some hosts to be recrawled more frequently. Is there
a JIRA ticket open on this? I haven't been able to find one.

Technically speaking, the fetch interval can be set on individual CrawlDatum-s in crawldb. In practice, there is no command-line tool to do this, and I don't think there is a JIRA on this.

One idea would be to modify the Injector to accept a list of URL-s with matching metadata and, among other things, to recognize predefined metadata keys such as fetchInterval. On the initial injection, all values in CrawlDatum would be set according to the metadata (or set to defaults). On subsequent injections, if a URL already exists in CrawlDb, its metadata would be reset to the values supplied in the injector file.

This should be easy to implement, and I think it would support your use case.
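
A minimal sketch of those injection semantics, in plain Java with an in-memory map standing in for CrawlDb. Everything named here is an assumption for illustration: the class `InjectSketch`, the tab-separated `url<TAB>fetchInterval=seconds` line format, and the placeholder default value. It is not the actual Injector code. Leaving an existing entry untouched when a later line carries no metadata is also an assumption; the description above only covers the case where values are supplied.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the proposed Injector behaviour. */
final class InjectSketch {
    /** Placeholder default interval in seconds (assumed, 30 days). */
    static final int DEFAULT_INTERVAL = 2_592_000;

    /** Stand-in for CrawlDb: URL -> fetch interval in seconds. */
    private final Map<String, Integer> crawlDb = new HashMap<>();

    /** Injects one line of the assumed form "url<TAB>key=value<TAB>...". */
    void inject(String line) {
        String[] fields = line.trim().split("\t");
        String url = fields[0];
        Integer supplied = null;
        for (int i = 1; i < fields.length; i++) {
            String[] kv = fields[i].split("=", 2);
            if (kv.length == 2 && kv[0].trim().equals("fetchInterval")) {
                supplied = Integer.valueOf(kv[1].trim());
            }
        }
        if (supplied != null) {
            // New URL: set from metadata. Existing URL: reset to supplied value.
            crawlDb.put(url, supplied);
        } else {
            // New URL: fall back to the default. Existing URL: left as-is.
            crawlDb.putIfAbsent(url, DEFAULT_INTERVAL);
        }
    }

    /** Returns the stored interval, or null if the URL was never injected. */
    Integer intervalOf(String url) {
        return crawlDb.get(url);
    }
}
```

Injecting `http://a.example.com/` with `fetchInterval=3600` and later with `fetchInterval=86400` leaves the entry at 86400. A URL injected without metadata gets the default.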

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
