Re: Per-host fetch-interval

by Andrzej Bialecki

Sandeep Tata wrote:
> Hi,
>
> I was wondering what would be the best way to configure per-host
> re-crawl intervals. The default db.fetch.interval applies to all URLs,
> but I'd like for some hosts to be recrawled more frequently. Is there
> a JIRA ticket open on this? I haven't been able to find one

Technically speaking, the fetch interval can be set on individual
CrawlDatum-s in crawldb. In practice, there is no command-line tool
to do this, and I don't think there is a JIRA issue for it.

One idea would be to modify the Injector to accept a list of URL-s with
matching metadata, including a predefined metadata key such as
fetchInterval. On the initial injection, all values in CrawlDatum would
be set according to the metadata (or set to defaults). On subsequent
injections, if a URL already exists in CrawlDb, its metadata would be
reset to the values supplied in the injector file.

This should be easy to implement, and I think it would support your use
case.
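
To make the idea concrete, here is a minimal, self-contained sketch of the
inject logic described above. It does not use any Nutch classes: the Entry
class is a stand-in for CrawlDatum, a plain Map stands in for CrawlDb, and
the tab-separated "key=value" seed-line format is an assumption made for
illustration. The interval is kept in whatever units the configured default
uses. The sketch also assumes that only values actually present on the seed
line overwrite an existing entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SeedInjectSketch {
    // Hypothetical predefined metadata key, named after the suggestion above.
    static final String FETCH_INTERVAL_KEY = "fetchInterval";

    /** Stand-in for a crawldb record; this is NOT Nutch's CrawlDatum. */
    static final class Entry {
        final String url;
        float fetchInterval;
        final Map<String, String> metadata = new LinkedHashMap<>();

        Entry(String url, float fetchInterval) {
            this.url = url;
            this.fetchInterval = fetchInterval;
        }
    }

    /**
     * Parses an assumed seed-line format: url TAB key=value TAB key=value ...
     * Returns the URL and fills 'meta' with the key/value pairs found.
     */
    static String parseSeedLine(String line, Map<String, String> meta) {
        String[] fields = line.trim().split("\t");
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // skip fields that are not key=value
            }
            meta.put(fields[i].substring(0, eq).trim(),
                     fields[i].substring(eq + 1).trim());
        }
        return fields[0];
    }

    /**
     * Injects one seed line. A new URL starts from the default interval;
     * an existing URL keeps its entry, and any values supplied on the
     * seed line replace the stored ones.
     */
    static void inject(Map<String, Entry> crawlDb, String line,
                       float defaultInterval) {
        Map<String, String> meta = new LinkedHashMap<>();
        String url = parseSeedLine(line, meta);

        Entry entry = crawlDb.get(url);
        if (entry == null) {
            entry = new Entry(url, defaultInterval);
            crawlDb.put(url, entry);
        }

        // The predefined key sets the per-URL interval; a non-numeric
        // value makes Float.parseFloat throw NumberFormatException.
        String interval = meta.remove(FETCH_INTERVAL_KEY);
        if (interval != null) {
            entry.fetchInterval = Float.parseFloat(interval);
        }
        // Everything else is stored as ordinary metadata.
        entry.metadata.putAll(meta);
    }
}
```

With this sketch, injecting "http://example.com/<TAB>fetchInterval=7" into
an empty map creates an entry with interval 7, and injecting the same URL
again later with a different value resets it. A per-host setup would then
amount to generating seed lines with the desired interval for each host.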

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
