Deleting stale URLs from Nutch/Solr

Hi,

We are using Nutch to crawl an internal site, and index content to Solr. The issue is that the site is run through a CMS, and occasionally pages are deleted, so that the corresponding URLs become invalid. Is there any way that Nutch can discover stale URLs during recrawls, or is the only solution a completely fresh crawl? Also, is it possible to have Nutch automatically remove such stale content from Solr?

I am stumped by this problem, and would appreciate any pointers, or even thoughts on this.

Regards,
Gora

Re: Deleting stale URLs from Nutch/Solr

Gora Mohanty wrote:
> Hi,
>
> We are using Nutch to crawl an internal site, and index content
> to Solr. The issue is that the site is run through a CMS, and
> occasionally pages are deleted, so that the corresponding URLs
> become invalid. Is there any way that Nutch can discover stale
> URLs during recrawls, or is the only solution a completely fresh
> crawl? Also, is it possible to have Nutch automatically remove
> such stale content from Solr?
>
> I am stumped by this problem, and would appreciate any pointers,
> or even thoughts on this.

Hi,

Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in the Nutch CrawlDb to prevent their re-discovery (through stale links pointing to these URLs from other pages). If you really want to remove them from CrawlDb you can filter them out (using CrawlDbMerger with just one input db, and setting your URLFilters appropriately).

Now when it comes to removing them from Solr ...

The simplest (no coding) way would be to dump the CrawlDb, use some scripting tools to collect just the URLs with the status GONE, and send them as a <delete> command to Solr.

A slightly more involved solution would be to implement a tool that reads such URLs directly from CrawlDb (using e.g. the CrawlDbReader API) and then uses the SolrJ API to send the same delete requests, followed by a commit.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

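The "no coding" route described above (dump the CrawlDb, collect the URLs marked as gone, send them to Solr as a <delete> command) could be scripted roughly as follows. This is only a sketch under several assumptions that do not come from the thread: that a text dump produced by `readdb -dump` starts each record with the URL, a tab, and a `Version:` field, followed by a `Status: N (name)` line; that the Solr unique key field holds the page URL; and that the function names, the sample data, and any Solr update URL passed to `post_update` are hypothetical.

```python
import re
import urllib.request
from xml.sax.saxutils import escape

# Assumed shape of one record in a CrawlDb text dump:
#   http://host/page<TAB>Version: 7
#   Status: 3 (db_gone)
RECORD_START = re.compile(r"^(\S+)\tVersion: \d+")
STATUS_LINE = re.compile(r"^Status: \d+ \((\w+)\)")


def gone_urls(dump_text):
    """Return the URLs whose record has status db_gone, in dump order."""
    urls = []
    current = None
    for line in dump_text.splitlines():
        start = RECORD_START.match(line)
        if start:
            current = start.group(1)
            continue
        status = STATUS_LINE.match(line)
        if status and current is not None:
            if status.group(1) == "db_gone":
                urls.append(current)
            current = None
    return urls


def delete_xml(urls):
    """Build a Solr XML delete message with one <id> element per URL."""
    ids = "".join("<id>%s</id>" % escape(u) for u in urls)
    return "<delete>%s</delete>" % ids


def post_update(solr_update_url, xml_body):
    """POST an XML update message to Solr (not called in the demo below)."""
    request = urllib.request.Request(
        solr_update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical sample data in the assumed dump format.
SAMPLE_DUMP = (
    "http://localhost/a.html\tVersion: 7\n"
    "Status: 2 (db_fetched)\n"
    "\n"
    "http://localhost/c.html?x=1&y=2\tVersion: 7\n"
    "Status: 3 (db_gone)\n"
)

if __name__ == "__main__":
    print(delete_xml(gone_urls(SAMPLE_DUMP)))
```

In real use, the delete message would be passed to `post_update` together with the update URL of the Solr instance, followed by a second call sending `<commit/>` so that the deletions become visible. If the Solr schema keys documents on something other than the URL, a delete-by-query message would be needed instead.
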
Re: Deleting stale URLs from Nutch/Solr

On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki <ab@...> wrote:
[...]
> Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
> They are kept in Nutch crawldb to prevent their re-discovery
> (through stale links pointing to these URLs from other pages).
> If you really want to remove them from CrawlDb you can filter
> them out (using CrawlDbMerger with just one input db, and setting
> your URLFilters appropriately).
[...]

Thank you for your help. Your suggestions look promising, but I think that I did not make myself adequately clear. Once we have completed a site crawl with Nutch, ideally I would like to be able to find stale links without doing a complete recrawl, i.e., only through restarting the crawl from where it last left off. Is that possible?

I tried a simple test on a local webserver with five pages in a three-level hierarchy. The crawl completes, and discovers all five URLs as expected. Now, I remove a tertiary page. Ideally, I would like to be able to run a recrawl, and have Nutch discover the now-missing URL. However, when I try that, it finds no new links, and exits. "./bin/nutch readdb crawl/crawldb -stats" shows me:

  CrawlDb statistics start: crawl/crawldb
  Statistics for CrawlDb: crawl/crawldb
  TOTAL urls: 5
  retry 0: 5
  min score: 0.333
  avg score: 0.4664
  max score: 1.0
  status 2 (db_fetched): 5
  CrawlDb statistics: done

Regards,
Gora

Re: Deleting stale URLs from Nutch/Solr

Gora Mohanty wrote:
> On Mon, 26 Oct 2009 17:26:23 +0100
> Andrzej Bialecki <ab@...> wrote:
> [...]
>> Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
>> They are kept in Nutch crawldb to prevent their re-discovery
>> (through stale links pointing to these URLs from other pages).
>> If you really want to remove them from CrawlDb you can filter
>> them out (using CrawlDbMerger with just one input db, and setting
>> your URLFilters appropriately).
> [...]
>
> Thank you for your help. Your suggestions look promising, but I
> think that I did not make myself adequately clear. Once we have
> completed a site crawl with Nutch, ideally I would like to be
> able to find stale links without doing a complete recrawl, i.e.,
> only through restarting the crawl from where it last left off. Is
> that possible?
>
> I tried a simple test on a local webserver with five pages in a
> three-level hierarchy. The crawl completes, and discovers all
> five URLs as expected. Now, I remove a tertiary page. Ideally,
> I would like to be able to run a recrawl, and have Nutch discover
> the now-missing URL. However, when I try that, it finds no new
> links, and exits.

I assume you mean that the "generate" step produces no new URLs to fetch? That's expected, because they become eligible for re-fetching only after Nutch considers them expired, i.e. after fetchTime + fetchInterval, and the default fetchInterval is 30 days.

You can pretend that time has moved on by using the -adddays parameter. Nutch will then generate a new fetchlist, and when it discovers that the page is missing it will mark it as gone. You could then take that information directly from the Nutch segment: instead of processing the CrawlDb, you could process the segment to collect a partial list of gone pages.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

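The expiry rule explained above can be illustrated with a small Python model. It is a simplification for illustration only, not Nutch's actual generator code: it assumes a URL becomes eligible once the clock reaches the last fetch time plus the fetch interval, and that -adddays simply moves that clock forward by the given number of days. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

DEFAULT_FETCH_INTERVAL_DAYS = 30  # default interval quoted in the thread


def is_due(last_fetch, now, interval_days=DEFAULT_FETCH_INTERVAL_DAYS, add_days=0):
    """Return True if the URL would be eligible for re-fetching.

    Simplified model: a URL expires at last_fetch + interval_days, and
    add_days shifts the clock used for the comparison forward.
    """
    expires = last_fetch + timedelta(days=interval_days)
    pretend_now = now + timedelta(days=add_days)
    return pretend_now >= expires


fetched = datetime(2009, 10, 26)
next_day = datetime(2009, 10, 27)

print(is_due(fetched, next_day))               # False: only one day has passed
print(is_due(fetched, next_day, add_days=31))  # True: the clock is 31 days ahead
```

Under this model, a recrawl started the day after the first crawl selects nothing, which matches the behaviour reported in the test, while any -adddays value larger than the fetch interval makes every previously fetched URL eligible again.
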
Re: Deleting stale URLs from Nutch/Solr

On Tue, 27 Oct 2009 07:29:10 +0100
Andrzej Bialecki <ab@...> wrote:
[...]
> I assume you mean that the "generate" step produces no new URLs
> to fetch? That's expected, because they become eligible for
> re-fetching only after Nutch considers them expired, i.e. after
> the fetchTime + fetchInterval, and the default fetchInterval is
> 30 days.

Yes, it was indeed stopping at the generate step, and your explanation makes sense.

> You can pretend that the time moved on using the -adddays
> parameter.
[...]

Thanks. This worked exactly as you said. I have tested this, and the removed page indeed shows up with status db_gone, and I can now script a solution for my problem with stale URLs, along the lines that you have suggested.

Thank you very much for this quick and thorough response. As I imagine that this is a common requirement, I will write up a brief blog entry on this by the weekend, along with a solution.

Regards,
Gora