nutch refetch by db.fetch.interval.default not working

Hi,

I am using Nutch 1.0 with Cygwin on Windows XP. I plan to fetch a set of URLs regularly, only up to depth 1. There are 5 URLs, all on the same host, listed in the urls folder in the Nutch home directory.

The problem I face: although I set "db.fetch.interval.default" to 1 second in nutch-site.xml, I do not see it taking effect. The process starts, fetches these 5 URLs and ends. Why are these 5 URLs not fetched continuously before the process terminates? (Even allowing for adaptive fetch interval changes, I expect it to fetch them at least 2-3 times.)

When it is time to fetch a URL, what exactly happens? Will the URL be added to the CURRENT FETCHLIST? I want these URLs to be fetched without interruption. Another observation: the URLs are fetched exactly ONCE more when I increase the depth to 2.

Are there any extra changes to be made to ACTIVATE RE-FETCHING of URLs?

Kindly help

Re: nutch refetch by db.fetch.interval.default not working

If you want to recrawl URLs, you have to generate a new segment, fetch this segment and update the crawl db.

Example script:

bin/nutch generate crawl/crawldb crawl/segments -topN $topN -adddays $adddays
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment -normalize -filter

Or, if you use the crawl tool, you have to use a depth > 1. Depth is the number of generate/fetch/update rounds the crawl tool runs, i.e. the number of (re)crawls. The crawl tool does the same as the script above.

The fetcher does not continuously crawl URLs. It crawls the URLs in a segment once, and the next fetch time is updated according to the fetch interval.

Sista Sasidhar wrote:
> So why are these 5 URLs not fetched continuously, before
> process termination.
> [...]
> Are there any extra changes to be made to ACTIVATE RE-FETCHING of URLs !?
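
The generate, fetch, updatedb cycle described in this reply can be wrapped in a loop to get periodic refetching. The following is only a minimal sketch, assuming it is run from the Nutch home directory with the crawl db under crawl/. The names NUTCH, CRAWL_DIR, TOPN, recrawl_round and recrawl_loop are hypothetical and are not part of Nutch; only the three nutch commands and their options are taken from the script in the reply.

```shell
#!/bin/sh
# Hypothetical settings (not Nutch options); override them from the environment.
NUTCH=${NUTCH:-bin/nutch}        # path to the nutch launcher
CRAWL_DIR=${CRAWL_DIR:-crawl}    # directory holding crawldb/ and segments/
TOPN=${TOPN:-1000}               # upper bound on URLs per segment

# One generate/fetch/updatedb round, mirroring the example script in the reply.
recrawl_round() {
    "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$TOPN" || return 1
    # Pick the last segment directory in sorted order (the newest one).
    segment=$(ls -d "$CRAWL_DIR"/segments/* | tail -1)
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment" -normalize -filter
}

# Run ROUNDS rounds, sleeping PAUSE seconds between them; stop if a step fails.
recrawl_loop() {
    rounds=$1
    pause=$2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        recrawl_round || return 1
        i=$((i + 1))
        if [ "$i" -lt "$rounds" ]; then
            sleep "$pause"
        fi
    done
    return 0
}
```

Calling `recrawl_loop 3 60` would run three rounds one minute apart. A URL is only picked up again in a later round if its next fetch time has passed by then, so the pause should be at least as long as the configured fetch interval.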

Re: nutch refetch by db.fetch.interval.default not working

Yes, I understand that the script you gave is perfectly correct, and that is actually my backup option. But I want to know why the db.fetch.interval.default option exists at all.

You said in your last paragraph: "it crawls the urls in a segment once and the *next fetchtime is updated according to fetch interval*." Why does the fetcher do this update? As I see it, the purpose is that IF THE CRAWLER is still actively running, the URL has to be added to the CURRENT fetchlist, and this addition is expected to happen at about the *NEXT FETCHTIME* of that URL. Otherwise I don't see the purpose of updating the next fetch time of a URL, if it is in my hands to run the script whenever I want. Why should it care about updating the NEXT FETCHTIME?

Kindly reply.

Thank you

On Wed, Nov 4, 2009 at 6:53 PM, reinhard schwab <reinhard.schwab@...> wrote:
> the fetcher does not continuously crawl urls.
> it crawls the urls in a segment once and the next fetchtime is
> updated according to fetch interval.
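
One way to picture the role of the next fetch time: the generate step only puts a URL into a new segment if that URL's next fetch time has already passed, and -adddays shifts the clock used for that comparison forward. The following is a toy model of that check, not Nutch code; is_due and next_fetch_time are hypothetical helper names, and times are plain epoch seconds.

```shell
#!/bin/sh
# Toy model only (not Nutch code); all names here are hypothetical.

# next_fetch_time FETCHED_AT INTERVAL
# Prints the earliest time at which the URL becomes eligible again.
next_fetch_time() {
    echo $(( $1 + $2 ))
}

# is_due NEXT_FETCH NOW [ADD_DAYS]
# Succeeds if a generate run at time NOW would select the URL.
# ADD_DAYS models -adddays: it moves the comparison time forward.
is_due() {
    next_fetch=$1
    now=$2
    add_days=${3:-0}
    [ "$next_fetch" -le $(( now + add_days * 86400 )) ]
}
```

In this model, a URL fetched at time 1000 with a 1 second interval is not due at 1000 but is due at 1001, so a second generate round started a moment later selects it again. With an interval of 2592000 seconds (30 days), the same URL is skipped by every round until 30 days have passed, unless -adddays is used. The interval therefore does not make the fetcher loop; it decides which URLs a later generate run is allowed to pick.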