nutch refetch by db.fetch.interval.default not working

by Sista Sasidhar

Hi,
I am using Nutch 1.0 with Cygwin on Windows XP.
I plan to fetch a set of URLs regularly, only up to depth 1.
There are 5 URLs, all on the same host, listed in the urls folder in
nutch-home.
The problem I face is:
Although I set "db.fetch.interval.default" to 1 second in nutch-site.xml, I
do not see it taking effect. The process starts, fetches these 5 URLs and
ends. Since db.fetch.interval.default is set to 1 second, why are these 5
URLs not fetched continuously before the process terminates? (Even allowing
for adaptive fetch interval changes, I expect them to be fetched at least
2-3 times.)

What exactly happens when it is time to fetch a URL? Will the URL be added
to the CURRENT FETCHLIST? I want these URLs to be fetched without
interruption. Another observation is that these URLs are fetched exactly
ONCE more when I increase the depth to 2.

Are there any extra changes to be made to ACTIVATE RE-FETCHING of URLs?

Kindly help

Re: nutch refetch by db.fetch.interval.default not working

by reinhard schwab

if you want to recrawl urls, you have to generate a new segment, fetch
this segment and update the crawl db.

example script:

bin/nutch generate crawl/crawldb crawl/segments -topN $topN -adddays $adddays
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment -normalize -filter

or if you use the crawl tool, you have to use a depth > 1.
depth is the number of generate/fetch/update rounds the crawl tool runs.
each round does the same as the script above.

the fetcher does not continuously crawl urls.
it crawls the urls in a segment once and the next fetchtime is
updated according to the fetch interval.
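
The three steps in the script above can be wrapped in a small loop so that the same urls are revisited several times in one run. This is only a sketch under assumptions: `recrawl_round`, `recrawl_loop`, `NUTCH` and `CRAWL_DIR` are illustrative names, not part of Nutch, and the pause between rounds probably needs to be at least as long as db.fetch.interval.default, otherwise generate may find nothing due.

```shell
# Sketch: repeat the generate/fetch/updatedb round a fixed number of times.
# NUTCH, CRAWL_DIR and both function names are hypothetical, chosen for illustration.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# One round: build a fetchlist, fetch it, fold the results back into the crawldb.
recrawl_round() {
    topN="$1"
    adddays="$2"
    # Stop early if generate fails, so an old segment is not fetched again by mistake.
    $NUTCH generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" \
        -topN "$topN" -adddays "$adddays" || return 1
    # Segment directories are named by timestamp, so the last one listed is the newest.
    segment=$(ls -d "$CRAWL_DIR"/segments/* | tail -1)
    $NUTCH fetch "$segment" || return 1
    $NUTCH updatedb "$CRAWL_DIR/crawldb" "$segment" -normalize -filter
}

# Run several rounds, sleeping "pause" seconds between them.
recrawl_loop() {
    rounds="$1"
    pause="$2"
    topN="$3"
    adddays="$4"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        recrawl_round "$topN" "$adddays" || return 1
        i=$((i + 1))
        if [ "$i" -lt "$rounds" ]; then
            sleep "$pause"
        fi
    done
    return 0
}
```

For example, `recrawl_loop 3 60 5 0` would run three rounds one minute apart with -topN 5 and -adddays 0. The loop lives in the script, not in the fetcher, which matches the point above that each fetch run handles a segment exactly once.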



Re: nutch refetch by db.fetch.interval.default not working

by Sista Sasidhar

Yes, I understand that the script you gave is perfectly correct, and that
is actually my backup option.
But I want to know why the db.fetch.interval.default option exists at all.
You said in your last paragraph:

"it crawls the urls in a segment once and the *next fetchtime is
updated according to fetch interval*."

Why is the fetcher doing this update? As I see it, the purpose is that IF THE
CRAWLER is still actively running, this URL has to be added to the CURRENT
fetchlist, and this addition is expected to happen at around the *NEXT
FETCHTIME of that URL*. Otherwise I do not see the purpose of updating the
NEXT FETCHTIME of that URL, if it is in my hands to run the script at will.
Why should it care about updating the NEXT FETCHTIME?

Kindly reply. Thank you
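
One way to read the earlier answer: the stored next fetch time is not consumed by a running fetcher. It is consulted by the next generate run, which (as far as can be told from how the generate step behaves) puts a URL on the new fetchlist only once its fetch time has passed, with -adddays shifting the comparison time forward. The arithmetic can be sketched in shell as below. `next_fetch_time` and `is_due` are hypothetical helper names working in epoch seconds; Nutch does this internally in Java, so this is an illustration of the idea and not its actual code.

```shell
# Sketch of the scheduling arithmetic, in epoch seconds.
# Both function names are hypothetical and exist only for illustration.

# Next fetch time after a fetch at time $1 with a fixed interval of $2 seconds.
next_fetch_time() {
    echo $(($1 + $2))
}

# Succeeds (exit status 0) if a URL whose next fetch time is $1 would be
# selected by a generate run at time $2 with -adddays $3 (default 0).
is_due() {
    next_fetch="$1"
    now="$2"
    adddays="${3:-0}"
    # -adddays moves the comparison time that many days into the future.
    horizon=$((now + adddays * 86400))
    [ "$next_fetch" -le "$horizon" ]
}
```

With an interval of 1 second, a URL fetched at time 1000 gets a next fetch time of 1001. A generate run at time 1000 would skip it, while any run at 1001 or later would select it, which is why a second round (depth 2) fetches the same URLs once more.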

On Wed, Nov 4, 2009 at 6:53 PM, reinhard schwab <reinhard.schwab@...> wrote:

> if you want to recrawl urls, you have to generate a new segment, fetch
> this segment
> and update the crawl db.
>
> example script:
>
> bin/nutch generate crawl/crawldb crawl/segments -topN $topN -adddays $adddays
> segment=`ls -d crawl/segments/* | tail -1`
> bin/nutch fetch $segment
> bin/nutch updatedb crawl/crawldb $segment -normalize -filter
>
> or if you use the crawl tool, you have to use a depth > 1.
> depth is the number of generate/fetch/update rounds the crawl tool runs.
> each round does the same as the script above.
>
> the fetcher does not continuously crawl urls.
> it crawls the urls in a segment once and the next fetchtime is
> updated according to the fetch interval.