Incremental Whole Web Crawling

Incremental Whole Web Crawling

by Eric Osgood

My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can
crawl them in increments of 100K? For example, crawl 100K 16 times for the
TLDs, then crawl the links generated from the TLDs in increments of 100K?

Thanks,

EO

Re: Incremental Whole Web Crawling

by Andrzej Bialecki

Eric wrote:
> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's
> then crawl the links generated from the TLD's in increments of 100K?

Yes. Make sure that you have the "generate.update.db" property set to
true, and then generate 16 segments each having 100k urls. After you
finish generating them, then you can start fetching.

Similarly, you can do the same for the next level, only you will have to
generate more segments.

This could be done much more simply with a modified Generator that outputs
multiple segments from one job, but it's not implemented yet.
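
As a rough sketch of the steps above, and assuming generate.update.db is already set to true, the 16 generate runs could be scripted as below. The command line is the one confirmed later in this thread; the NUTCH variable and the generate_segments function name are illustrative assumptions, not part of Nutch.

```shell
#!/bin/sh
# Sketch only: generate several distinct fetchlists before fetching any.
# Assumes generate.update.db is true and that this runs from the Nutch
# install directory. NUTCH and generate_segments are hypothetical names.
NUTCH="${NUTCH:-bin/nutch}"

# generate_segments COUNT TOPN: run "generate" COUNT times, TOPN URLs each.
generate_segments() {
  count="$1"
  topn="$2"
  i=1
  while [ "$i" -le "$count" ]; do
    # Stop at the first failed run rather than generating further segments.
    "$NUTCH" generate crawl/crawldb crawl/segments -topN "$topn" || return 1
    i=$((i + 1))
  done
}
```

For the crawl described here, the call would be `generate_segments 16 100000`, since 1.6M seed URLs at 100K per segment gives 16 segments.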

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Incremental Whole Web Crawling

by Eric Osgood

Andrzej,

Just to make sure I have this straight: set the generate.update.db
property to true, then run

bin/nutch generate crawl/crawldb crawl/segments -topN 100000

16 times?

Thanks,

Eric

On Oct 5, 2009, at 1:27 PM, Andrzej Bialecki wrote:

> Yes. Make sure that you have the "generate.update.db" property set
> to true, and then generate 16 segments each having 100k urls. After
> you finish generating them, then you can start fetching.


Re: Incremental Whole Web Crawling

by Andrzej Bialecki

Eric wrote:
> Andrzej,
>
> Just to make sure I have this straight, set the generate.update.db
> property to true then
>
> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?

Yes. When this property is set to true, each fetchlist will be
different, because the records for pages that are already on
another fetchlist are temporarily locked. Please note that this lock
holds for only 1 week, so you need to fetch all segments within one week
of generating them.

You can fetch and updatedb in arbitrary order, so once you have fetched some
segments you can run parsing and updatedb on just those segments,
without waiting for all 16 segments to be processed.
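
A minimal sketch of processing one segment at a time, as described above. It assumes parsing is a separate step (that is, the fetcher is not configured to parse while fetching; in versions that have a fetcher.parse setting, that it is false). The NUTCH variable and the process_segment function name are hypothetical.

```shell
#!/bin/sh
# Sketch only: fetch, parse, and update the crawldb from a single segment,
# independently of the other generated segments.
# NUTCH and process_segment are hypothetical names for illustration.
NUTCH="${NUTCH:-bin/nutch}"

# process_segment SEGMENT_DIR: run the three steps, stopping on first failure.
process_segment() {
  segment="$1"
  "$NUTCH" fetch "$segment" || return 1
  "$NUTCH" parse "$segment" || return 1
  "$NUTCH" updatedb crawl/crawldb "$segment" || return 1
}
```

Each directory under crawl/segments can be passed to process_segment in any order, as long as every segment is fetched within the one-week lock period.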




Re: Incremental Whole Web Crawling

by Gaurang Patel

Hey Andrzej,

Can you tell me where to set this property (generate.update.db)? I am trying
to run a crawl scenario similar to the one Eric is running.

-Gaurang

2009/10/5 Andrzej Bialecki <ab@...>

> Yes. When this property is set to true, then each fetchlist will be
> different, because the records for those pages that are already on another
> fetchlist will be temporarily locked.

Re: Incremental Whole Web Crawling

by Gaurang Patel

Hey,

Never mind. I found *generate.update.db* in *nutch-default.xml* and set it
to true.

Regards,
Gaurang

2009/10/5 Gaurang Patel <gaurangtpatel@...>

> Can you tell me where to set this property (generate.update.db)?

Re: Incremental Whole Web Crawling

by Paul Tomblin

Don't change options in nutch-default.xml - copy the option into
nutch-site.xml and change it there.  That way the change will
(hopefully) survive an upgrade.
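
A minimal sketch of that override, assuming the property name used in this thread (check the nutch-default.xml shipped with your version for the exact name). The helper name print_override is hypothetical. It only prints the XML element, which then goes inside the <configuration> element of conf/nutch-site.xml.

```shell
#!/bin/sh
# Sketch only: print a <property> element to paste into conf/nutch-site.xml.
# print_override is a hypothetical helper; it does not modify any file.

# print_override NAME VALUE
print_override() {
  cat <<EOF
<property>
  <name>$1</name>
  <value>$2</value>
</property>
EOF
}
```

For the setting discussed here, the call would be `print_override generate.update.db true`. Printing rather than writing the file avoids clobbering overrides that are already in nutch-site.xml.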

On Tue, Oct 6, 2009 at 1:01 AM, Gaurang Patel <gaurangtpatel@...> wrote:

> Never mind. I got *generate.update.db* in *nutch-default.xml* and set it
> true.

--
http://www.linkedin.com/in/paultomblin

Re: Incremental Whole Web Crawling

by Julien Nioche-4

> This could be done much simpler with a modified Generator that outputs
> multiple segments from one job, but it's not implemented yet.

This would also be more efficient, as crawlDB operations such as generate or
update take more time as the crawlDB grows (unlike fetch and parse, whose cost is
proportional to the size of the fetchlist). When the crawlDB holds
billions of URLs, fetching and parsing take relatively little time by comparison.

generate.update.db requires reading and writing the whole crawlDB every time, but
I suppose that would be fine for a small crawlDB.

J.

--
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Incremental Whole Web Crawling

by Eric Osgood

When I set generate.update.db to true and then run generate, it only
runs twice: it generates 100K for the 1st gen, 62.5K for the second
gen, and 0 for the 3rd gen on a seed list of 1.6M. I don't understand
this. For a topN of 100K it should run 16 times and create 16 distinct
lists, if I am not mistaken.

Eric


On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote:

> Never mind. I got *generate.update.db* in *nutch-default.xml* and
> set it true.

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@..., eric@...
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com


Re: Incremental Whole Web Crawling

by Andrzej Bialecki

Eric Osgood wrote:
> When I set generate.update.db to true and then run generate, it only
> runs twice and generates 100K for the 1st gen, 62.5K for the second gen
> and 0 for the 3rd gen on a seed list of 1.6M. I don't understand this,
> for a topN of 100K it should run 16 times and create 16 distinct lists
> if I am not mistaken.

There was a bug in this code that I fixed recently - please get a new
nightly build and try it again.




Re: Incremental Whole Web Crawling

by Eric Osgood

Andrzej,

Where do I get the nightly builds from? I tried to use the Eclipse
plugin that supports svn, to no avail. Is there an FTP or HTTP server
where I can download fresh Nutch source?

Thanks,

Eric

On Oct 11, 2009, at 12:40 PM, Andrzej Bialecki wrote:

> There was a bug in this code that I fixed recently - please get a
> new nightly build and try it again.


Re: Incremental Whole Web Crawling

by Andrzej Bialecki

Eric Osgood wrote:
> Andrzej,
>
> Where do I get the nightly builds from? I tried to use the eclipse
> plugin that supports svn to no avail. Is there a ftp, http server where
> I can download the nutch source fresh?

Personally I prefer to use a command-line svn, even though I do
development in Eclipse - I'm probably old-fashioned but I always want to
be very clear on what's going on when I do an update.

See the instructions here:

http://lucene.apache.org/nutch/version_control.html




Re: Incremental Whole Web Crawling

by Eric Osgood

Ok, I think I am on the right track now, but just to be sure: the code
I want is the nutchbase branch in svn, at
http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/
correct?

Thanks,

Eric


On Oct 13, 2009, at 1:38 PM, Andrzej Bialecki wrote:

> See the instructions here:
>
> http://lucene.apache.org/nutch/version_control.html


Re: Incremental Whole Web Crawling

by Andrzej Bialecki

Eric Osgood wrote:
> Ok, I think I am on the right track now, but just to be sure: the code I
> want is the branch section of svn under nutchbase at
> http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ correct?

No, you need the trunk from here:

http://svn.apache.org/repos/asf/lucene/nutch/trunk
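
A minimal sketch of fetching the trunk with command-line svn, using the URL given above. The SVN and TRUNK_URL variables, the checkout_trunk function name, and the destination directory are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: check out the Nutch trunk with the command-line svn client.
# SVN, TRUNK_URL, and checkout_trunk are hypothetical names.
SVN="${SVN:-svn}"
TRUNK_URL="http://svn.apache.org/repos/asf/lucene/nutch/trunk"

# checkout_trunk DEST: check out the trunk into the directory DEST.
checkout_trunk() {
  "$SVN" checkout "$TRUNK_URL" "$1"
}
```

For example, `checkout_trunk nutch-trunk` creates a working copy in a directory named nutch-trunk (a name chosen here for illustration). Nutch of this era builds with Ant, so running `ant` inside that directory should produce a build. Running `svn update` there later pulls in newer trunk changes.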




Re: Incremental Whole Web Crawling

by Eric Osgood

So the trunk contains the most recent nightly update?

On Oct 13, 2009, at 1:50 PM, Andrzej Bialecki wrote:

> No, you need the trunk from here:
>
> http://svn.apache.org/repos/asf/lucene/nutch/trunk


Re: Incremental Whole Web Crawling

by Andrzej Bialecki

Eric Osgood wrote:
> So the trunk contains the most recent nightly update?

It's the other way around - the nightly build is created from a snapshot of
the trunk. The trunk is always the most recent.




Re: Incremental Whole Web Crawling

by Eric Osgood

Oh, OK.

You learn something new every day! I didn't know that the trunk was the
most recent build. Good to know! So the current trunk does have a fix
for the generator bug?


On Oct 13, 2009, at 2:05 PM, Andrzej Bialecki wrote:

> It's the other way around - nightly build is created from a snapshot
> of the trunk. The trunk is always the most recent.


Re: Incremental Whole Web Crawling

by Julien Nioche-4

FYI : there is an implementation of such a modified Generator in
http://issues.apache.org/jira/browse/NUTCH-762

Julien

2009/10/5 Andrzej Bialecki <ab@...>

> This could be done much simpler with a modified Generator that outputs
> multiple segments from one job, but it's not implemented yet.

Re: Incremental Whole Web Crawling

by Jesse Hires

Julien,
I tried to apply your patch because I was curious.

$ patch < NUTCH-762-MultiGenerator.patch

but this seems to drop the two Java files into the root directory instead of
src/java/org/apache/nutch/crawl/URLPartitioner.java
src/java/org/apache/nutch/crawl/MultiGenerator.java

But if I copy the files to those locations, I get compile errors.
I'm up to date on the svn trunk.
Did I miss a step?


Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <lists.digitalpebble@...> wrote:

> FYI : there is an implementation of such a modified Generator in
> http://issues.apache.org/jira/browse/NUTCH-762

Re: Incremental Whole Web Crawling

by Jesse Hires

My apologies, I missed a patch option :-P
I must need more coffee.
Jesse
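
The post does not say which option was missed. A plausible candidate, stated here as an assumption, is -p0: with no -p option at all, patch ignores the directory part of the file names in the patch and works on bare file names in the current directory, which matches the symptom described earlier. The PATCH variable and the apply_from_root function name below are hypothetical.

```shell
#!/bin/sh
# Sketch only: apply a patch while keeping the full paths it names.
# Assumes the patch was made relative to the source root and that this
# runs from that root. PATCH and apply_from_root are hypothetical names.
PATCH="${PATCH:-patch}"

# apply_from_root PATCHFILE: -p0 strips no leading path components.
apply_from_root() {
  "$PATCH" -p0 < "$1"
}
```

Run from the top of the svn checkout, `apply_from_root NUTCH-762-MultiGenerator.patch` would then be expected to place the two files under src/java/org/apache/nutch/crawl/.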

On Tue, Nov 3, 2009 at 8:08 PM, Jesse Hires <jhires@...> wrote:

> I tried to apply your patch because I was curious.
> $ patch < NUTCH-762-MultiGenerator.patch