Incremental Whole Web Crawling
My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can crawl them in increments of 100K? For example, crawl 100K 16 times for the TLDs, then crawl the links generated from the TLDs in increments of 100K?

Thanks,
EO
|
|
Re: Incremental Whole Web Crawling
Eric wrote:
> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's
> then crawl the links generated from the TLD's in increments of 100K?

Yes. Make sure that you have the "generate.update.db" property set to true, and then generate 16 segments, each having 100k URLs. After you finish generating them, you can start fetching.

Similarly, you can do the same for the next level, only you will have to generate more segments.

This could be done much more simply with a modified Generator that outputs multiple segments from one job, but it's not implemented yet.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
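As a rough sketch of the incremental plan described above: the script below works out how many fetchlists a seed list needs at a given topN and runs one generate job per fetchlist. It assumes generate.update.db is already set to true and that the crawl lives under crawl/crawldb and crawl/segments (the paths Eric uses in his follow-up). The variable names, the two helper functions, and the dry-run switch RUN are made up for illustration; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: NUTCH, CRAWLDB, SEGMENTS and RUN are illustrative names.
NUTCH=${NUTCH:-bin/nutch}
CRAWLDB=${CRAWLDB:-crawl/crawldb}
SEGMENTS=${SEGMENTS:-crawl/segments}
# RUN=echo prints each command (dry run); set RUN to the empty string to execute.
RUN=${RUN-echo}

# Number of fetchlists needed to cover $1 URLs at $2 URLs each (rounded up).
segments_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Run one generate job per fetchlist; stop at the first failure.
generate_segments() {
    count=$1
    topn=$2
    i=1
    while [ "$i" -le "$count" ]; do
        $RUN "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$topn" || return 1
        i=$((i + 1))
    done
}

count=$(segments_needed 1600000 100000)
generate_segments "$count" 100000
```

Saved under a name of your choice and run from the Nutch directory with RUN set to the empty string, this issues the generate command 16 times for a 1.6M seed list; fetching can start once the jobs have finished.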
|
|
Re: Incremental Whole Web Crawling
Andrzej,

Just to make sure I have this straight: set the generate.update.db property to true, then run

bin/nutch generate crawl/crawldb crawl/segments -topN 100000

16 times?

Thanks,
Eric

On Oct 5, 2009, at 1:27 PM, Andrzej Bialecki wrote:
> Yes. Make sure that you have the "generate.update.db" property set
> to true, and then generate 16 segments each having 100k urls. [...]
|
|
Re: Incremental Whole Web Crawling
Eric wrote:
> Andrzej,
>
> Just to make sure I have this straight, set the generate.update.db
> property to true then
>
> bin/nutch generate crawl/crawldb crawl/segments -topN 100000: 16 times?

Yes. When this property is set to true, each fetchlist will be different, because the records for pages that are already on another fetchlist are temporarily locked. Please note that this lock holds for only 1 week, so you need to fetch all segments within one week of generating them.

You can fetch and updatedb in arbitrary order, so once you have fetched some segments you can run the parsing and updatedb on just those segments, without waiting for all 16 segments to be processed.

--
Best regards,
Andrzej Bialecki
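To illustrate the per-segment processing described above, here is a minimal sketch that fetches, parses, and runs updatedb for each segment it is given, in whatever order they are passed. The segment directory names are hypothetical placeholders, as are the variable names, the helper functions, and the dry-run switch RUN; by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: NUTCH, CRAWLDB and RUN are illustrative names.
NUTCH=${NUTCH:-bin/nutch}
CRAWLDB=${CRAWLDB:-crawl/crawldb}
# RUN=echo prints each command (dry run); set RUN to the empty string to execute.
RUN=${RUN-echo}

# Fetch, parse and fold one segment back into the crawldb.
process_segment() {
    segment=$1
    $RUN "$NUTCH" fetch "$segment" || return 1
    # Separate parse step; skip it if your fetcher is configured to parse while fetching.
    $RUN "$NUTCH" parse "$segment" || return 1
    $RUN "$NUTCH" updatedb "$CRAWLDB" "$segment" || return 1
}

# Segments can be handled in any order; this uses the order of the arguments.
process_segments() {
    for s in "$@"; do
        process_segment "$s" || return 1
    done
}

# Hypothetical segment directories, for illustration only.
process_segments crawl/segments/20091005120000 crawl/segments/20091005130000
```

Because the lock described above lasts one week, every generated segment should go through this loop within a week of its generate job.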
|
|
Re: Incremental Whole Web Crawling
Hey Andrzej,

Can you tell me where to set this property (generate.update.db)? I am trying to run a crawl scenario similar to the one Eric is running.

-Gaurang

2009/10/5 Andrzej Bialecki <ab@...>
> Yes. When this property is set to true, then each fetchlist will be
> different, because the records for those pages that are already on another
> fetchlist will be temporarily locked. [...]
|
|
Re: Incremental Whole Web Crawling
Hey,

Never mind. I found *generate.update.db* in *nutch-default.xml* and set it to true.

Regards,
Gaurang

2009/10/5 Gaurang Patel <gaurangtpatel@...>
> Can you tell me where to set this property (generate.update.db)? [...]
|
|
Re: Incremental Whole Web Crawling
Don't change options in nutch-default.xml. Copy the option into nutch-site.xml and change it there. That way the change will (hopefully) survive an upgrade.

On Tue, Oct 6, 2009 at 1:01 AM, Gaurang Patel <gaurangtpatel@...> wrote:
> Never mind. I got *generate.update.db* in *nutch-default.xml* and set it
> true.

--
http://www.linkedin.com/in/paultomblin
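For reference, an override in nutch-site.xml would look roughly like the fragment below. This is a sketch that assumes the property layout used in nutch-default.xml; any other properties you already have in nutch-site.xml would sit alongside this one inside the same configuration element.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the value from nutch-default.xml. -->
  <property>
    <name>generate.update.db</name>
    <value>true</value>
  </property>
</configuration>
```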
|
|
Re: Incremental Whole Web Crawling
> This could be done much simpler with a modified Generator that outputs
> multiple segments from one job, but it's not implemented yet.

This would also be more efficient, as crawlDB operations such as generate or update take more time as the crawlDB grows (unlike fetch and parse, which are proportional to the size of the fetchlist). When the crawlDB holds billions of URLs, the fetching and parsing take relatively little time by comparison. generate.update.db requires reading and writing the whole crawlDB every time, but I suppose that would be fine for a small crawlDB.

J.

--
DigitalPebble Ltd
http://www.digitalpebble.com
|
|
Re: Incremental Whole Web Crawling
When I set generate.update.db to true and then run generate, it only runs twice: it generates 100K for the 1st gen, 62.5K for the second gen, and 0 for the 3rd gen on a seed list of 1.6M. I don't understand this. For a topN of 100K it should run 16 times and create 16 distinct lists, if I am not mistaken.

Eric

On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote:
> Never mind. I got *generate.update.db* in *nutch-default.xml* and
> set it true.

Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosgood@..., eric@...
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com
|
|
Re: Incremental Whole Web Crawling
Eric Osgood wrote:
> When I set generate.update.db to true and then run generate, it only
> runs twice and generates 100K for the 1st gen, 62.5K for the second gen
> and 0 for the 3rd gen on a seed list of 1.6M. I don't understand this,
> for a topN of 100K it should run 16 times and create 16 distinct lists
> if I am not mistaken.

There was a bug in this code that I fixed recently. Please get a new nightly build and try it again.

--
Best regards,
Andrzej Bialecki
|
|
Re: Incremental Whole Web Crawling
Andrzej,

Where do I get the nightly builds from? I tried to use the Eclipse plugin that supports svn, to no avail. Is there an FTP or HTTP server where I can download the fresh Nutch source?

Thanks,
Eric

On Oct 11, 2009, at 12:40 PM, Andrzej Bialecki wrote:
> There was a bug in this code that I fixed recently - please get a
> new nightly build and try it again.

Eric Osgood
|
|
Re: Incremental Whole Web Crawling
Eric Osgood wrote:
> Andrzej,
>
> Where do I get the nightly builds from? I tried to use the eclipse
> plugin that supports svn to no avail. Is there a ftp, http server where
> I can download the nutch source fresh?

Personally I prefer to use command-line svn, even though I do development in Eclipse. I'm probably old-fashioned, but I always want to be very clear on what's going on when I do an update.

See the instructions here:

http://lucene.apache.org/nutch/version_control.html

--
Best regards,
Andrzej Bialecki
|
|
Re: Incremental Whole Web Crawling
Ok, I think I am on the right track now, but just to be sure: the code I want is in the branches section of svn, under nutchbase, at http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ correct?

Thanks,
Eric

On Oct 13, 2009, at 1:38 PM, Andrzej Bialecki wrote:
> See the instructions here:
>
> http://lucene.apache.org/nutch/version_control.html

Eric Osgood
|
|
Re: Incremental Whole Web Crawling
Eric Osgood wrote:
> Ok, I think I am on the right track now, but just to be sure: the code I
> want is the branch section of svn under nutchbase at
> http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ correct?

No, you need the trunk from here:

http://svn.apache.org/repos/asf/lucene/nutch/trunk

--
Best regards,
Andrzej Bialecki
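For anyone following along with command-line svn, getting the trunk could look like the sketch below. It uses the trunk URL given above and assumes that URL is reachable from your machine; the local directory name nutch-trunk, the variable and function names, and the dry-run switch RUN are invented for illustration. By default the script only prints the command.

```shell
#!/bin/sh
# Sketch only: WORKDIR and RUN are illustrative names.
TRUNK_URL="http://svn.apache.org/repos/asf/lucene/nutch/trunk"
WORKDIR=${WORKDIR:-nutch-trunk}   # hypothetical local directory name
# RUN=echo prints each command (dry run); set RUN to the empty string to execute.
RUN=${RUN-echo}

# First-time checkout of the trunk into WORKDIR.
checkout_trunk() {
    $RUN svn checkout "$TRUNK_URL" "$WORKDIR"
}

# Later on, pull the latest trunk changes into the existing working copy.
update_trunk() {
    $RUN svn update "$WORKDIR"
}

checkout_trunk
```

After the initial checkout, update_trunk is the step to repeat whenever you want the most recent code.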
|
|
Re: Incremental Whole Web Crawling
So the trunk contains the most recent nightly update?

On Oct 13, 2009, at 1:50 PM, Andrzej Bialecki wrote:
> No, you need the trunk from here:
>
> http://svn.apache.org/repos/asf/lucene/nutch/trunk

Eric Osgood
|
|
Re: Incremental Whole Web Crawling
Eric Osgood wrote:
> So the trunk contains the most recent nightly update?

It's the other way around: the nightly build is created from a snapshot of the trunk. The trunk is always the most recent.

--
Best regards,
Andrzej Bialecki
|
|
Re: Incremental Whole Web Crawling
Oh, OK.

You learn something new every day! I didn't know that the trunk was the most recent build. Good to know! So does the current trunk have the fix for the generator bug?

On Oct 13, 2009, at 2:05 PM, Andrzej Bialecki wrote:
> It's the other way around - nightly build is created from a snapshot
> of the trunk. The trunk is always the most recent.

Eric Osgood
|
|
Re: Incremental Whole Web Crawling
FYI: there is an implementation of such a modified Generator in http://issues.apache.org/jira/browse/NUTCH-762

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/10/5 Andrzej Bialecki <ab@...>
> This could be done much simpler with a modified Generator that outputs
> multiple segments from one job, but it's not implemented yet.
|
|
Re: Incremental Whole Web Crawling
Julien,

I tried to apply your patch because I was curious.

$ patch < NUTCH-762-MultiGenerator.patch

but this seems to drop the two Java files into the root directory instead of

src/java/org/apache/nutch/crawl/URLPartitioner.java
src/java/org/apache/nutch/crawl/MultiGenerator.java

But if I copy the files to those locations, I get compile errors. I'm up to date on the svn trunk. Did I miss a step?

Jesse

int GetRandomNumber()
{
    return 4; // Chosen by fair roll of dice
              // Guaranteed to be random
} // xkcd.com

On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <lists.digitalpebble@...> wrote:
> FYI : there is an implementation of such a modified Generator in
> http://issues.apache.org/jira/browse/NUTCH-762
|
|
Re: Incremental Whole Web Crawling
My apologies. I missed a patch option :-P
Must need more coffee.

Jesse

On Tue, Nov 3, 2009 at 8:08 PM, Jesse Hires <jhires@...> wrote:
> Julien,
> I tried to apply your patch because I was curious.
> $ patch < NUTCH-762-MultiGenerator.patch
>
> but this seems to drop the two java files into the root directory [...]
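The thread doesn't say which option was missing, so the following is only a guess at a common cause of this symptom: when patch is run without a -p option it uses just the final component of each file name in the diff, so new files land in the current directory. Running it from the top of the checkout with -p0 keeps the recorded paths intact. The patch file name is the one from Jesse's message; the variable and function names and the dry-run switch RUN are made up for illustration, and by default the script only prints the command.

```shell
#!/bin/sh
# Sketch only: PATCH_FILE and RUN are illustrative names.
PATCH_FILE=${PATCH_FILE:-NUTCH-762-MultiGenerator.patch}
# RUN=echo prints the command (dry run); set RUN to the empty string to execute.
RUN=${RUN-echo}

# -p0 keeps the full paths recorded in the diff; -i names the patch file.
# Run this from the top-level directory of the source checkout.
apply_patch() {
    $RUN patch -p0 -i "$PATCH_FILE"
}

apply_patch
```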