Targeting Specific Links
Is there a way to inspect the list of links that Nutch finds per page and then at that point choose which links I want to include / exclude? That is the ideal remedy to my problem.

Eric Osgood
Cal Poly - Computer Engineering, Moon Valley Software
eosgood@..., eric@...
www.calpoly.edu/eosgood, www.lakemeadonline.com
Re: Targeting Specific Links
Eric Osgood wrote:
> Is there a way to inspect the list of links that nutch finds per page
> and then at that point choose which links I want to include / exclude?
> that is the ideal remedy to my problem.

Yes, look at ParseOutputFormat; you can make this decision there. There are two standard extension points where you can hook in: URLFilters and ScoringFilters.

Please note that if you use URLFilters to filter out URLs too early, they will be rediscovered again and again. A better, but more complicated, way to handle this is to still include such links but give them a special flag (in metadata) that prevents fetching. This requires that you implement a custom scoring plugin.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
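The simpler of the two options, a yes/no decision on the URL string alone, can be sketched as follows. Nutch's URLFilter extension point is built around a method that returns the URL to keep it or null to drop it; the `UrlStringFilter` interface below is a hypothetical stand-in with that shape so the example runs without Nutch on the classpath. The class names, the keyword rule, and the helper method are assumptions made for illustration, not part of Nutch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a URL-string filter: return the URL to keep it, or null to drop it. */
interface UrlStringFilter {
    String filter(String urlString);
}

/** Keeps only URLs that contain at least one of the given substrings (an illustrative rule). */
final class KeywordUrlFilter implements UrlStringFilter {
    private final List<String> keywords;

    KeywordUrlFilter(List<String> keywords) {
        this.keywords = new ArrayList<>(keywords);
    }

    @Override
    public String filter(String urlString) {
        if (urlString == null) {
            return null;
        }
        for (String keyword : keywords) {
            if (urlString.contains(keyword)) {
                return urlString; // keep
            }
        }
        return null; // drop: the decision sees nothing but the URL string
    }

    /** Runs every outlink through the filter and returns the survivors in their original order. */
    static List<String> keepAccepted(UrlStringFilter filter, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            String result = filter.filter(link);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }
}
```

Because a dropped URL leaves no record behind, it will turn up again the next time a page linking to it is parsed, which is the rediscovery cost mentioned above.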
Re: Targeting Specific Links
Andrzej,

How would I check for a flag during fetch?

Maybe this explanation can shed some light: ideally, I would like to check the list of links for each page while still collecting a total of X links per page. If I find the links I want, I add them to the list up to X; if I don't reach X, I add other links until X is reached. This way, I don't waste crawl time on non-relevant links.

Thanks,
Eric Osgood

On Oct 6, 2009, at 1:04 PM, Andrzej Bialecki wrote:
> Eric Osgood wrote:
>> Is there a way to inspect the list of links that nutch finds per
>> page and then at that point choose which links I want to include /
>> exclude? that is the ideal remedy to my problem.
>
> Yes, look at ParseOutputFormat, you can make this decision there.
> There are two standard extension points where you can hook up -
> URLFilters and ScoringFilters.
>
> Please note that if you use URLFilters to filter out URL-s too early
> then they will be rediscovered again and again. A better method to
> handle this, but also more complicated, is to still include such
> links but give them a special flag (in metadata) that prevents
> fetching. This requires that you implement a custom scoring plugin.
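The "up to X links per page, wanted links first" rule from Eric's message can be sketched independently of Nutch. This is a minimal illustration over plain URL strings; the class name `OutlinkQuota`, the method `select`, and the predicate used to decide what counts as a wanted link are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class OutlinkQuota {
    /**
     * Returns at most {@code limit} links: wanted links first, in page order,
     * then other links in page order until the limit is reached.
     */
    static List<String> select(List<String> outlinks, Predicate<String> wanted, int limit) {
        List<String> selected = new ArrayList<>();
        List<String> others = new ArrayList<>();
        for (String link : outlinks) {
            if (wanted.test(link)) {
                if (selected.size() < limit) {
                    selected.add(link);
                }
            } else {
                others.add(link); // held back as filler
            }
        }
        for (String link : others) {
            if (selected.size() >= limit) {
                break;
            }
            selected.add(link);
        }
        return selected;
    }
}
```

For example, with a limit of 3 and a predicate such as `url -> url.contains("keep")`, a page with two matching and two non-matching links yields both matching links followed by the first non-matching one.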
Re: Targeting Specific Links
Eric Osgood wrote:
> Andrzej,
>
> How would I check for a flag during fetch?

You would check for a flag during generation. Please look at ScoringFilter.generatorSortValue(): that's where you can check for a flag and set the sort value to Float.MIN_VALUE, so that the link will never be selected for fetching.

You would put the flag in the CrawlDatum metadata when ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().

> Maybe this explanation can shed some light:
> Ideally, I would like to check the list of links for each page, but
> still needing a total of X links per page, if I find the links I want, I
> add them to the list up until X, if I don't reach X, I add other links
> until X is reached. This way, I don't waste crawl time on non-relevant
> links.

You can modify the collection of target links passed to distributeScoreToOutlinks(). This way you can affect both which links are stored and what kind of metadata each of them gets.

As I said, you can also use plain URLFilters to filter out unwanted links, but that API gives you much less control because it's a simple yes/no decision that considers only the URL string. The advantage is that it's much easier to implement than a ScoringFilter.

-- 
Best regards,
Andrzej Bialecki
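The write side of this approach, tagging unwanted outlinks instead of removing them, might look roughly like the sketch below. It does not use Nutch types: `LinkRecord` is a hypothetical stand-in for a crawl record with a score and metadata, a plain `Map` plays the role of the targets collection, and the key `example.nofetch` is an invented name. In a real plugin this logic would sit inside distributeScoreToOutlinks() and work on the CrawlDatum metadata.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a crawl record: a score plus free-form metadata. */
final class LinkRecord {
    float score = 1.0f;
    final Map<String, String> metaData = new HashMap<>();
}

final class OutlinkTagger {
    /** Invented metadata key; any name shared by the writing and reading side would do. */
    static final String NO_FETCH_KEY = "example.nofetch";

    /** Flags every target whose URL lacks the wanted substring. No target is removed. */
    static void tag(Map<String, LinkRecord> targets, String wantedSubstring) {
        for (Map.Entry<String, LinkRecord> entry : targets.entrySet()) {
            if (!entry.getKey().contains(wantedSubstring)) {
                entry.getValue().metaData.put(NO_FETCH_KEY, "true");
            }
        }
    }
}
```

Keeping the flagged links in the collection is the point of the design: the link stays known to the crawl, so it is not rediscovered as new on every parse, while the flag leaves the fetch decision to a later step.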
Re: Targeting Specific Links
Andrzej,

Based on what you suggested below, I have begun to write my own scoring plugin. In distributeScoreToOutlinks(), if the link contains the string I'm looking for, I set its score to kept_score and add a flag to the metadata in parseData ("KEEP", "true"). How do I check for this flag in generatorSortValue()? I only see a way to check the score, not a flag.

Thanks,
Eric

On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:
> Eric Osgood wrote:
>> Andrzej,
>> How would I check for a flag during fetch?
>
> You would check for a flag during generation - please check
> ScoringFilter.generatorSortValue(), that's where you can check for a
> flag and set the sort value to Float.MIN_VALUE - this way the link
> will never be selected for fetching.
>
> And you would put the flag in CrawlDatum metadata when
> ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().
>
>> Maybe this explanation can shed some light:
>> Ideally, I would like to check the list of links for each page, but
>> still needing a total of X links per page, if I find the links I
>> want, I add them to the list up until X, if I don't reach X, I add
>> other links until X is reached. This way, I don't waste crawl time
>> on non-relevant links.
>
> You can modify the collection of target links passed to
> distributeScoreToOutlinks() - this way you can affect both which
> links are stored and what kind of metadata each of them gets.
>
> As I said, you can also use just plain URLFilters to filter out
> unwanted links, but that API gives you much less control because
> it's a simple yes/no that considers just URL string. The advantage
> is that it's much easier to implement than a ScoringFilter.

Eric Osgood
Re: Targeting Specific Links
Also,

In the scoring-links plugin, I set the return value of ScoringFilter.generatorSortValue() to Float.MIN_VALUE for all URLs and it still fetched everything. Maybe Float.MIN_VALUE isn't the correct value to set so that a link never gets fetched?

Thanks,
Eric

On Oct 22, 2009, at 1:10 PM, Eric Osgood wrote:
> Andrzej,
>
> Based on what you suggested below, I have begun to write my own
> scoring plugin:
>
> in distributeScoreToOutlinks() if the link contains the string I'm
> looking for, I set its score to kept_score and add a flag to the
> metaData in parseData ("KEEP", "true"). How do I check for this flag
> in generatorSortValue()? I only see a way to check the score, not a
> flag.
>
> Thanks,
>
> Eric

Eric Osgood
Re: Targeting Specific Links
Eric Osgood wrote:
> Andrzej,
>
> Based on what you suggested below, I have begun to write my own scoring
> plugin:

Great!

> in distributeScoreToOutlinks() if the link contains the string I'm
> looking for, I set its score to kept_score and add a flag to the
> metaData in parseData ("KEEP", "true"). How do I check for this flag in
> generatorSortValue()? I only see a way to check the score, not a flag.

The flag should have been automagically added to the target CrawlDatum metadata after you updated your crawldb (see the details in CrawlDbReducer). Then, in generatorSortValue(), you can check for the presence of this flag by using datum.getMetaData().

BTW, you are right: the Generator doesn't treat Float.MIN_VALUE in any special way ... I thought it did. It's easy to add this, though. In Generator.java:161 just add this:

    if (sort == Float.MIN_VALUE) {
      return;
    }

-- 
Best regards,
Andrzej Bialecki
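The read side described here, checking the flag and skipping the link at generation time, can be sketched without Nutch as follows. Everything except the ("KEEP", "true") flag and the Float.MIN_VALUE sentinel is an assumption: a plain `Map<String, String>` stands in for the metadata that datum.getMetaData() returns (a Hadoop MapWritable in Nutch), and `sortValue` and `selectForFetch` are hypothetical stand-ins for generatorSortValue() and the Generator's selection loop. One detail worth knowing: in Java, Float.MIN_VALUE is the smallest positive float (about 1.4e-45), not the most negative one, so it only acts as a "never fetch" marker if the selecting code tests for it explicitly, as the suggested patch does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class GeneratorSkipSketch {
    /** Flag key taken from the thread: kept links carry ("KEEP", "true"). */
    static final String KEEP_KEY = "KEEP";

    /** Sentinel from the thread. In Java this is the smallest positive float. */
    static final float SKIP = Float.MIN_VALUE;

    /** Stand-in for generatorSortValue(): links without the flag get the sentinel. */
    static float sortValue(Map<String, String> metaData, float initSort) {
        return "true".equals(metaData.get(KEEP_KEY)) ? initSort : SKIP;
    }

    /** Stand-in for the selection step, including the proposed equality check. */
    static List<String> selectForFetch(Map<String, Map<String, String>> crawlDb) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : crawlDb.entrySet()) {
            float sort = sortValue(entry.getValue(), 1.0f);
            if (sort == SKIP) {
                continue; // plays the role of the proposed "return" in the Generator
            }
            selected.add(entry.getKey());
        }
        return selected;
    }
}
```

Without the explicit equality check, the sentinel is just a very small positive sort value, which is consistent with Eric's observation that everything was still fetched.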