Nutch Topical / Focused Crawl

Nutch Topical / Focused Crawl

by MyD

Hi @ all,

I'd like to turn Nutch into a focused / topical crawler. I have started analyzing the code and think I have found the right piece of code, but I'd like to know whether I am on the right track. I think the right place to implement the decision whether to fetch further is the output method of the Fetcher class, at each point where it calls the collect method of the OutputCollector object.

private ParseStatus output(Text key, CrawlDatum datum, Content content, ProtocolStatus pstatus, int status) {
...
output.collect(...);
...
}

Could you let me know the best way to turn this decision into a plugin? I was thinking of taking an approach similar to the scoring filters. Thanks in advance.

Cheers,
MyD
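
To make the idea in the message above concrete, here is a minimal sketch of what a pluggable "fetch further?" decision could look like. It is plain Java and does not use the Nutch plugin API: the names TopicFilter, KeywordTopicFilter and shouldFollowOutlinks are hypothetical and chosen only for illustration, and the keyword-counting rule is a stand-in for whatever topic classifier one would actually use. The assumption is that such a check would be consulted at the point where the fetcher collects its output, in the same spirit as a chain of scoring filters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class TopicalFilterSketch {

    /** Hypothetical extension point (NOT part of Nutch): decides whether a page is on topic. */
    interface TopicFilter {
        boolean isOnTopic(String url, String pageText);
    }

    /** Toy implementation: a page is on topic if it mentions at least minHits of the keywords. */
    static final class KeywordTopicFilter implements TopicFilter {
        private final List<String> keywords;
        private final int minHits;

        KeywordTopicFilter(List<String> keywords, int minHits) {
            this.keywords = new ArrayList<>(keywords);
            this.minHits = minHits;
        }

        @Override
        public boolean isOnTopic(String url, String pageText) {
            String text = pageText.toLowerCase(Locale.ROOT);
            int hits = 0;
            for (String keyword : keywords) {
                if (text.contains(keyword.toLowerCase(Locale.ROOT))) {
                    hits++;
                }
            }
            return hits >= minHits;
        }
    }

    /** Runs every registered filter; outlinks are followed only if all filters accept the page. */
    static boolean shouldFollowOutlinks(List<TopicFilter> filters, String url, String pageText) {
        for (TopicFilter filter : filters) {
            if (!filter.isOnTopic(url, pageText)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<TopicFilter> filters = new ArrayList<>();
        filters.add(new KeywordTopicFilter(Arrays.asList("nutch", "crawler"), 2));

        // Hypothetical pages; in a real crawler the text would come from the parsed content.
        System.out.println(shouldFollowOutlinks(filters,
                "http://example.org/a", "Apache Nutch is an extensible web crawler."));
        System.out.println(shouldFollowOutlinks(filters,
                "http://example.org/b", "A recipe for apple pie."));
    }
}
```

In this sketch the decision is a simple yes/no per page. A real plugin might instead adjust the score of the outlinks, so that off-topic pages are fetched later rather than dropped outright.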

Re: Nutch Topical / Focused Crawl

by MyD

I just found an interesting thesis that explains how to modify Nutch into a focused / topical crawler. It helped me a lot and may be useful to others:

http://wing.comp.nus.edu.sg/publications/theses/2009/markusHaenseThesis.pdf

