How to make Nutch crawl within a subcategory of a URL?

by saravan.krish

Hi,

Can anyone please let me know how to make Nutch crawl only within a
subcategory of a URL?

For example, I want to crawl only the "Computers & Internet" category of
answers.yahoo.com. How do I do that with Nutch?

URL:
http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3?link=list&sid=396545660

--
View this message in context: http://old.nabble.com/How-to-make-nutch-crawl-within-a-sub-category-of-an-URL--tp26160005p26160005.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: How to make Nutch crawl within a subcategory of a URL?

by John Whelan

If it were me, I'd try the following...

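Use 'http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3?link=list&sid=396545660' as the starting (seed) URL, and set up the following filter rules in crawl-urlfilter.txt. The rules are regular expressions, so the literal '?' and '.' characters need a backslash; left unescaped, the '?' is read as a quantifier and the first rule never matches the seed URL. The first matching rule wins, so if your file still has the default rule that skips URLs containing characters such as '?' and '=', put these rules above it or comment it out.

   # accept the category page itself ('?' and '.' escaped to match literally)
   +^http://answers\.yahoo\.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3\?link=list&sid=396545660
   # accept the question pages linked from it
   +^http://answers\.yahoo\.com/question
   # reject everything else
   -.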

This should allow the 'Computers & Internet' page and the questions linked from it to be crawled, without traversing beyond that. To be safe, also limit your crawl depth to 2 or 3 (the -depth option of the crawl command).
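
If you want to check rules like these before starting a crawl, you can replay them outside Nutch. The following is only a rough sketch in Python, not Nutch code: it assumes the filter takes the first rule whose pattern is found anywhere in the URL ('+' accepts, '-' rejects) and drops a URL that no rule matches. The helper name accepts and the qid=EXAMPLE question URL are made up for illustration.

```python
import re

# Rules in file order: (sign, pattern). '?' and '.' are escaped so they match
# literally; an unescaped '?' would be treated as a regex quantifier.
RULES = [
    ("+", r"^http://answers\.yahoo\.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3\?link=list&sid=396545660"),
    ("+", r"^http://answers\.yahoo\.com/question"),
    ("-", r"."),
]

START = ("http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;"
         "_ylv=3?link=list&sid=396545660")


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped


if __name__ == "__main__":
    candidates = [
        START,
        "http://answers.yahoo.com/question/index?qid=EXAMPLE",  # hypothetical question URL
        "http://answers.yahoo.com/dir/index?sid=1",              # some other category page
    ]
    for url in candidates:
        print("accept" if accepts(url) else "reject", url)
```

Running the same check with the first pattern left unescaped rejects the seed URL, which is an easy way to see why the backslashes matter.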