Growing the index : Merging vs incremental



by sprabhu_PN



Currently we crawl every two days, create a new index, and then merge it with the earlier index. For one, this takes too long, as mergesegs seems to take time proportional to the size of both indexes combined. An equally problematic issue is that mergesegs fails a significant portion of the time, and the probability of failure grows with size. The problems exist whether the merge is done within Hadoop or outside it.

Two questions:
(a) Has anybody been able to do a Nutch merge predictably, irrespective of size? Any tips? We are trying to merge data for up to 200K URLs at a time.

(b) How can we do incremental indexing, where we add data from the latest crawl but there is only one index that keeps growing? I saw a lot of older posts regarding incremental indexing but no clear answers.

Thanks in advance for your help.

Shreekanth

Re: Growing the index : Merging vs incremental

by Fadzi Ushewokunze-2


Hi,

We are in a somewhat similar situation, so we would be really happy to hear any suggestions on this.

Incremental crawling doesn't really seem to work for us, because the same URLs are being crawled over and over (on a daily basis!).

Have you tried these settings or similar?

db.fetch.schedule.class = AdaptiveFetchSchedule
db.update.additions.allowed = true
db.ignore.internal.links = false
db.ignore.external.links = true (because we are intranet only)
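
For reference, settings like these would normally be written as property overrides in conf/nutch-site.xml. The fragment below is only a minimal sketch: it assumes a Nutch 1.x layout, and it assumes the short class name above stands for org.apache.nutch.crawl.AdaptiveFetchSchedule. The values simply mirror the list above and are not tuned recommendations.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: overrides for nutch-default.xml (sketch; values mirror the list above) -->
<configuration>
  <property>
    <name>db.fetch.schedule.class</name>
    <!-- Assumption: fully qualified name of the class abbreviated above -->
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <!-- Allow newly discovered URLs to be added to the crawl db on update -->
    <name>db.update.additions.allowed</name>
    <value>true</value>
  </property>
  <property>
    <!-- Keep links that point within the same host -->
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
  <property>
    <!-- Drop links that lead to other hosts (intranet-only crawl) -->
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```

Whether this stops the daily re-fetching also depends on the fetch interval settings in effect, which are not shown here.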



> --
> View this message in context: http://old.nabble.com/Growing-the-index-%3A-Merging-vs-incremental-tp26228341p26228341.html
> Sent from the Nutch - User mailing list archive at Nabble.com.