Distributed search, is there a better method?

by Jesse Hires

I was curious if anyone has a better method for distributed search than
what I am doing now.
I have one namenode and two datanodes. The namenode doubles as the Tomcat
server, and the datanodes serve as the distributed search nodes.

First I generate a segment (-topN 1000)
Then fetch
Then updatedb
Then merge segments into one
Then invertlinks

Then I do the following:
I get the stats of the crawldb to find the number of URLs
I run mergesegs -slice (half the number of URLs)

index segment 1 and copytolocal the new index to datanode 1
index segment 2 and copytolocal this new index to datanode 2

then restart nutch servers on the datanodes.
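
The slice arithmetic in the steps above can be sketched as follows. This is only a minimal illustration in Python, not part of the workflow described here. It assumes the crawldb stats output contains a line of the form "TOTAL urls: N"; the function names and the sample stats text are hypothetical.

```python
import math
import re


def total_urls(stats_text: str) -> int:
    """Extract the URL count from crawldb stats text.

    Assumes a line shaped like 'TOTAL urls: 130000'. This is an assumption
    about the stats output format, not a documented guarantee.
    """
    match = re.search(r"TOTAL urls:\s*(\d+)", stats_text)
    if match is None:
        raise ValueError("no 'TOTAL urls' line found in stats output")
    return int(match.group(1))


def slice_size(url_count: int, search_nodes: int) -> int:
    """URLs per slice so that at most `search_nodes` slices are produced."""
    if search_nodes < 1:
        raise ValueError("need at least one search node")
    # Round up so a remainder does not spill into an extra slice.
    return max(1, math.ceil(url_count / search_nodes))


# Hypothetical stats text, shaped like the assumed format.
sample = "TOTAL urls:\t130000\nstatus 1 (db_unfetched):\t100000\n"
size = slice_size(total_urls(sample), search_nodes=2)
print(size)
```

Rounding up rather than down matters here: with an odd URL count, rounding down would leave one URL over and could yield a third, nearly empty slice.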

It seems to work fine, though I admit I've not gotten beyond about 30k urls
fetched with about 100k urls still unfetched.

I tried using the -slice option on the initial merge, but I found on
occasion there was no parse data in one segment, or I got an unexpected
number of segments. I'm guessing this is because updatedb needs to be run
before I can get an accurate number of URLs to do the math to get the same
number of segments as search servers.
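
A small worked example of why a stale URL count could give an unexpected number of segments. It is a sketch under the assumption that -slice N produces output segments of at most N URLs each; the helper name and the numbers are hypothetical.

```python
import math


def segments_produced(actual_urls: int, slice_size: int) -> int:
    """Number of output segments if each holds at most `slice_size` URLs."""
    if slice_size < 1:
        raise ValueError("slice_size must be positive")
    return math.ceil(actual_urls / slice_size)


# Slice size computed from a count of 1000 URLs for 2 search servers.
stale_slice = math.ceil(1000 / 2)

# If the count was accurate, the merge yields one segment per server.
print(segments_produced(1000, stale_slice))

# If even one more URL is present than was counted, a third segment appears.
print(segments_produced(1001, stale_slice))
```

Under that assumption, any undercount of the URLs being merged produces more segments than there are search servers.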

One problem I've run into so far is that the generate command takes longer
with each iteration. The only item that really seems to grow out
of control is the unfetched URLs, which is expected with such a small sample
of web pages, but it doesn't make sense to me as to why it would take so
long to generate a list of 1000 urls to fetch out of a list of 100k. Those
are small numbers in terms of database and computing in general.

The next hangup I run into is mergesegs and mergesegs -slice. Both of
these steps take dramatically longer once I reach about 100k URLs.


Is this expected or common?
Has anyone come up with a better way to go through the steps to get multiple
unique indexes to reside on the individual search server nodes?

This is purely academic for me, so there really is no time lost on my part
to change up my approach. I am also purposely using low power hardware.

Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com

Re: Distributed search, is there a better method?

by Julien Nioche-4

Hi,

Generating from a 100K crawlDB should be quite fast. Have you checked that
the IP resolution is turned off? Do you have any special URL filters that
could take a lot of time to process? Generating and merging tend to take
more and more time as the crawlDB grows but this should not be too much of
an issue at your scale.

Could you dump the stats of your crawlDB and tell us how long the generation
step takes?
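
One way to measure how long a step takes is to wrap the command in a small timer. This is a minimal Python sketch, not something from the thread. The helper name is hypothetical, and the command shown is a stand-in that should be replaced with the actual generate invocation.

```python
import subprocess
import sys
import time


def time_command(argv: list[str]) -> tuple[int, float]:
    """Run a command and return (exit code, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(argv, check=False)
    return completed.returncode, time.perf_counter() - start


# Stand-in command so the sketch runs anywhere. For a real measurement,
# replace it with the generate invocation used in the crawl, for example
# ["bin/nutch", "generate", "crawl/crawldb", "crawl/segments", "-topN", "1000"]
# (the crawl/ paths are hypothetical).
code, seconds = time_command([sys.executable, "-c", "pass"])
print(f"exit={code} elapsed={seconds:.2f}s")
```

Recording the elapsed time after each iteration, next to the crawldb stats, would show whether the growth tracks the number of unfetched URLs.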



> One problem I've run into so far is the amount of time the generate command
> increases with each iteration. The only item that really seems to grow out
> of control is the unfetched URLs, which is expected with such a small
> sample of web pages, but it doesn't make sense to me as to why it would
> take so long to generate a list of 1000 urls to fetch out of a list of
> 100k. Those are small numbers in terms of database and computing in
> general.

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com