Distributed search, is there a better method?

I was curious whether anyone has a better method for distributed search than what I am doing now. I am using one namenode and two datanodes. The namenode also acts as the Tomcat server, and the datanodes are the distributed search nodes.

First I run these steps:

1. generate a segment (-topN 1000)
2. fetch
3. updatedb
4. merge the segments into one
5. invertlinks

Then I do the following:

1. Get the stats of the crawldb, in order to get the number of URLs.
2. Run mergesegs -slice (1/2 the number of URLs).
3. Index segment 1 and copyToLocal the new index to datanode 1.
4. Index segment 2 and copyToLocal this new index to datanode 2.
5. Restart the Nutch servers on the datanodes.

It seems to work fine, though I admit I've not gotten beyond about 30k URLs fetched, with about 100k URLs still unfetched.

I tried using the -slice option on the initial merge, but I found that on occasion there was no parse data in one segment, or I got an unexpected number of segments. I'm guessing this is because updatedb needs to be run before I can get an accurate number of URLs to do the math to get the same number of segments as search servers.

One problem I've run into so far is that the generate command takes longer with each iteration. The only item that really seems to grow out of control is the number of unfetched URLs, which is expected with such a small sample of web pages, but it doesn't make sense to me why it would take so long to generate a list of 1000 URLs to fetch out of a list of 100k. Those are small numbers in terms of databases and computing in general.

The next hangup I run into is mergesegs and mergesegs -slice. Both of these steps take dramatically longer once the crawl reaches about 100k URLs. Is this expected or common?

Has anyone come up with a better way to go through the steps to get multiple unique indexes to reside on the individual search server nodes? This is purely academic for me, so I lose no time by changing my approach. I am also purposely using low-power hardware.

Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
             // Guaranteed to be random
} // xkcd.com
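
A note on the slice arithmetic described in the post: the Python sketch below is not from the original message. It computes a slice size for a given URL count and number of search servers. It assumes that `mergesegs -slice N` starts a new output segment every N URLs, which has not been verified here, and the function names are hypothetical. Under that assumption, rounding the size down on an odd count would leave a small leftover segment, which is one possible (unconfirmed) cause of an unexpected number of segments.

```python
def slice_size(url_count: int, num_servers: int) -> int:
    """Smallest slice size that yields at most num_servers slices.

    Hypothetical helper: url_count is whatever count you take from the
    crawldb or segment statistics; num_servers is the number of search nodes.
    """
    if num_servers < 1:
        raise ValueError("num_servers must be at least 1")
    if url_count < 0:
        raise ValueError("url_count must not be negative")
    # Ceiling division with integers only, so no floating-point rounding.
    return max(1, -(-url_count // num_servers))


def expected_slices(url_count: int, size: int) -> int:
    """Number of slices produced, ASSUMING one new slice every `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return -(-url_count // size)


if __name__ == "__main__":
    urls, servers = 30001, 2
    rounded_up = slice_size(urls, servers)   # ceiling of urls / servers
    rounded_down = urls // servers           # floor of urls / servers
    print("rounded up:", rounded_up, "->", expected_slices(urls, rounded_up), "slices")
    print("rounded down:", rounded_down, "->", expected_slices(urls, rounded_down), "slices")
```

Rounding up keeps the slice count at or below the number of search servers. Which count to feed in (total URLs, fetched URLs, or the URLs actually present in the merged segment) is left open, since the post does not say.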

Re: Distributed search, is there a better method?

Hi,

Generating from a 100K crawlDB should be quite fast. Have you checked that the IP resolution is turned off? Do you have any special URL filters that could take a lot of time to process?

Generating and merging tend to take more and more time as the crawlDB grows, but this should not be too much of an issue at your scale. Could you dump the stats of your crawlDB and tell us how long the generation step takes?

> One problem I've run into so far is the amount of time the generate command
> increases with each iteration. The only item that really seems to grow out
> of control is the unfetched URLs, which is expected with such a small
> sample of web pages, but it doesn't make sense to me as to why it would
> take so long to generate a list of 1000 urls to fetch out of a list of
> 100k. Those are small numbers in terms of database and computing in
> general.

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
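
One way to collect the numbers asked for in this reply (the crawlDB stats and how long the generate step takes) is a small timing wrapper. The Python sketch below is not from the thread. `timed_run` is a hypothetical helper, and the `crawl/crawldb` and `crawl/segments` paths in the comments are placeholders for wherever the crawl actually lives.

```python
import subprocess
import sys
import time


def timed_run(cmd):
    """Run a command and return (elapsed_seconds, exit_code, stdout)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, result.returncode, result.stdout


if __name__ == "__main__":
    # Harmless stand-in command so the sketch runs anywhere.
    # With Nutch installed, the commands to time would look like
    # (paths are placeholders, adjust to your layout):
    #   ["bin/nutch", "readdb", "crawl/crawldb", "-stats"]
    #   ["bin/nutch", "generate", "crawl/crawldb", "crawl/segments", "-topN", "1000"]
    elapsed, code, out = timed_run([sys.executable, "-c", "print('ok')"])
    print(f"exit={code} elapsed={elapsed:.2f}s")
    print(out, end="")
```

Recording the elapsed time for each generate iteration alongside the crawlDB size makes it easier to see whether the slowdown tracks the number of unfetched URLs.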