manually creating indices to speed up indexing with app-knowledge

This may seem like a strange question, but here it goes anyway.

I'm considering the possibility of constructing, at a low level, the indices for about 20.000 indexed fields (type sInt), if at all possible. (By indices I mean the inverted indices from term to document id, just to be 100% complete.) These indices have to be recreated each night, along with the normal reindex.

Globally it should go something like this (each night):
- Documents (consisting of about 20 stored fields and about 10 stored & indexed fields) are indexed through the normal 'code-path' (SolrJ in my case).
- After all docs are persisted (max 200.000), I want to extract the mapping from 'lucene docid' --> 'stored/indexed product key'. I believe this should work, because after all docs are persisted the internal docids aren't altered, so the relationship between 'lucene docid' --> 'stored/indexed product key' is invariant from that point forward. (Please correct me if I'm wrong.)
- Construct the 20.000 inverted indices at a low enough level that I do not have to go through IndexWriter, if possible. That way I do not need to construct Documents; I only need to construct the native format of the indices themselves.

Ideally this should work on multiple servers, so that the indices can be created in parallel and the index files later simply copied to the index directory of the master.

Basically what it boils down to is that indexing time (a reindex should be done each night) is a big show-stopper at the moment. We've tried and tested all the more standard optimization tricks & techniques, and have built a home-grown shard-like indexing strategy which uses 20 pretty big servers in parallel. The 20.000 indexed fields are still simply killing us.

At the same time, the app has a lot of knowledge of the 20.000 indices:
- All indices consist of prices (ints) between 0 and 10.000.
- Most important: as part of the document construction process, the ordering of each of the 20.000 indices is known for all documents that are processed by the document-construction server in question. (This part is needed, and is already performing at light speed.)

For the sake of argument, say we have 5 document-construction servers. Each server processes 40.000 documents. Each server has 20.000 ordered indices in its own format readily available for the 40.000 documents it's processing. Something like: LinkedHashMap<Integer,Set<Integer>> --> <price,{productids}>

Say we have 20 indexing servers. Each server has to calculate 1.000 indices (totalling the 20.000). We have the 5 doc-construction servers distribute the ordered sub-indices to the correct servers. Each indexing server constructs an index from 5 ordered sub-indices coming from 5 different construction servers. This can be done efficiently using a mergesort (since the sub-indices are already sorted).

All that is missing (oversimplifying here) is going from the ordered indices in application format to the index format of Lucene (substituting the productids with the Lucene docids along the way) and streaming it to disk. I believe this would quite possibly give a really big indexing improvement.

Is my thinking correct in the steps involved?
Do you believe that this would indeed give a big speedup for this specific situation?
Where would I hook into the Solr / Lucene code to construct the native format?

Thanks in advance (and for making it to here)

Geert-Jan
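The merge and docid-substitution steps described above could look roughly like the sketch below. It is only an illustration under assumptions: the class name SubIndexMerge, the method mergeField and the productIdToDocId map are hypothetical, and it stops at an in-memory term --> sorted docid list per field. Writing that out in Lucene's native format is not covered. For brevity it merges through a TreeMap rather than a streaming mergesort over the already-sorted inputs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class SubIndexMerge {

    /**
     * Merges the sub-indices (price -> product ids) that the construction
     * servers produced for ONE field, replacing every product id by its
     * Lucene docid. Result: price -> ascending docids.
     * productIdToDocId is assumed to have been extracted after all documents
     * were persisted (hypothetical input, not a Lucene API).
     */
    static SortedMap<Integer, int[]> mergeField(List<Map<Integer, Set<Integer>>> subIndices,
                                                Map<Integer, Integer> productIdToDocId) {
        SortedMap<Integer, SortedSet<Integer>> merged = new TreeMap<>();
        for (Map<Integer, Set<Integer>> sub : subIndices) {
            for (Map.Entry<Integer, Set<Integer>> entry : sub.entrySet()) {
                SortedSet<Integer> docIds =
                        merged.computeIfAbsent(entry.getKey(), price -> new TreeSet<>());
                for (Integer productId : entry.getValue()) {
                    Integer docId = productIdToDocId.get(productId);
                    if (docId == null) {
                        throw new IllegalStateException("No docid for product " + productId);
                    }
                    docIds.add(docId);
                }
            }
        }
        // Flatten each docid set into a sorted int[] (a posting list).
        SortedMap<Integer, int[]> postings = new TreeMap<>();
        for (Map.Entry<Integer, SortedSet<Integer>> entry : merged.entrySet()) {
            int[] ids = entry.getValue().stream().mapToInt(Integer::intValue).toArray();
            postings.put(entry.getKey(), ids);
        }
        return postings;
    }

    public static void main(String[] args) {
        // Two construction servers, one field; all values are made up.
        Map<Integer, Set<Integer>> serverA = new LinkedHashMap<>();
        serverA.put(100, new LinkedHashSet<>(Arrays.asList(7, 8)));
        Map<Integer, Set<Integer>> serverB = new LinkedHashMap<>();
        serverB.put(50, new LinkedHashSet<>(Arrays.asList(9)));
        serverB.put(100, new LinkedHashSet<>(Arrays.asList(5)));

        Map<Integer, Integer> productIdToDocId = new HashMap<>();
        productIdToDocId.put(7, 2);
        productIdToDocId.put(8, 0);
        productIdToDocId.put(9, 3);
        productIdToDocId.put(5, 1);

        mergeField(Arrays.asList(serverA, serverB), productIdToDocId)
                .forEach((price, docIds) ->
                        System.out.println(price + " -> " + Arrays.toString(docIds)));
    }
}
```

The terms come out in numeric price order here. How that order relates to the term ordering Lucene uses for the encoded sInt values is a separate concern that this sketch does not address.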
Re: manually creating indices to speed up indexing with app-knowledge

Britske,

The place to ask is java-user@lucene if you want to go low-level. Look at the IndexWriter and even the DocumentsWriter classes. I'm not sure how up to date it is, but look at http://lucene.apache.org/java/2_9_0/fileformats.html

You should also try streaming your data directly into Solr; it's the fastest way to index. Info on the Wiki.

Otis

----- Original Message ----
> From: Britske <gbrits@...>
> To: solr-user@...
> Sent: Mon, November 2, 2009 4:40:04 PM
> Subject: manually creating indices to speed up indexing with app-knowledge
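To give a feel for what the file-formats page referenced above deals with, here is a small sketch of two building blocks it describes: the variable-length integer (VInt) encoding, and storing document numbers as differences from the previous one. This is not the complete on-disk postings format (the real files also carry term frequencies, skip data and a term dictionary), and the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;

public class DocDeltaVInt {

    /**
     * Writes a non-negative int as a VInt: seven bits per byte, low-order
     * bits first, high bit set on every byte except the last.
     */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes ascending docids as VInt-encoded gaps between neighbours. */
    static byte[] encodeDocIds(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            if (docId < previous) {
                throw new IllegalArgumentException("docids must be ascending");
            }
            writeVInt(out, docId - previous);
            previous = docId;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Docids 5 and 305 become the gaps 5 and 300.
        for (byte b : encodeDocIds(new int[] {5, 305})) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Small gaps fit in a single byte, which is one reason sorted posting lists compress well.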
Re: manually creating indices to speed up indexing with app-knowledge

Thanks Otis,

The fileformat-info seems almost 100% accurate. The different Writer-classes indeed seem the way to go. I'll post to lucene-user for follow-ups if/when needed.

Geert-Jan