Using Lucene with large sets of documents in a collection

My team is fairly new to eXist, and we’re in the process of porting an application from MarkLogic Server to eXist. We had some interesting experiences that might be useful to others trying to make fragmenting decisions. I’d also like to know if this is typical or expected behavior.

NOTE: We are running eXist in Tomcat on a Windows Server with the db-connection cacheSize set to 1536 MB and the collection cache set to 48 MB.

The app takes many types of documents, but the largest are very large (up to 125 MB). For various reasons our app uses a Java app server to chunk each document into about 10,000 “flattened” chunks, which are then stored in the database in a single collection. In ML Server this process took about 10 minutes to chunk, store, and index. The following is our account of doing the same thing in eXist.

First cut in eXist…

We were loading the “chunks” into a single collection one at a time and indexing them with Lucene as they were stored. After more than 5 hours, and around 2,000 saved fragments, memory was exhausted and the database crashed. Interestingly, memory consumption climbed steadily as the process continued. We were very discouraged and thought that perhaps eXist wasn’t the right choice for our app.

Second cut…

After reading some posts from people doing similar things, we decided to first turn off any full-text indexing for the collection, then load the 10,000 fragments into about 30 sub-collections of 300-400 docs each, then add the Lucene definitions and do the full-text indexing on the whole collection. It took less than 10 minutes to load the data into eXist, and then about 30 minutes to do the indexing. Although this was much better, the database still consumed dangerously high amounts of memory, and any concurrent use could take it down. Memory stayed high after the process completed, and sometimes subsequent queries could take down the db. This was not acceptable to even consider for a production app.

Third cut…

We decided to switch from chunking to flattening (to remove hierarchy issues). Instead of chunking into separate docs, we now recombine the flattened chunks into a single XML document that is about 140 MB in size. This was the magic bullet for eXist/Lucene. Loading the big flattened doc into eXist and building the Lucene indexes takes 3-4 minutes total. Queries are very fast, and memory use never goes above about 1/3 of what is available during the load process. We are once again encouraged.

I want to reiterate that this only became an issue when we added the Lucene indexing on large numbers of documents. For us, this approach requires a fair amount of re-architecting of our application, so that code that used to look for full documents instead looks for nodes in the flattened XML document. However, we’re very optimistic again that this will provide all the full-text and XML-aware searching that we were using in ML Server. And if we ever need to also support ML, this approach should work with that platform as well.

------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA is the only developer event you need to attend this year. Jumpstart your developing skills, take BlackBerry mobile applications to market and stay ahead of the curve. Join us from November 9 - 12, 2009. Register now! http://p.sf.net/sfu/devconference
_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open
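For readers who want to picture the difference between the second and third cuts, here is a minimal Python sketch. It is only an illustration: the original app used a Java app server, and nothing here talks to eXist. The helper names `subcollection_for` and `recombine_chunks`, the `/db/app` base path, the `part-NNN` naming scheme, the `<flattened>` wrapper element, and the `chunk` attribute are all assumptions made for the example and do not come from the post.

```python
import xml.etree.ElementTree as ET


def subcollection_for(index, per_collection=350, base="/db/app"):
    """Second cut: map the index-th chunk to a sub-collection path.

    With 10,000 chunks and 350 per collection this yields 29 sub-collections,
    roughly the "about 30" mentioned in the post. The base path and the
    naming scheme are hypothetical.
    """
    return f"{base}/part-{index // per_collection:03d}"


def recombine_chunks(chunks, root_tag="flattened"):
    """Third cut: wrap already-flattened chunk documents in one root element.

    `chunks` is an iterable of XML strings. Each chunk's root element becomes
    a direct child of the new root, so queries address nodes inside one
    document instead of many separate documents.
    """
    root = ET.Element(root_tag)
    for position, xml_text in enumerate(chunks):
        child = ET.fromstring(xml_text)
        # Hypothetical attribute: remember which chunk a node came from.
        child.set("chunk", str(position))
        root.append(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    paths = {subcollection_for(i) for i in range(10_000)}
    print(len(paths))
    print(recombine_chunks(["<sec><p>a</p></sec>", "<sec><p>b</p></sec>"]))
```

Note that this sketch builds the combined document in memory, which is fine for a demonstration but probably not what you would want for a 140 MB file; a streaming writer would be the more likely choice at that size.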
Re: Using Lucene with large sets of documents in a collection

Very interesting.
-- Cheers, Dmitriy Shabanov

On Wed, 2009-10-07 at 19:51 -0600, Paul Vanderveen wrote:
> My team is fairly new to eXist, and we’re in the process of porting an application from MarkLogic Server to eXist. [...]
Re: Using Lucene with large sets of documents in a collection

Paul - A very interesting case study - thank you for posting it!
To me, the difference between cut #1 and cut #2 can be summed up with the recommendations in Wolfgang's "Performance Tuning" article http://exist-db.org/tuning.html. But I don't think that article explains the performance improvements gained between cuts #2 and #3.

I would be interested if someone familiar with eXist's internal workings could analyze the difference? Namely, why was cut #2 marked by high memory and low stability, whereas cut #3 brought improvements on both fronts? Of course, one can only know by looking at Paul's exact data, index definitions, and queries, but if there are any general lessons we can deduce from cuts #2 and #3, I'd be interested to know what they are.

Thanks,
Joe
Re: Using Lucene with large sets of documents in a collection

Hi,
Paul, are you saying you are using just one XML document (140mb doc) in exist? Is this a read only situation or will there be updates?

-Rob
Re: Using Lucene with large sets of documents in a collection

Rob,
Currently we have only two of these large documents installed, but we're in the very early stages of the porting process. In a production environment we expect to have many more of them. Perhaps 20 or more, as well as many smaller documents.

Paul

-----Original Message-----
From: Robert Koberg [mailto:rob.koberg@...]
Sent: Thursday, October 08, 2009 11:18 AM
To: paul@...
Cc: exist-open
Subject: Re: [Exist-open] Using Lucene with large sets of documents in a collection

Hi, Paul, are you saying you are using just one XML document (140mb doc) in exist? Is this a read only situation or will there be updates?

-Rob
Re: Using Lucene with large sets of documents in a collection

One question I still have is why the indexing takes SO much memory and if there's any way to limit it. On a small doc (< 10 MB), the indexing takes a few seconds and doesn't bog down the machine. On a large doc (the 100+ MB variety), the indexing does complete in a few minutes, but it takes most of the memory and virtually brings the machine to a standstill in the process. In a production environment this is unacceptable: you pretty much have to say that no one can use the system while a big document is reloading, and this happens at least weekly.

Is it possible to limit the memory or lower the priority of the indexing process? Also, from observing the indexer's behavior, it seems possible that the code is trying to parallelize this operation. If that's true, it may be doing more harm than good, as on a large document you will run out of resources. Or do we just need bigger 64-bit computers with lots of processors and available memory?
Re: Using Lucene with large sets of documents in a collection

> On a large doc (the 100+ mb
> variety), the indexing does complete in a few minutes, but it takes most of
> the memory and virtually brings the machine to a standstill in the process.

While storing a document, the new index data is buffered as long as there's enough memory available. Usually the indexer will not write the data to disk until it hits the memory limit. This has some advantages performance-wise.

I understand your argument and I agree the current behavior may not always be desirable. We could indeed add another configuration option to define the max amount of memory to be used for internal caching. Changing the priority could also be possible.

> Also, it seems to me from
> observing the indexer behavior that it's possible that the code is trying to
> parallelize this operation. If that's true, it may be doing more harm than
> good as in a large document you will run out of resources.

The indexing is not parallelized. All operations are running within the same thread.

Wolfgang
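The buffering behaviour Wolfgang describes, and the kind of memory cap he suggests adding, can be pictured with a small Python sketch. This is not eXist's code and does not reflect its internals: the class `BufferedIndexWriter`, its `max_buffer_bytes` parameter, the `flush_to_disk` callback, and the size estimate are all hypothetical, chosen only to illustrate "buffer index data until a limit is reached, then write it out", all within a single thread.

```python
class BufferedIndexWriter:
    """Toy model: hold index entries in memory, flush when a byte cap is hit."""

    def __init__(self, flush_to_disk, max_buffer_bytes):
        self._flush_to_disk = flush_to_disk  # callable that receives a list of entries
        self._max_buffer_bytes = max_buffer_bytes
        self._buffer = []
        self._buffered_bytes = 0

    def add(self, term, node_id):
        """Buffer one (term, node_id) entry and flush if the cap is reached."""
        self._buffer.append((term, node_id))
        # Rough size estimate; a real indexer would account for memory differently.
        self._buffered_bytes += len(term.encode("utf-8")) + 8
        if self._buffered_bytes >= self._max_buffer_bytes:
            self.flush()

    def flush(self):
        """Hand the buffered entries to the writer callback and start a new buffer."""
        if self._buffer:
            self._flush_to_disk(self._buffer)
            self._buffer = []
            self._buffered_bytes = 0


if __name__ == "__main__":
    batches = []
    writer = BufferedIndexWriter(batches.append, max_buffer_bytes=64)
    for i, word in enumerate("the quick brown fox jumps over the lazy dog".split()):
        writer.add(word, i)
    writer.flush()  # write whatever is left at the end of the document
    print([len(batch) for batch in batches])
```

In a model like this, a lower cap means more frequent, smaller writes and a lower memory peak, at the cost of the performance advantage Wolfgang mentions for buffering as much as memory allows.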