Using Lucene with large sets of documents in a collection

by Paul Vanderveen

My team is fairly new to eXist, and we’re in the process of porting an application from MarkLogic Server to eXist. We had some interesting experiences that might be useful to others trying to make fragmenting decisions. I’d also like to know whether this is typical or expected behavior.

NOTE: We are running eXist in Tomcat on a Windows Server with the db-connection cacheSize set to 1536 MB and the collection cache set to 48 MB.

The app takes many types of documents, but the largest are very large (up to 125 MB). For various reasons, our app uses a Java app server to split such a document into about 10,000 “flattened” chunks, which are then stored in the database in a single collection. In MarkLogic Server, this process took about 10 minutes to chunk, store, and index. The following is our account of doing the same thing in eXist.

 

First cut in eXist…

We loaded the chunks one at a time into a single collection, indexing each with Lucene as it was stored. The process ran for over 5 hours before memory was exhausted and the database crashed, having saved only around 2,000 fragments. Notably, memory consumption kept climbing as the process continued. We were very discouraged and thought that perhaps eXist wasn’t the right choice for our app.

 

Second cut…

After reading some posts from people doing similar things, we decided to first turn off all full-text indexing for the collection, then load the 10,000 fragments into about 30 sub-collections of 300-400 docs each, and only then add the Lucene index definitions and run the full-text indexing on the whole collection. Loading the data into eXist took less than 10 minutes, and the indexing took about 30 minutes. Although this was much better, the database still consumed dangerously high amounts of memory, and any concurrent use could take it down. Memory stayed high after the process completed, and sometimes subsequent queries could take down the db. This was not acceptable for a production app.
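The split into sub-collections described above can be sketched roughly as follows. This is only an illustration of the partitioning step, not the code Paul's team used: the function name `assign_subcollections`, the `part-NN` naming scheme, and the cap of 350 documents per sub-collection are all assumptions chosen to match the "about 30 sub-collections of 300-400 docs" figure. Storing the documents in eXist is left out.

```python
from math import ceil


def assign_subcollections(doc_names, max_per_collection=350, prefix="part"):
    """Map sub-collection names to the documents they should hold.

    Documents are assigned in order, filling each sub-collection up to
    max_per_collection before starting the next one. The naming scheme
    (prefix-NN) is hypothetical.
    """
    count = ceil(len(doc_names) / max_per_collection)
    width = len(str(count))  # zero-pad so the names sort naturally
    plan = {}
    for i, name in enumerate(doc_names):
        bucket = i // max_per_collection + 1
        plan.setdefault(f"{prefix}-{bucket:0{width}d}", []).append(name)
    return plan


# 10,000 hypothetical chunk names, as in the scenario described above.
names = [f"chunk-{i:05d}.xml" for i in range(10_000)]
plan = assign_subcollections(names)
print(len(plan), "sub-collections, largest holds",
      max(len(docs) for docs in plan.values()), "docs")
```

With a cap of 350, the 10,000 chunks end up in 29 sub-collections, which is in the range the post describes.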

 

Third cut…

We decided to switch from chunking to flattening (to remove hierarchy issues). Instead of chunking into separate docs, we now recombine the flattened chunks into a single XML document that is about 140 MB in size. This was the magic bullet for eXist/Lucene. The big flattened doc loads into eXist, and the Lucene indexes are built, in 3-4 minutes total. Queries are very fast, and memory use never goes above about one third of what is available during the load process. We are once again encouraged.
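A minimal sketch of the recombining step, assuming each flattened chunk is a small well-formed XML file. The wrapper element `chunks`, the `chunk-id` attribute, and the function name `combine_chunks` are made up for illustration; the post does not say how the combined document is actually structured. Each chunk is parsed and written out one at a time, so the full 140 MB document is never held in memory by this script.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def combine_chunks(chunk_paths, out_path, root_tag="chunks", id_attr="chunk-id"):
    """Write all chunk documents as children of one wrapper element.

    root_tag and id_attr are hypothetical names. Note that ElementTree
    rewrites namespace prefixes (ns0, ns1, ...) unless they are registered.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write(f"<{root_tag}>\n")
        for n, path in enumerate(chunk_paths, start=1):
            elem = ET.parse(path).getroot()
            # Tag each former document so it can still be addressed as a node.
            elem.set(id_attr, str(n))
            out.write(ET.tostring(elem, encoding="unicode"))
            out.write("\n")
        out.write(f"</{root_tag}>\n")


# Small demonstration with two throwaway chunk files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(["first section", "second section"], start=1):
        p = Path(tmp) / f"chunk-{i}.xml"
        p.write_text(f"<section><title>{text}</title></section>", encoding="utf-8")
        paths.append(p)
    out_file = Path(tmp) / "combined.xml"
    combine_chunks(paths, out_file)
    combined = ET.parse(out_file).getroot()

print(combined.tag, len(combined), [s.get("chunk-id") for s in combined])
```

Queries that used to fetch a whole chunk document would then select the matching child node of the wrapper element instead.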

 

I want to reiterate that this only became an issue when we added Lucene indexing on large numbers of documents. For us, this approach requires a fair amount of re-architecting: code that previously looked up full documents must now look up nodes within the flattened XML document. However, we’re very optimistic again that this will provide all the full-text and XML-aware searching that we were using in MarkLogic Server. And if we ever need to support MarkLogic as well, this approach should work on that platform too.

 


------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference
_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: Using Lucene with large sets of documents in a collection

by Dmitriy Shabanov

Very interesting.

--
Cheers,

Dmitriy Shabanov





Re: Using Lucene with large sets of documents in a collection

by Dan McCreary

Paul,

Thank you very much for taking the time to post this case study.  I wish more people shared their victories.

I think that there are many, many users who give up on eXist too early because they do not take the time to understand how to optimize the loading and indexing process.

I feel very sorry for the people who abandon eXist and try to use other tools like key-value stores, only to find there is no way to run reports on these systems... :-O

We appreciate your patience and your willingness to have faith in eXist and hope that you contribute to the code base in the future.  The team has been working hard to create a great product and we feel that eXist is really getting better each month.

If you have any specific suggestions on what we can do to make this eXist evaluation and adoption process easier, let us know.  I know that you will be seeing more documentation in the near future on many aspects of initial setup.

Good Luck! - Dan

syntactica.com



Re: Using Lucene with large sets of documents in a collection

by Joe Wicentowski

Paul - A very interesting case study - thank you for posting it!

To me, the difference between cut #1 and cut #2 can be summed up with
the recommendations in Wolfgang's "Performance Tuning" article
http://exist-db.org/tuning.html.  But I don't think that article
explains the performance improvements gained between cuts #2 and #3.

I would be interested if someone familiar with eXist's internal
workings could analyze the difference?  Namely, why was cut #2 marked
by high memory and low stability, whereas cut #3 brought improvements
on both fronts?

Of course, one can only know by looking at Paul's exact data, index
definitions, and queries, but if there are any general lessons we can
deduce from cuts #2 and #3, I'd be interested to know what they are.

Thanks,
Joe




Re: Using Lucene with large sets of documents in a collection

by Robert Koberg

Hi,

Paul, are you saying you are using just one XML document (the 140 MB doc)
in eXist? Is this a read-only situation, or will there be updates?

-Rob

> On Wed, Oct 7, 2009 at 9:51 PM, Paul Vanderveen <paul@...> wrote:
>>
>> Third cut…
>>
>> We decided to switch from chunking to flattening (to remove hierarchy
>> issues).  Instead of chunking into separate docs, we now recombine the
>> flattened chunks into a single XML document that is about 140Mb in size.
>> This was the magic bullet for eXist/Lucene.  The big flattened doc loads the
>> XML into eXist and builds the lucene indexes in 3-4 minutes total.  Queries
>> are very fast, and the memory never goes above about 1/3 of availability
>> during the load process.  We are once again encouraged.
>>
>> I want to re-iterate that this only became an issue when we added the Lucene
>> indexing on large numbers of documents.  This approach for us requires a
>> fair amount of re-architecting of our application to change what we had
>> looking for full documents to instead look for nodes in the flattened XML
>> document.   However, we’re very optimistic again that this will provide all
>> the full-text and XML aware searching that we were using in ML Server.  And
>> if we ever need to also support ML this approach should work with that
>> platform as well.




Re: Using Lucene with large sets of documents in a collection

by Paul Vanderveen

Rob,

Currently we have only two of these large documents installed, but we're in
the very early stages of the porting process.  In a production environment
we expect to have many more of them, perhaps 20 or more, as well as many
smaller documents.

Paul

-----Original Message-----
From: Robert Koberg [mailto:rob.koberg@...]
Sent: Thursday, October 08, 2009 11:18 AM
To: paul@...
Cc: exist-open
Subject: Re: [Exist-open] Using Lucene with large sets of documents in a
collection

Hi,

Paul, are you saying you are using just one XML document (140mb doc)  
in exist? Is this a read only situation or will there be updates?

-Rob




Re: Using Lucene with large sets of documents in a collection

by VanP


Dan McCreary wrote:
> If you have any specific suggestions on what we can do to make this eXist
> evaluation and adoption process easier, let us know.

One question I still have is why the indexing takes so much memory, and whether there is any way to limit it. On a small doc (< 10 MB), the indexing takes a few seconds and doesn't bog down the machine. On a large doc (the 100+ MB variety), the indexing does complete in a few minutes, but it takes most of the memory and virtually brings the machine to a standstill in the process. In a production environment this is unacceptable: you pretty much have to say that no one can use the system while a big document is reloading, and this happens at least weekly.

Is it possible to limit the memory or lower the priority of the indexing process? Also, from observing the indexer's behavior, it seems possible that the code is trying to parallelize this operation. If that's true, it may be doing more harm than good, since on a large document you will run out of resources.

Or do we just need bigger 64-bit computers with lots of processors and available memory?

Re: Using Lucene with large sets of documents in a collection

by Wolfgang Meier

> On a large doc (the 100+ mb
> variety), the indexing does complete in a few minutes, but it takes most of
> the memory and virtually brings the machine to a standstill in the process.

While storing a document, the new index data is buffered as long as
there's enough memory available. Usually the indexer will not write the
data to disk until it hits the memory limit. This has some advantages
performance-wise.

I understand your argument and I agree the current behavior may not
always be desirable. We could indeed add another configuration option to
define the max amount of memory to be used for internal caching.
Changing the priority could also be possible.
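
To make the trade-off concrete, here is a toy sketch of an indexer that buffers new index data in memory and writes it out once a configurable cap is reached. It is not eXist's implementation: the class `BufferedIndexer`, the `max_buffered_postings` setting, and the in-memory list standing in for disk are all hypothetical, and the cap counts postings rather than bytes to keep the example short.

```python
from collections import defaultdict


class BufferedIndexer:
    """Toy inverted index that buffers postings and flushes at a cap."""

    def __init__(self, max_buffered_postings):
        self.max_buffered_postings = max_buffered_postings
        self.buffer = defaultdict(list)   # term -> list of node ids
        self.buffered = 0                 # postings currently held in memory
        self.flushed_segments = []        # stands in for data written to disk

    def add(self, node_id, text):
        for term in text.lower().split():
            self.buffer[term].append(node_id)
            self.buffered += 1
        # A lower cap means less memory but more (slower) writes.
        if self.buffered >= self.max_buffered_postings:
            self.flush()

    def flush(self):
        if self.buffered:
            self.flushed_segments.append(dict(self.buffer))
            self.buffer.clear()
            self.buffered = 0


indexer = BufferedIndexer(max_buffered_postings=5)
texts = ["lucene index memory", "index data buffered", "flush to disk"]
for node_id, text in enumerate(texts):
    indexer.add(node_id, text)
indexer.flush()  # write whatever is left at the end of the document
print(len(indexer.flushed_segments), "segments written")
```

Without a cap, everything stays in the buffer until the end of the document, which is fast but matches the memory behavior reported above for very large documents.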

> Also, it seems to me from
> observing the indexer behavior that it's possible that the code is trying to
> parallelize this operation.  If that's true, it may be doing more harm than
> good as in a large document you will run out of resources.  

The indexing is not parallelized. All operations are running within the
same thread.

Wolfgang
