importing a large collection


by hst

Hi,

I'm aware that this is a recurring topic on the ML, but so far I haven't been able to find a solution, so please bear with me.
I have eXist 1.4RC10028 running on an 8-core server machine with 16 GB RAM, 64-bit Linux, JDK 1.6.

My collection is ~10 million documents, roughly 10 GB on disk. I've done first experiments with loading these files into eXist with the command-line client (local mode), and have upped JVM memory to several GB.

Additionally I've configured a Lucene full-text index (defined via "qname") on a small subsection of the documents (I'd say on average 10 indexed nodes per document with maybe 3 tokens each).

Right now it seems that the average load time per document is ~90 ms. Doing that for 10 million documents, I'll be waiting for 900 million milliseconds (900,000 seconds, or 250 hours), which would be about 10 days, for this import.
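
A quick check of that estimate, as a minimal sketch that uses only the two figures quoted above (the variable names are illustrative):

```python
# Back-of-envelope estimate of the total import time.
MS_PER_DOCUMENT = 90          # measured average load time per document
DOCUMENT_COUNT = 10_000_000   # roughly 10 million documents

total_ms = MS_PER_DOCUMENT * DOCUMENT_COUNT
total_hours = total_ms / 1000 / 3600
total_days = total_hours / 24

print(f"{total_ms:,} ms = {total_hours:.0f} hours = {total_days:.1f} days")
# prints "900,000,000 ms = 250 hours = 10.4 days"
```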

My questions:
(1) Are my numbers reasonable, or do you think there might be something wrong with my configuration?
(2) Is there anything that could cut, let's say, an order of magnitude out of that process?

I'll most probably have to do such bulk loading more than once, so sitting there for 10 days every time is not really an option. I'm open to any suggestion.

Cheers,
hst

 

------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: importing a large collection

by Vyacheslav Sedov

Turn off the index, load the data, then turn the index back on.

Pablo Picasso  - "Computers are useless. They can only give you
answers." - http://www.brainyquote.com/quotes/authors/p/pablo_picasso.html



Re: importing a large collection

by Michael Sokolov-2

One suggestion I've heard batted about on the list is to group documents
as nodes in a larger document (so you are loading fewer larger
documents).  I don't know - haven't tried this myself, but I would be
curious if that were the case.  I guess there could possibly be some
overhead creating documents that doesn't exist for mere nodes.  Another
possible culprit is collections - do you have your documents all in the
same folder (collection), in many many different folders, or what?  That
could presumably have some impact as well, since collections are
structurally significant in eXist.
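
To make the grouping idea concrete, here is a minimal sketch that wraps several small XML documents into one container document before loading. It only uses the Python standard library; the `batch_documents` helper, the `batch` wrapper element, and the sample `<record>` documents are all hypothetical, and whether this actually speeds up an eXist import is untested here. Note that the wrapper element changes the document structure, so it only applies if the schema allows it.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Iterator


def batch_documents(xml_strings: Iterable[str], batch_size: int,
                    wrapper_tag: str = "batch") -> Iterator[str]:
    """Yield container documents, each wrapping up to batch_size source documents."""
    container = ET.Element(wrapper_tag)
    for xml in xml_strings:
        # Parse each source document and attach its root as a child node.
        container.append(ET.fromstring(xml))
        if len(container) == batch_size:
            yield ET.tostring(container, encoding="unicode")
            container = ET.Element(wrapper_tag)
    # Emit the final, partially filled container (if any).
    if len(container) > 0:
        yield ET.tostring(container, encoding="unicode")


# Hypothetical sample data: five tiny documents grouped two at a time.
docs = [f"<record id='{i}'><title>Document {i}</title></record>" for i in range(5)]
containers = list(batch_documents(docs, batch_size=2))
print(len(containers))  # 3 containers holding 2, 2 and 1 documents
```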


-Mike


Re: importing a large collection

by Dmitriy Shabanov

On Wed, 2009-11-04 at 17:08 +0300, Vyacheslav Sedov wrote:
> turn off index - load data - turn on index

+ reindex

Can you measure and report the speed with indexes turned off?
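
One way to take such a measurement is a small timing harness around whatever call stores a document. The sketch below is an assumption-laden illustration: `time_store` and `fake_store` are hypothetical names, and `fake_store` merely stands in for a real store operation (it does not talk to eXist).

```python
import time
from typing import Callable, Iterable, List


def time_store(store: Callable[[str], None], documents: Iterable[str]) -> List[float]:
    """Call store(doc) for each document and return per-document times in ms."""
    timings_ms: List[float] = []
    for doc in documents:
        start = time.perf_counter()
        store(doc)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


# Hypothetical stand-in for the real store call: it just keeps the document.
stored: List[str] = []


def fake_store(doc: str) -> None:
    stored.append(doc)


timings = time_store(fake_store, ["<a/>", "<b/>", "<c/>"])
average_ms = sum(timings) / len(timings)
print(f"{len(timings)} documents, average {average_ms:.3f} ms each")
```

Running the same harness once with the index configuration in place and once without it gives the two numbers to compare.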

--
Cheers,

Dmitriy Shabanov



Re: importing a large collection

by Wolfgang Meier-2

> Additionally I've configured a lucene fulltext index (defined via "qname") on a small subsection of the documents (I'd
> say on average 10 indexed nodes per document with maybe 3 tokens each).

So you only have one Lucene index enabled, no default full-text index
or the like?

> Right now it seems that the average load time per document is ~90msec. Doing that for 10mio documents, I'll be waiting
> for 900mio milliseconds, that would be about 10days (right?), for this import.

What is your setting for cacheSize in conf.xml? Most important: do you
see the ~90 ms average load time right from the start, or do you
observe a slowdown over time?
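
A simple way to answer the slowdown question from recorded per-document timings is to compare the average at the start of the run with the average at the end. This is a minimal sketch; `slowdown_ratio`, the window size, and both timing series are made up for illustration.

```python
from statistics import mean
from typing import Sequence


def slowdown_ratio(timings_ms: Sequence[float], window: int) -> float:
    """Mean time of the last `window` documents divided by that of the first `window`."""
    if window <= 0 or len(timings_ms) < 2 * window:
        raise ValueError("need at least 2 * window timings")
    return mean(timings_ms[-window:]) / mean(timings_ms[:window])


# Hypothetical data: a steady run and a run that degrades by 0.1 ms per document.
steady = [90.0] * 1000
degrading = [90.0 + 0.1 * i for i in range(1000)]

steady_ratio = slowdown_ratio(steady, 100)
degrading_ratio = slowdown_ratio(degrading, 100)
print(steady_ratio)               # 1.0, no slowdown
print(round(degrading_ratio, 2))  # about 1.95, the run nearly doubled its per-document time
```

A ratio close to 1 means the load time is flat; a ratio well above 1 suggests the import is degrading as the database grows.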

Wolfgang


Re: importing a large collection

by hst

Dear all,

thanks for the incredibly quick replies!

Here are more details:
@Mike: changing from single documents to a container setup unfortunately isn't really viable; the XML schema we are using is untouchable.

@Wolfgang: you are right, I also (accidentally) have this in my collection.xconf:
 <fulltext default="all" attributes="no" />
My cacheSize is 256M.
I have had it running for a little more than an hour now and have observed no slowdown.

@Dmitriy: importing into a different root collection that has no collection.xconf, average storage time per document (at least at the beginning, I didn't try for very long) increased to ~4 ms! That's very nice.

The question is: will reindexing then take 10 days? :)

Thanks again!



Re: importing a large collection

by hst

Sorry, of course I meant that storage time _DE_creased. %-)

Heiko
--
Heiko Stoermer
University of Trento, Italy
Dept. of Information Science and Engineering (DISI)
http://disi.unitn.it/~stoermer
OKKAM id:
http://www.okkam.org/entity/ok5f23a5ce-a683-4c4d-ae73-b78cdc17aec1




Re: importing a large collection

by Vyacheslav Sedov

I guess that reindexing should take less than 48 hours, probably even
less than 24.

Jonathan Swift  - "May you live every day of your life." -
http://www.brainyquote.com/quotes/authors/j/jonathan_swift.html



Re: importing a large collection

by hst

Dear all,

so I ran the data import as suggested (deactivate index -> load -> reindex). Here are some results:

Number of XML docs: 519233
Disk size of docs: 4566 MB

Loading time without index: 1.5 hrs
Reindex time: 8.5 hrs
Disk size of eXist data dir: 9926 MB
(of which index size ~1500 MB)

=> combined storage + indexing time per document: ~70 ms

Originally measured storage time per document _including_ indexing: 90 ms (see OP)

=> acceleration ~20%

Unfortunately it seems that the multi-step approach (apart from requiring more attention) does not produce the desired improvement of an order of magnitude. This is just to let people know that this approach does not seem worth the effort.
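
The figures above can be rechecked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers reported in this message plus the 90 ms baseline and the ~10 million document target from the original post; the variable names are illustrative.

```python
DOCS = 519_233
LOAD_HOURS = 1.5       # loading time without index
REINDEX_HOURS = 8.5    # reindex time
BASELINE_MS = 90.0     # per-document time with indexing during load (original post)

MS_PER_HOUR = 3600 * 1000

combined_ms = (LOAD_HOURS + REINDEX_HOURS) * MS_PER_HOUR / DOCS
load_only_ms = LOAD_HOURS * MS_PER_HOUR / DOCS
improvement = (BASELINE_MS - combined_ms) / BASELINE_MS
reindex_to_load = REINDEX_HOURS / LOAD_HOURS
days_for_10_million = combined_ms * 10_000_000 / 1000 / 86_400

print(f"combined:  {combined_ms:.1f} ms per document")
print(f"load only: {load_only_ms:.1f} ms per document")
print(f"improvement over baseline: {improvement:.0%}")
print(f"reindex/load ratio: {reindex_to_load:.1f}")
print(f"projected for 10 million documents: {days_for_10_million:.1f} days")
```

This gives roughly 69 ms per document combined, about 10 ms per document for loading alone, an improvement of about 23%, a reindex-to-load ratio of about 5.7, and a projection of about 8 days for 10 million documents, assuming the per-document cost stays constant at that scale.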

Still hoping for a breakthrough idea from the community! :-)

Cheers,
Heiko




Re: importing a large collection

by Evgeny Gazdovsky

My two cents.
I suggest using the "batch transaction" pragma not only for transactions and triggers but for reindexing too: do not reindex after every update/store, but only once, after the transaction under one pragma has finished.
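
The idea of deferring the reindex to the end of a batch can be illustrated with a toy model. This is not eXist code and does not use any eXist API: `ToyStore`, its `batch()` context manager, and the `index_runs` counter are all hypothetical, and the sketch only shows how many index runs each strategy triggers.

```python
from contextlib import contextmanager
from typing import Iterator, List


class ToyStore:
    """Toy document store that counts how often its index is rebuilt."""

    def __init__(self) -> None:
        self.documents: List[str] = []
        self.index_runs = 0
        self._deferred = False

    def _reindex(self) -> None:
        # Stand-in for real index maintenance work.
        self.index_runs += 1

    def store(self, doc: str) -> None:
        self.documents.append(doc)
        if not self._deferred:
            self._reindex()  # eager mode: index after every store

    @contextmanager
    def batch(self) -> Iterator["ToyStore"]:
        """Defer indexing until the whole batch has been stored."""
        self._deferred = True
        try:
            yield self
        finally:
            self._deferred = False
            self._reindex()  # a single index run for the whole batch


sample = ["<a/>", "<b/>", "<c/>"]

eager = ToyStore()
for doc in sample:
    eager.store(doc)

batched = ToyStore()
with batched.batch():
    for doc in sample:
        batched.store(doc)

print(eager.index_runs, batched.index_runs)  # prints "3 1"
```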

------
Evgeny


Re: importing a large collection

by Wolfgang Meier-2

> Unfortunately it seems that the multi-step approach (apart from requiring
> more attention) is not producing the desired improvement of an order of a
> magnitude.

OK, I was a bit skeptical about the storing and reindexing approach as
well. However, it is interesting that the indexing takes nearly 6 times
longer than just storing the data (8.5 hrs vs. 1.5 hrs).

Did you disable the default="all" full text index? If not, please do.
In your original post, you said you created a Lucene index? Is this
the only index? If yes, we may need to look into Lucene optimizations.
There's still some potential in this area. Could you send me your
collection.xconf and maybe an example document, so I can get an
impression of the structure of your indexes?

Also, how many collections do you have?

Wolfgang
