importing a large collection

Hi,

I'm aware that this is a recurring topic on the ML, but so far I haven't been able to find a solution, so please bear with me.

I have eXist 1.4RC10028 running on an 8-core server machine with 16 GB RAM, 64-bit Linux, JDK 1.6.

My collection is ~10 million documents, roughly 10 GB on disk. I've done first experiments with loading these files into eXist with the command-line client (local mode), and have increased JVM memory to several GB.

Additionally, I've configured a Lucene full-text index (defined via "qname") on a small subsection of the documents (I'd say on average 10 indexed nodes per document, with maybe 3 tokens each).

Right now it seems that the average load time per document is ~90 ms. Doing that for 10 million documents, I'll be waiting for 900 million milliseconds, which would be about 10 days (right?) for this import.

My questions:
(1) Are my numbers reasonable, or do you think there might be something wrong with my configuration?
(2) Is there anything I can do to cut, let's say, an order of magnitude out of that process?

I'll most probably have to do such bulk loading more than once, so sitting there for 10 days every time is not really an option. I'm open to any suggestion.

Cheers,
hst

------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now. http://p.sf.net/sfu/bobj-july
_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open
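
The estimate in the post above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes the per-document time stays constant over the whole import (one document at a time, no slowdown).

```python
from datetime import timedelta

def estimate_import_time(num_docs: int, ms_per_doc: float) -> timedelta:
    """Total wall-clock time if every document takes ms_per_doc to load."""
    return timedelta(milliseconds=num_docs * ms_per_doc)

# Figures from the post: ~10 million documents at ~90 ms each.
total = estimate_import_time(10_000_000, 90)
print(total)                            # 10 days, 10:00:00
print(total.total_seconds() / 86_400)   # roughly 10.4 days
```

So "about 10 days" is right, as long as the 90 ms average holds for the full collection.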

Re: importing a large collection

turn off index - load data - turn on index

Re: importing a large collection

One suggestion I've heard batted about on the list is to group documents as nodes in a larger document (so you are loading fewer, larger documents). I don't know - haven't tried this myself, but I would be curious if that were the case. I guess there could possibly be some overhead in creating documents that doesn't exist for mere nodes.

Another possible culprit is collections - do you have your documents all in the same folder (collection), in many, many different folders, or what? That could presumably have some impact as well, since collections are structurally significant in eXist.

-Mike

Re: importing a large collection

On Wed, 2009-11-04 at 17:08 +0300, Vyacheslav Sedov wrote:
> turn off index - load data - turn on index

+ reindex

Can you measure & report the speed with turned off indexes?

--
Cheers,

Dmitriy Shabanov

Re: importing a large collection

> Additionally I've configured a lucene fulltext index (defined via "qname") on a small subsection of the documents (I'd
> say on average 10 indexed nodes per document with maybe 3 tokens each).

So you only have one Lucene index enabled, no default full text index or the like?

> Right now it seems that the average load time per document is ~90msec. Doing that for 10mio documents, I'll be waiting
> for 900mio milliseconds, that would be about 10days (right?), for this import.

What is your setting for cacheSize in conf.xml?

Most important: do you see the ~90msec average load time right from the start, or do you observe a slowdown over time?

Wolfgang

Re: importing a large collection

Dear all,

thanks for the incredibly quick replies! So here are more details:

@Mike: changing from single documents to a container setting unfortunately isn't really viable; the XML schema we are using is untouchable.

@Wolfgang: you are right, I also (accidentally) have this in my collection.xconf:

<fulltext default="all" attributes="no" />

My cacheSize is 256M. I had it running for a little more than an hour now and I observed no slowdown.

@Dmitriy: importing into a different root collection that has no collection.xconf, average storage time per document (at least at the beginning, didn't try for very long) increased to ~4msec! That's very nice. Question is: will reindexing then take 10days? :)

Thanks again!
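
To put the ~4 ms figure into perspective, here is a rough sketch of how much time the reindexing step may take if the overall goal is still an order of magnitude over the original 90 ms. The names are made up for illustration, and the calculation assumes that the 4 ms rate (only observed at the start of the import) holds for all 10 million documents.

```python
NUM_DOCS = 10_000_000
MS_WITH_INDEX = 90       # per document, indexes active during load
MS_WITHOUT_INDEX = 4     # per document, no collection.xconf (early measurement)

def hours(num_docs: int, ms_per_doc: float) -> float:
    """Convert a per-document cost in milliseconds into total hours."""
    return num_docs * ms_per_doc / 3_600_000

store_hours = hours(NUM_DOCS, MS_WITHOUT_INDEX)

# An order-of-magnitude improvement means 9 ms per document overall,
# which leaves 5 ms per document for the later reindex.
target_ms = MS_WITH_INDEX / 10
reindex_budget_ms = target_ms - MS_WITHOUT_INDEX
reindex_budget_hours = hours(NUM_DOCS, reindex_budget_ms)

print(f"plain storing:  {store_hours:.1f} h")           # about 11.1 h
print(f"reindex budget: {reindex_budget_hours:.1f} h")  # about 13.9 h
```

In other words, storing alone would take roughly half a day, and the reindex would have to finish in about 14 hours for the two-step approach to reach the hoped-for factor of ten.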

Re: importing a large collection

Sorry, of course I meant that storage time _DE_creased. %-)

Heiko

--
Heiko Stoermer
University of Trento, Italy
Dept. of Information Science and Engineering (DISI)
http://disi.unitn.it/~stoermer
OKKAM id: http://www.okkam.org/entity/ok5f23a5ce-a683-4c4d-ae73-b78cdc17aec1

Re: importing a large collection

i guess that reindexing should take less than 48 hours, probably even less than 24

Re: importing a large collection

Dear all,

so I ran the data import as suggested (deactivate index -> load -> reindex). Here are some results:

Number of XML docs: 519233
Disk size of docs: 4566MB
Loading time without index: 1.5hrs
Reindex time: 8.5hrs
Disk size of eXist data dir: 9926MB (of which index size ~ 1500MB)

=> combined storage+indexing time per document ~ 70msec
originally measured storage time per document _including_ indexing: 90msec (see OP)
=> acceleration ~ 20%

Unfortunately it seems that the multi-step approach (apart from requiring more attention) is not producing the desired improvement of an order of magnitude. This is just to let people know that this approach seems not worth the effort.

Still hoping for a breakthrough idea from the community! :-)

Cheers,
Heiko
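
The figures reported above can be reproduced with a short Python sketch. The variable and function names are only illustrative; the inputs are the numbers from the post, and the extrapolation to 10 million documents assumes the per-document cost stays the same at that scale.

```python
NUM_DOCS = 519_233
LOAD_HOURS = 1.5
REINDEX_HOURS = 8.5
BASELINE_MS = 90          # per document with indexing during load (see OP)
MS_PER_HOUR = 3_600_000

def ms_per_doc(hours: float, num_docs: int = NUM_DOCS) -> float:
    """Average per-document cost in milliseconds for one phase."""
    return hours * MS_PER_HOUR / num_docs

load_ms = ms_per_doc(LOAD_HOURS)
reindex_ms = ms_per_doc(REINDEX_HOURS)
combined_ms = load_ms + reindex_ms
gain = 1 - combined_ms / BASELINE_MS
days_for_10m = combined_ms * 10_000_000 / MS_PER_HOUR / 24

print(f"load:     {load_ms:.1f} ms/doc")        # about 10.4
print(f"reindex:  {reindex_ms:.1f} ms/doc")     # about 58.9
print(f"combined: {combined_ms:.1f} ms/doc")    # about 69.3
print(f"gain over 90 ms: {gain:.0%}")           # about 23%
print(f"reindex/load ratio: {REINDEX_HOURS / LOAD_HOURS:.1f}")  # about 5.7
print(f"10 million docs: {days_for_10m:.1f} days")              # about 8.0
```

This matches the ~70 ms and ~20% quoted in the post. It also shows that reindexing accounts for most of the time, and that the full collection would still need roughly 8 days at this rate.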

Re: importing a large collection

My two cents.

I suggest using the "batch transaction" pragma not only for transactions and triggers, but for reindexing too: reindex not after every update/store, but only once, after the transaction under one pragma has finished.

------
Evgeny

Re: importing a large collection

> Unfortunately it seems that the multi-step approach (apart from requiring
> more attention) is not producing the desired improvement of an order of a
> magnitude.

Ok, I was a bit skeptical about the storing and reindexing approach as well. However, it is interesting that the indexing takes nearly 6 times longer than just storing the data (8.5 hrs vs. 1.5 hrs).

Did you disable the default="all" full text index? If not, please do.

In your original post, you said you created a Lucene index? Is this the only index? If yes, we may need to look into Lucene optimizations. There's still some potential in this area.

Could you send me your collection.xconf and maybe an example document, so I can get an impression of the structure of your indexes? Also, how many collections do you have?

Wolfgang
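
For readers following along, here is a minimal sketch of what a collection.xconf with the default full text index switched off and a single Lucene index defined via qname might look like, generated from Python so it can be checked. The element layout follows the eXist 1.4 Lucene documentation as far as I recall it, and the qname "title" is purely a placeholder; neither is taken from the actual configuration discussed in this thread.

```python
import xml.etree.ElementTree as ET

NS = "http://exist-db.org/collection-config/1.0"
ET.register_namespace("", NS)  # serialize with a default namespace

def build_xconf(qnames):
    """Build a collection.xconf string with Lucene indexes on the given qnames."""
    collection = ET.Element(f"{{{NS}}}collection")
    index = ET.SubElement(collection, f"{{{NS}}}index")
    # Switch the default full text index off instead of default="all".
    ET.SubElement(index, f"{{{NS}}}fulltext",
                  {"default": "none", "attributes": "no"})
    lucene = ET.SubElement(index, f"{{{NS}}}lucene")
    for qname in qnames:
        ET.SubElement(lucene, f"{{{NS}}}text", {"qname": qname})
    return ET.tostring(collection, encoding="unicode")

# "title" is a hypothetical element name, not one from the poster's schema.
xconf = build_xconf(["title"])
print(xconf)
```

With default="all" left in place, every text node would be indexed in addition to the few nodes covered by the Lucene index, which is a likely contributor to the long indexing time reported above.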