xindice_rebuild - file size vastly multiplied

Ran xindice_rebuild on a smallish existing 1.0 database to get it into 1.1 format.
The size was expanded enormously:
db 1.0: 2 M
db 1.1: 233 M

That is a factor of 100 ;-(

I have not dared do this on my real 1.0 database, which now occupies more than 1G of disk space.

*Question*:
- Is this a feature or a bug?

Tool info as presented:
$ xindice -h
trying to register database

Xindice Command Tools v1.1

...etc...

/O
|
|
Re: xindice_rebuild - file size vastly multiplied

There was a change in Xindice v1.1 that could possibly be responsible for the database size increase. Xindice v1.0 failed to correctly allocate initial file space for a collection according to its page size and page count parameters, so a collection with just a few documents would occupy several KB on disk instead of reserving some space on disk for the collection to grow. This was fixed in v1.1, and a collection with the default page size (4 KB) and page count (1024) parameters now occupies about 4 MB.

If you had several small collections in Xindice v1.0, it is possible that after rebuilding them for v1.1 the database would take considerably more disk space. How many collections do you have in that database and how big are the collections? If this is indeed the reason for the database increase, it could be fixed by adjusting the page count parameter for small collections.

Also, I think Meta collections were introduced somewhere between 1.0 and 1.1. I cannot remember now if they are created when rebuilding a database, but if they are, they would take some disk space, too. You can explore the database directories to see if Meta collections are there (in system/Metas, I think). Meta collections can be turned off.

Natalia

On Thu, Apr 16, 2009 at 9:35 AM, OKO <olleo@...> wrote:
>
> Ran xindice_rebuild on a smallish existing 1.0 database to get it into 1.1
> format.
> The size was expanded enormously:
> db 1.0: 2 M
> db 1.1: 233 M
>
> That is a factor of 100 ;-(
>
> I have not dared do this on my real 1.0 database, which now occupies more
> than 1G of disk space.
>
> *Question*:
> - Is this a feature or a bug?
>
> Tool info as presented:
> $ xindice -h
> trying to register database
>
> Xindice Command Tools v1.1
>
> ...etc...
>
> /O
> --
> View this message in context: http://www.nabble.com/xindice_rebuild---file-size-vastly-multiplied-tp23078111p23078111.html
> Sent from the Xindice - Users mailing list archive at Nabble.com.
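To make the arithmetic in the reply above concrete, here is a minimal Python sketch of how the reserved space scales with the number of collection files. It assumes the initial file size is simply page size times page count (the reply only says "about 4Mb", so any per-file header is ignored), and the function name is made up for illustration.

```python
def preallocated_bytes(n_files, page_size=4096, page_count=1024):
    """Rough initial disk footprint of n_files collection files.

    Assumption: each file reserves page_size * page_count bytes up front;
    any per-file header overhead is ignored.
    """
    return n_files * page_size * page_count


MIB = 1024 * 1024

# One collection file with the defaults quoted in the reply.
print(preallocated_bytes(1) / MIB)   # 4.0

# Fifty small collection files already reserve 200 MiB, whatever they hold.
print(preallocated_bytes(50) / MIB)  # 200.0

# The same fifty files with a page count of 16 reserve about 3 MiB.
print(preallocated_bytes(50, page_count=16) / MIB)  # 3.125
```

Under this assumption the footprint of a database made of many tiny collections is driven by the number of files, not by the amount of data in them.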
|
|
Re: xindice_rebuild - file size vastly multiplied

Natalia,
Thanks for offering some hints. To put some data on the table ...

Below are some size quantities for two parts of the database:
(1) a collection with few and small documents
(2) a collection with many small documents

For each of these I give the size of the
(A) exported format
(B) xindice 1.0 database
(C) xindice 1.1 database

Sizes are computed by "du" (running Cygwin).

Summary table of sizes in KB (table more readable in fixed font ;-) ):

.     | (A)  | (B) | (C)   |
. (1) |   20 | 250 | 84000 |
. (2) | 4000 | 450 | 12000 |

The "1.1" vs "1.0" expansion factor is much larger for the case of collections with few and small files (factor 350) than for many and small files (factor 25).

But nevertheless, I wonder for what kind of databases 1.1 would beat 1.0 from a *size* point of view. Maybe comparing processing speed of "1.1" vs "1.0" would give a different picture, but at the moment I have no numbers to offer in that respect.

By the way, I do not have any very large documents in the database, so I have no idea how "1.1" compares to "1.0" for such beasts.

And the 64,000 $ question is, of course, what means are available to decrease the disk space footprint of the "1.1" database?

/O

==========================
(1) a set of collections with few and small documents

(A) This is a size listing of (part of) the exported 1.0 database.
Result: total of 20KB contents. Most of these files are small.
(Note: the "meta" collections/files are my own meta info of the application level, so not Xindice meta info!).
xxx$ du -ba publications
553 publications/rss-common/meta/meta.xml
1065 publications/rss-common/meta
1577 publications/rss-common
257 publications/rss_2_0/meta/meta.xml
769 publications/rss_2_0/meta
1281 publications/rss_2_0
260 publications/rss_0_92/meta/meta.xml
772 publications/rss_0_92/meta
1284 publications/rss_0_92
260 publications/rss_0_91/meta/meta.xml
772 publications/rss_0_91/meta
1284 publications/rss_0_91
438 publications/newsletter/meta/meta.xml
950 publications/newsletter/meta
1462 publications/newsletter
512 publications/archive/meta
1024 publications/archive
429 publications/newsletter-index/meta/meta.xml
941 publications/newsletter-index/meta
1453 publications/newsletter-index
427 publications/news-archive/meta/meta.xml
939 publications/news-archive/meta
1451 publications/news-archive
394 publications/home-page/meta/meta.xml
906 publications/home-page/meta
1418 publications/home-page
257 publications/rss_1_0/meta/meta.xml
769 publications/rss_1_0/meta
1281 publications/rss_1_0
315 publications/home-page-time-stamp/meta/meta.xml
827 publications/home-page-time-stamp/meta
1339 publications/home-page-time-stamp
350 publications/events/meta/meta.xml
862 publications/events/meta
1374 publications/events
1634 publications/meta/meta.xml
2146 publications/meta
18886 publications
xxx$

(B) The size of the corresponding 1.0 database files
Result: total of 250KB contents.
yyy$ du -ba publications/
12288 publications/archive/archive.tbl
12288 publications/archive/meta/meta.tbl
12288 publications/archive/meta
24576 publications/archive
12288 publications/home-page/home-page.tbl
12288 publications/home-page/meta/meta.tbl
12288 publications/home-page/meta
24576 publications/home-page
12288 publications/meta/meta.tbl
12288 publications/meta
12288 publications/news-archive/meta/meta.tbl
12288 publications/news-archive/meta
12288 publications/news-archive/news-archive.tbl
24576 publications/news-archive
12288 publications/newsletter/meta/meta.tbl
12288 publications/newsletter/meta
12288 publications/newsletter/newsletter.tbl
24576 publications/newsletter
12288 publications/publications.tbl
12288 publications/rss-common/meta/meta.tbl
12288 publications/rss-common/meta
12288 publications/rss-common/rss-common.tbl
24576 publications/rss-common
12288 publications/rss_0_91/meta/meta.tbl
12288 publications/rss_0_91/meta
12288 publications/rss_0_91/rss_0_91.tbl
24576 publications/rss_0_91
12288 publications/rss_0_92/meta/meta.tbl
12288 publications/rss_0_92/meta
12288 publications/rss_0_92/rss_0_92.tbl
24576 publications/rss_0_92
12288 publications/rss_1_0/meta/meta.tbl
12288 publications/rss_1_0/meta
12288 publications/rss_1_0/rss_1_0.tbl
24576 publications/rss_1_0
12288 publications/rss_2_0/meta/meta.tbl
12288 publications/rss_2_0/meta
12288 publications/rss_2_0/rss_2_0.tbl
24576 publications/rss_2_0
245760 publications/
yyy$

(C) ... and of the corresponding 1.1 database files
Result: total of 84000KB contents.
zzz$ du -ba publications/
4202496 publications/archive/archive.tbl
4202496 publications/archive/meta/meta.tbl
4202496 publications/archive/meta
8404992 publications/archive
4202496 publications/home-page/home-page.tbl
4202496 publications/home-page/meta/meta.tbl
4202496 publications/home-page/meta
8404992 publications/home-page
4202496 publications/meta/meta.tbl
4202496 publications/meta
4202496 publications/news-archive/meta/meta.tbl
4202496 publications/news-archive/meta
4202496 publications/news-archive/news-archive.tbl
8404992 publications/news-archive
4202496 publications/newsletter/meta/meta.tbl
4202496 publications/newsletter/meta
4202496 publications/newsletter/newsletter.tbl
8404992 publications/newsletter
4202496 publications/publications.tbl
4202496 publications/rss-common/meta/meta.tbl
4202496 publications/rss-common/meta
4202496 publications/rss-common/rss-common.tbl
8404992 publications/rss-common
4202496 publications/rss_0_91/meta/meta.tbl
4202496 publications/rss_0_91/meta
4202496 publications/rss_0_91/rss_0_91.tbl
8404992 publications/rss_0_91
4202496 publications/rss_0_92/meta/meta.tbl
4202496 publications/rss_0_92/meta
4202496 publications/rss_0_92/rss_0_92.tbl
8404992 publications/rss_0_92
4202496 publications/rss_1_0/meta/meta.tbl
4202496 publications/rss_1_0/meta
4202496 publications/rss_1_0/rss_1_0.tbl
8404992 publications/rss_1_0
4202496 publications/rss_2_0/meta/meta.tbl
4202496 publications/rss_2_0/meta
4202496 publications/rss_2_0/rss_2_0.tbl
8404992 publications/rss_2_0
84049920 publications/
zzz$

==========================
(2) a collection with a large set of small documents

(A) This is a size listing of (part of) the exported 1.0 database. This one contains approx 4000 rather small documents.
Result: total of 4000KB contents (so approx 1KB per document)

xxx$ du -b content/news
3932206 content/news/db
665 content/news/meta
3933383 content/news
xxx$
(note: not displaying individual files/documents ... there are too many of them)

(B) The size of the corresponding 1.0 database files
Result: total of 450KB contents.

yyy$ du -ba content/news/
430080 content/news/db/db.tbl
430080 content/news/db
12288 content/news/meta/meta.tbl
12288 content/news/meta
12288 content/news/news.tbl
454656 content/news/
yyy$

(C) ... and of the corresponding 1.1 database files
Result: total of 12MB contents.

zzz$ du -ba content/news/
4202496 content/news/db/db.tbl
4202496 content/news/db
4202496 content/news/meta/meta.tbl
4202496 content/news/meta
4202496 content/news/news.tbl
12607488 content/news/
zzz$

===end===
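The du figures above can be cross-checked with a few lines of Python. This is only a sketch: the byte counts are copied from the listings, the page size and page count are the defaults mentioned earlier in the thread, and the variable names are made up for illustration.

```python
# Defaults mentioned earlier in the thread.
PAGE_SIZE = 4096
DEFAULT_PAGE_COUNT = 1024

# Figures copied from the du listings (bytes).
tbl_size_v10 = 12288     # each small .tbl file in the 1.0 database
tbl_size_v11 = 4202496   # each .tbl file in the 1.1 database
tbl_files = 20           # .tbl files listed under publications/

# Space explained by page size * page count, and what is left over per file.
reserved = PAGE_SIZE * DEFAULT_PAGE_COUNT
leftover = tbl_size_v11 - reserved
print(reserved)   # 4194304
print(leftover)   # 8192

# Totals for case (1) match the du totals for publications/.
print(tbl_files * tbl_size_v10)   # 245760
print(tbl_files * tbl_size_v11)   # 84049920

# Per-file expansion factor for case (1).
print(tbl_size_v11 // tbl_size_v10)   # 342

# Expansion factor for case (2), from the content/news/ totals.
print(round(12607488 / 454656, 1))    # 27.7
```

So in case (1) every 1.1 file has exactly the same size, which is consistent with the size coming from the reserved pages and not from the documents themselves. What the remaining 8192 bytes per file are used for is not stated in the thread.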
|
|
|
Re: xindice_rebuild - file size vastly multiplied

On Thu, Apr 16, 2009 at 3:32 PM, OKO <olleo@...> wrote:
[snip]
>
> The "1.1" vs "1.0" expansion factor is much larger for the case of
> collections with few and small files (factor 350) , than for many and small
> files (factor 25).
>
> But nevertheless, I wonder for what kind of databases 1.1 would beat 1.0
> from a *size* point of view. Maybe comparing processing speed of "1.1" vs
> "1.0" would give a different picture, but at the moment I have no numbers to
> offer in that respect.

Here is the deal: the database format hasn't changed between 1.0 and 1.1, except that some bugs in index files were fixed and the collection files are now created properly, which in your case, unfortunately, means they take a lot more space than before. The reason is that the default size is intended for bigger collections with many documents. The database you tried to rebuild is a corner case, however, with several collections that only have a single document each, so the default size settings are way too large. But this is only the default setting, and it can be changed to a more suitable number. This way, the 1.1 database should take about the same amount of space as the existing database.

> By the way, I do not have any very large documents in the database, so I
> have no idea how "1.1" compares to "1.0" for such beasts.
>
> And the 64,000 $ question is, of course, what means are available to
> decrease disk space footprint of the "1.1" database?

The following steps can help to reduce database size:

1. Meta collections can be turned off. The setting is in the config/system.xml file (in the Xindice 1.1 directory). In the following line:

    <root-collection dbroot="./db/" name="db" use-metadata="on">

change use-metadata="on" to use-metadata="off". This will make all Meta collections go away.

2. Initial collection size can be adjusted (here I assume that all the collections use the default BTreeFiler to store data; HashFiler is a somewhat different beast). When creating a collection, it can be given a configuration that specifies the pagecount setting, which directly affects the initial collection size:

    <collection compressed="true" inline-metadata="true" name="test">
        <filer class="org.apache.xindice.core.filer.BTreeFiler" pagecount="16" />
    </collection>

When creating a collection from the command-line tool, this setting can be specified with the --pagecount parameter:

    bin\xindice ac -c /db -n test --pagecount 16

The "perfect" value for pagecount depends on the size and number of documents in a collection. For collections that have documents added and modified often, it should be picked based on how much data is expected to be stored in the collection, but in any case it only sets the initial size, so the collection file will grow when it is too small to hold new data. In your case, when a collection has only one small document and this situation is not expected to change, pagecount may very well be set to 1.

Now back to data migration... I made some changes to xindice_rebuild to add an optional pagecount parameter that will override the original collection setting. If you have the source version of the Xindice download, please save the attached file to <xindice directory>\java\src\org\apache\xindice\tools\DatabaseRebuild.java and run build.bat or build.sh depending on your OS. After that, you can try to rebuild the database again using the optional parameter:

    bin\xindice_rebuild.bat rebuild db -p 1

If you have the binary version of the Xindice download instead, please let me know, and I'll see what can be done for that.

Alternatively, collections can be rebuilt manually, by exporting documents, creating new collections using the command-line tool with the option "--pagecount 1", and importing documents into the new collections.

Let me know if something doesn't work for you.

Natalia
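The advice above on picking pagecount can be sketched in Python. The heuristic below is an assumption, not something Xindice prescribes: it takes the smallest page count whose pages cover the expected stored data, ignoring B-tree and page header overhead, so it is at best a lower bound. Both function names are hypothetical; only the command-line form (ac, -c, -n, --pagecount) is taken from the post.

```python
import math


def suggest_pagecount(expected_bytes, page_size=4096):
    """Hypothetical heuristic: smallest page count covering expected_bytes.

    Assumption: ignores B-tree and page header overhead, so treat the
    result as a lower bound. The collection file still grows on demand
    if the initial size turns out to be too small.
    """
    return max(1, math.ceil(expected_bytes / page_size))


def add_collection_command(parent, name, pagecount):
    """Build an 'xindice ac' command line in the form quoted in the post."""
    return f"bin\\xindice ac -c {parent} -n {name} --pagecount {pagecount}"


# A collection holding a single document of about 500 bytes.
print(add_collection_command("/db", "test", suggest_pagecount(500)))

# A collection expected to hold about 60,000 bytes of stored data.
print(add_collection_command("/db", "test", suggest_pagecount(60_000)))
```

Note that expected_bytes here means stored size, which can differ from the size of the exported XML (the example collection above is created with compressed="true").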
|
|
Re: xindice_rebuild - file size vastly multiplied

Natalia,
Regarding DatabaseRebuild.java ...

Well, I originally downloaded the binary 1.1 package (Windows), so I did not have a smooth setup for re-compiling the whole thing. As I wanted a very quick test, I recompiled DatabaseRebuild.java by putting it in a separate directory branch and using the jars as classpath context. It compiled fine.

Then I used it to redo the rebuild, using "-p 1" as suggested, and tested with the xindice (1.1) command-line tool to list/retrieve collections/documents, and it works.

The result was, for all practical purposes, a 1.1 db representation that was equal in size to the original 1.0 db representation. The only size difference seems to be:
- system/SysConfig ... 1.0: 24576 bytes vs 1.1: 32768 bytes
- system/SysSymbols ... 1.0: 659456 bytes vs 1.1: 651264 bytes
which is completely negligible.

Technically, the outcome of this initial test seems OK, so I will embark on a larger test in a week or two.

As I had these challenges about disk space, it could be useful to find some heuristics on the Xindice Wiki http://wiki.apache.org/xindice/ about how settings of these parameters may influence the physical disk space needed. Some statements you made in the last mail would certainly be useful to find there.

regards,
/O