xindice_rebuild - file size vastly multiplied

by OKO

Ran xindice_rebuild on a smallish existing 1.0 database to get it into 1.1 format.
The size expanded enormously:
db 1.0: 2 MB
db 1.1: 233 MB

That is a factor of more than 100 ;-(

I have not dared do this on my real 1.0 database, which now occupies more than 1 GB of disk space.

*Question*:
 - Is this a feature or a bug?

Tool info as presented:
$ xindice -h
trying to register database

Xindice Command Tools v1.1

...etc...

/O

Re: xindice_rebuild - file size vastly multiplied

by Natalia Shilenkova

There was a change in Xindice v1.1 that could be responsible
for the increase in database size.

Xindice v1.0 failed to correctly allocate initial file space for a
collection according to its page size and page count parameters, so a
collection with just a few documents would occupy only several KB on disk
instead of reserving space on disk for the collection to grow. This
was fixed in v1.1, and a collection with the default page size (4 KB) and
page count (1024) now occupies about 4 MB.
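
As a rough check of these figures, the initial allocation is simply page size times page count. The sketch below assumes 1 KB = 1024 bytes and ignores any file header; the function name and the count of 20 files are illustrative only and not part of Xindice.

```python
# Rough estimate of the initial on-disk allocation of one collection file.
# Assumption: size = page size * page count, ignoring any file header.

DEFAULT_PAGE_SIZE = 4 * 1024   # default page size: 4 KB, in bytes
DEFAULT_PAGE_COUNT = 1024      # default page count


def initial_allocation(page_size=DEFAULT_PAGE_SIZE, page_count=DEFAULT_PAGE_COUNT):
    """Return the estimated initial file size in bytes."""
    return page_size * page_count


per_file = initial_allocation()
print(per_file)                    # 4194304 bytes, i.e. 4 MB

# Hypothetical database with 20 collection files, all at the defaults.
print(20 * per_file / 1024 ** 2)   # 80.0 (MB)
```

So a database made up of many tiny collections pays roughly 4 MB per collection file under the defaults, however little data each one holds.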

If you had several small collections in Xindice v1.0, it is possible
that after rebuilding them for v1.1 the database takes
considerably more disk space. How many collections do you have in that
database, and how big are they? If this is indeed the reason
for the increase, it can be fixed by adjusting the page count
parameter for small collections.

Also, I think Meta collections were introduced somewhere between 1.0
and 1.1. I cannot remember now whether they are created when rebuilding a
database, but if they are, they would take some disk space, too. You can
explore the database directories to see if Meta collections are there (in
system/Metas, I think). Meta collections can be turned off.

Natalia

On Thu, Apr 16, 2009 at 9:35 AM, OKO <olleo@...> wrote:

[snip]

Re: xindice_rebuild - file size vastly multiplied

by OKO

Natalia,
Thanks for offering some hints.

To put some data on the table ...

Below are some sizes for two parts of the database:
   (1) a set of collections with few and small documents
   (2) a collection with many small documents

For each of these I give the size of
   (A) the exported format
   (B) the xindice 1.0 database
   (C) the xindice 1.1 database

Sizes are computed by "du" (running Cygwin).

Summary table of sizes in KB (table more readable in fixed font ;-) ):

case | (A) export | (B) 1.0 db | (C) 1.1 db
(1)  |         20 |        250 |      84000
(2)  |       4000 |        450 |      12000

The "1.1" vs "1.0" expansion factor is much larger for collections with few and small files (factor of about 350) than for a collection with many small files (factor of about 25).
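
For reference, the factors can be recomputed from the rounded KB figures in the summary table. This is only a small sketch; the dictionary layout and names are illustrative.

```python
# Sizes in KB from the summary table: (B) Xindice 1.0 and (C) Xindice 1.1.
sizes_kb = {
    "(1) few small documents": {"v1_0": 250, "v1_1": 84000},
    "(2) many small documents": {"v1_0": 450, "v1_1": 12000},
}


def expansion_factor(v1_0_kb, v1_1_kb):
    """Ratio of the 1.1 size to the 1.0 size."""
    return v1_1_kb / v1_0_kb


factors = {
    case: expansion_factor(s["v1_0"], s["v1_1"]) for case, s in sizes_kb.items()
}
for case, factor in factors.items():
    print(f"{case}: x{factor:.0f}")
# (1) few small documents: x336
# (2) many small documents: x27
```

With these rounded inputs the ratios come out at about 336 and 27, in line with the approximate factors quoted above.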

But nevertheless, I wonder for what kind of databases 1.1 would beat 1.0 from a *size* point of view. Maybe comparing processing speed of "1.1" vs "1.0" would give a different picture, but at the moment I have no numbers to offer in that respect.

By the way, I do not have any very large documents in the database, so I have no idea how "1.1" compares to "1.0" for such beasts.

And the 64,000 $ question is, of course: what means are available to decrease the disk space footprint of the "1.1" database?

/O


==========================
(1) a set of collections with few and small documents

(A) This is a size listing of (part of) the exported 1.0 database.
Result: total of 20KB contents.
Most of these files are small.
(Note: the "meta" collections/files are my own meta info of the application level, so not Xindice meta info!).

xxx$ du -ba publications
553     publications/rss-common/meta/meta.xml
1065    publications/rss-common/meta
1577    publications/rss-common
257     publications/rss_2_0/meta/meta.xml
769     publications/rss_2_0/meta
1281    publications/rss_2_0
260     publications/rss_0_92/meta/meta.xml
772     publications/rss_0_92/meta
1284    publications/rss_0_92
260     publications/rss_0_91/meta/meta.xml
772     publications/rss_0_91/meta
1284    publications/rss_0_91
438     publications/newsletter/meta/meta.xml
950     publications/newsletter/meta
1462    publications/newsletter
512     publications/archive/meta
1024    publications/archive
429     publications/newsletter-index/meta/meta.xml
941     publications/newsletter-index/meta
1453    publications/newsletter-index
427     publications/news-archive/meta/meta.xml
939     publications/news-archive/meta
1451    publications/news-archive
394     publications/home-page/meta/meta.xml
906     publications/home-page/meta
1418    publications/home-page
257     publications/rss_1_0/meta/meta.xml
769     publications/rss_1_0/meta
1281    publications/rss_1_0
315     publications/home-page-time-stamp/meta/meta.xml
827     publications/home-page-time-stamp/meta
1339    publications/home-page-time-stamp
350     publications/events/meta/meta.xml
862     publications/events/meta
1374    publications/events
1634    publications/meta/meta.xml
2146    publications/meta
18886   publications
xxx$

(B) The size of the corresponding 1.0 database files
Result: total of 250KB contents.

yyy$ du -ba publications/
12288   publications/archive/archive.tbl
12288   publications/archive/meta/meta.tbl
12288   publications/archive/meta
24576   publications/archive
12288   publications/home-page/home-page.tbl
12288   publications/home-page/meta/meta.tbl
12288   publications/home-page/meta
24576   publications/home-page
12288   publications/meta/meta.tbl
12288   publications/meta
12288   publications/news-archive/meta/meta.tbl
12288   publications/news-archive/meta
12288   publications/news-archive/news-archive.tbl
24576   publications/news-archive
12288   publications/newsletter/meta/meta.tbl
12288   publications/newsletter/meta
12288   publications/newsletter/newsletter.tbl
24576   publications/newsletter
12288   publications/publications.tbl
12288   publications/rss-common/meta/meta.tbl
12288   publications/rss-common/meta
12288   publications/rss-common/rss-common.tbl
24576   publications/rss-common
12288   publications/rss_0_91/meta/meta.tbl
12288   publications/rss_0_91/meta
12288   publications/rss_0_91/rss_0_91.tbl
24576   publications/rss_0_91
12288   publications/rss_0_92/meta/meta.tbl
12288   publications/rss_0_92/meta
12288   publications/rss_0_92/rss_0_92.tbl
24576   publications/rss_0_92
12288   publications/rss_1_0/meta/meta.tbl
12288   publications/rss_1_0/meta
12288   publications/rss_1_0/rss_1_0.tbl
24576   publications/rss_1_0
12288   publications/rss_2_0/meta/meta.tbl
12288   publications/rss_2_0/meta
12288   publications/rss_2_0/rss_2_0.tbl
24576   publications/rss_2_0
245760  publications/
yyy$

(C) ... and of the corresponding 1.1 database files
Result: total of 84000KB contents.


zzz$ du -ba publications/
4202496 publications/archive/archive.tbl
4202496 publications/archive/meta/meta.tbl
4202496 publications/archive/meta
8404992 publications/archive
4202496 publications/home-page/home-page.tbl
4202496 publications/home-page/meta/meta.tbl
4202496 publications/home-page/meta
8404992 publications/home-page
4202496 publications/meta/meta.tbl
4202496 publications/meta
4202496 publications/news-archive/meta/meta.tbl
4202496 publications/news-archive/meta
4202496 publications/news-archive/news-archive.tbl
8404992 publications/news-archive
4202496 publications/newsletter/meta/meta.tbl
4202496 publications/newsletter/meta
4202496 publications/newsletter/newsletter.tbl
8404992 publications/newsletter
4202496 publications/publications.tbl
4202496 publications/rss-common/meta/meta.tbl
4202496 publications/rss-common/meta
4202496 publications/rss-common/rss-common.tbl
8404992 publications/rss-common
4202496 publications/rss_0_91/meta/meta.tbl
4202496 publications/rss_0_91/meta
4202496 publications/rss_0_91/rss_0_91.tbl
8404992 publications/rss_0_91
4202496 publications/rss_0_92/meta/meta.tbl
4202496 publications/rss_0_92/meta
4202496 publications/rss_0_92/rss_0_92.tbl
8404992 publications/rss_0_92
4202496 publications/rss_1_0/meta/meta.tbl
4202496 publications/rss_1_0/meta
4202496 publications/rss_1_0/rss_1_0.tbl
8404992 publications/rss_1_0
4202496 publications/rss_2_0/meta/meta.tbl
4202496 publications/rss_2_0/meta
4202496 publications/rss_2_0/rss_2_0.tbl
8404992 publications/rss_2_0
84049920 publications/
zzz$
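
A quick cross-check of the two listings for (B) and (C): both contain the same 20 .tbl files, and every file has the same size within a listing. The sketch below only redoes the arithmetic on the numbers shown. The page interpretation assumes the default 4 KB page size, and what the two pages beyond the default page count of 1024 hold is not known from the listings.

```python
# Values copied from the two "du -ba publications/" listings above.
TBL_FILES = 20        # .tbl files in each listing
SIZE_V1_0 = 12288     # bytes per .tbl file, Xindice 1.0
SIZE_V1_1 = 4202496   # bytes per .tbl file, Xindice 1.1
PAGE_SIZE = 4096      # assumed default page size (4 KB)

total_v1_0 = TBL_FILES * SIZE_V1_0
total_v1_1 = TBL_FILES * SIZE_V1_1
print(total_v1_0, total_v1_1)    # 245760 84049920, matching the du totals

# Express each file size as a number of 4 KB pages.
pages_v1_0, rest_v1_0 = divmod(SIZE_V1_0, PAGE_SIZE)
pages_v1_1, rest_v1_1 = divmod(SIZE_V1_1, PAGE_SIZE)
print(pages_v1_0, rest_v1_0)     # 3 0
print(pages_v1_1, rest_v1_1)     # 1026 0
```

Each 1.1 file is therefore 1026 pages of 4 KB, against 3 pages in 1.0, which accounts for the whole difference between the two totals.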


==========================
(2) a collection with a large set of small documents

(A) This is a size listing of (part of) the exported 1.0 database.
This one contains approx 4000 rather small documents.
Result: total of 4000KB contents (so approx 1KB per document)

xxx$ du -b content/news
3932206 content/news/db
665     content/news/meta
3933383 content/news
xxx$

(note: not displaying individual files/documents ... there are too many of them)

(B) The size of the corresponding 1.0 database files
Result: total of 450KB contents.

yyy$ du -ba content/news/
430080  content/news/db/db.tbl
430080  content/news/db
12288   content/news/meta/meta.tbl
12288   content/news/meta
12288   content/news/news.tbl
454656  content/news/
yyy$

(C) ... and of the corresponding 1.1 database files
Result: total of 12MB contents.

zzz$ du -ba content/news/
4202496 content/news/db/db.tbl
4202496 content/news/db
4202496 content/news/meta/meta.tbl
4202496 content/news/meta
4202496 content/news/news.tbl
12607488   content/news/
zzz$
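
Broken down per file, the growth in case (2) is very uneven. A small sketch using only the byte counts from the (B) and (C) listings above (the dictionary keys are shortened paths):

```python
# Per-file sizes in bytes for content/news, from the listings above.
v1_0 = {"db/db.tbl": 430080, "meta/meta.tbl": 12288, "news.tbl": 12288}
v1_1 = {"db/db.tbl": 4202496, "meta/meta.tbl": 4202496, "news.tbl": 4202496}

# Growth factor of each file when going from 1.0 to 1.1.
growth = {name: v1_1[name] / v1_0[name] for name in v1_0}
for name, factor in growth.items():
    print(f"{name}: x{factor:.1f}")
# db/db.tbl: x9.8
# meta/meta.tbl: x342.0
# news.tbl: x342.0

overall = sum(v1_1.values()) / sum(v1_0.values())
print(f"overall: x{overall:.1f}")   # overall: x27.7
```

The file that actually holds the roughly 4000 documents grows by only about a factor of 10, while the two nearly empty files grow by a factor of 342 each. That is why the overall factor for case (2) is so much smaller than for case (1).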

===end===


Re: xindice_rebuild - file size vastly multiplied

by Natalia Shilenkova

On Thu, Apr 16, 2009 at 3:32 PM, OKO <olleo@...> wrote:
[snip]
>
> The "1.1" vs "1.0" expansion factor is much larger for the case of
> collections with few and small files (factor 350) , than for many and small
> files (factor 25).
>
> But nevertheless, I wonder for what kind of databases 1.1 would beat 1.0
> from a *size* point of view. Maybe comparing processing speed of "1.1" vs
> "1.0" would give a different picture, but at the moment I have no numbers to
> offer in that respect.

Here is the deal: the database format hasn't changed between 1.0 and 1.1,
except that some bugs in index files were fixed and the collection files
are now created properly, which in your case, unfortunately, means
they take a lot more space than before. The reason is that the default
size is intended for bigger collections with many documents.

The database you tried to rebuild is a corner case, however, with
several collections that have only a single document each, so the default
size settings are way too large. But this is only the default setting,
and it can be changed to a more suitable number. This way, the 1.1 database
should take about the same amount of space as the existing database.

> By the way, I do not have any very large documents in the database, so I
> have no idea how "1.1" compares to "1.0" for such beasts.
>
> And the 64,000 $ question is, of course, what means are available to
> decrease disk space footprint of the "1.1"  database?

The following steps can help to reduce the database size:
1. Meta collections can be turned off. The setting is in the
config/system.xml file (in the Xindice 1.1 directory). In the following
line:

    <root-collection dbroot="./db/" name="db" use-metadata="on">

change the value to use-metadata="off". This will make all Meta collections go away.

2. The initial collection size can be adjusted (here I assume that all the
collections use the default BTreeFiler to store data; HashFiler is a
somewhat different beast). When creating a collection, it can be given
a configuration that specifies the pagecount setting, which directly affects
the initial collection size:

<collection compressed="true" inline-metadata="true" name="test">
  <filer class="org.apache.xindice.core.filer.BTreeFiler" pagecount="16" />
</collection>

When creating a collection from the command-line tool, this setting can be
specified with the --pagecount parameter:

bin\xindice ac -c /db -n test --pagecount 16

The "perfect" value for pagecount depends on the size and number of
documents in a collection. For collections that have documents added
and modified often, it should be picked based on how much data is
expected to be stored in the collection. In any case it only sets the
initial size, so the collection file will grow when it becomes too small to
hold new data.

In your case, when a collection has only one small document and this
situation is not expected to change, pagecount may very well be set to
1.
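
One way to pick a starting value is to divide the expected amount of data by the page size and round up. The helper below is a hypothetical sketch, not part of Xindice. It assumes the default 4 KB page size and ignores B-tree overhead and compression, so treat the result as a lower bound rather than an exact figure.

```python
import math

PAGE_SIZE = 4096  # assumed default page size in bytes (4 KB)


def suggest_pagecount(expected_bytes, page_size=PAGE_SIZE, headroom=1.0):
    """Hypothetical helper: pages needed to hold expected_bytes of data.

    headroom > 1.0 reserves extra room for growth. B-tree overhead and
    compression are ignored, so this is only a rough starting point.
    """
    pages = math.ceil(expected_bytes * headroom / page_size)
    return max(1, pages)


print(suggest_pagecount(553))                    # 1  (one small document)
print(suggest_pagecount(60_000))                 # 15
print(suggest_pagecount(60_000, headroom=1.5))   # 22
```

Since pagecount only sets the initial size and the file grows on demand, erring on the low side costs little for collections that rarely change.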

Now back to data migration... I made some changes to xindice_rebuild
to add an optional pagecount parameter that overrides the original
collection setting.

If you have the source distribution of Xindice, please save the attached
file to <xindice directory>\java\src\org\apache\xindice\tools\DatabaseRebuild.java
and run build.bat or build.sh, depending on your OS. After that, you can
try to rebuild the database again using the optional parameter:

bin\xindice_rebuild.bat rebuild db -p 1

If you have the binary distribution of Xindice instead, please let me
know and I'll see what can be done. Alternatively, collections
can be rebuilt manually by exporting the documents, creating new
collections using the command-line tool with the option "--pagecount 1",
and importing the documents into the new collections.
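
For the manual route with many collections, the "add collection" commands can be generated rather than typed. The sketch below only prints command lines in the same form as the example shown earlier (ac -c ... -n ... --pagecount ...) and does not run anything. The parent path /db/publications and the collection names are assumptions for illustration. The export and import steps are left out because their exact syntax is not given in this thread.

```python
def add_collection_command(parent, name, pagecount, tool="bin\\xindice"):
    """Build the argument list for one 'add collection' call.

    Only the flags shown earlier in this thread are used: ac, -c, -n
    and --pagecount.
    """
    return [tool, "ac", "-c", parent, "-n", name, "--pagecount", str(pagecount)]


# Hypothetical parent path and collection names, for illustration only.
parent = "/db/publications"
names = ["archive", "home-page", "newsletter"]

commands = [add_collection_command(parent, name, 1) for name in names]
for cmd in commands:
    print(" ".join(cmd))
# bin\xindice ac -c /db/publications -n archive --pagecount 1
# bin\xindice ac -c /db/publications -n home-page --pagecount 1
# bin\xindice ac -c /db/publications -n newsletter --pagecount 1
```

The parent collection has to exist before its children are added, so the commands need to be run top-down.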

Let me know if something doesn't work for you.

Natalia


Attachment: DatabaseRebuild.java (20K)

Re: xindice_rebuild - file size vastly multiplied

by OKO

Natalia,

Regarding DatabaseRebuild.java ...

Well, I originally downloaded the binary 1.1 package (Windows), so I did not have a smooth setup for re-compiling the whole thing.

As I wanted a very quick test, I did recompile DatabaseRebuild.java, by putting it in a separate directory branch, and using the jars as classpath context. It compiled fine.

Then I used it to redo the rebuild, using "-p 1" as suggested.
I tested the result with the xindice (1.1) command-line tool, listing and retrieving collections and documents, and it works.

The result was, for all practical purposes, a 1.1 db representation that was equal in size to the original 1.0 db representation.

The only size difference seems to be:
 - system/SysConfig  ... 1.0:  24576 bytes  vs  1.1:  32768 bytes
 - system/SysSymbols ... 1.0: 659456 bytes  vs  1.1: 651264 bytes
which is completely negligible.
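
Spelled out, the two differences are equal in size and opposite in sign. A small sketch using only the four byte counts above:

```python
# Byte counts for the two system collections, as listed above.
sizes = {
    "system/SysConfig": {"v1_0": 24576, "v1_1": 32768},
    "system/SysSymbols": {"v1_0": 659456, "v1_1": 651264},
}

# Difference per collection (1.1 minus 1.0) and the net change.
deltas = {name: s["v1_1"] - s["v1_0"] for name, s in sizes.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+d} bytes")
# system/SysConfig: +8192 bytes
# system/SysSymbols: -8192 bytes

net = sum(deltas.values())
print(f"net change: {net} bytes")   # net change: 0 bytes
```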

Technically, the outcome of this initial test seems OK, so I will embark on a larger test in a week or two.


As I had these challenges about disk space, it could be useful to find some heuristics on the Xindice Wiki
   http://wiki.apache.org/xindice/
about how settings of these parameters may influence the physical disk space needed. Some statements you made in the last mail certainly could be useful to find there.


regards,

/O

