[interval container (itl)] A pretty large bitset

View: New views
8 Messages — Rating Filter:   Alert me  

[interval container (itl)] A pretty large bitset

by Joachim Faulhaber :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Dear list!

In thread

http://lists.boost.org/Archives/boost/2009/06/152786.php

Zachary Turner posted an article about a compressed bitset,
that uses run-length defined intervals to achieve a compact
representation of large bitsets that are useful for
managing huge file allocation tables.

As I mentioned in a reply to this article, Zach's compressed
bitsets, in their basic ideas, are quite similar to
itl::interval_sets: They allow for compression of large
quantities of elements as long as they occur in (large)
uninterrupted chunks.

As Zach pointed out, this nice behavior is completely
spoiled, if we would want to handle a set like

{1,3,5,7,9,...,2(k-1)}   or as bitset
'0101010101...01'

Here, the capability of usual bitsets to compress elements
that are distributed over only small ranges can neither
be applied to Zach's compressed bitset nor to an
itl::interval_set of the Interval Template Library.

As a new use case of interval containers I have developed
a *large_bitset* class template on top of itl::interval_map
that is able to combine interval compression *and* bitset
compression. It is, for instance, capable of representing
the bitset mentioned above

'01010101...010'
 a            b

 a: bit 0
 b: bit 2^64-1

containing 2^63 bits (or elements) with just *one* internal
interval associated to only *one* value of 64bit length ;)

Curious??

Find the documented example code in my updated library
docs at

http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/projects.html

While the documented class template 'large_bitset', for
reasons of a short presentation, is only a minimal one,
you can find a more elaborated version as 'interval_bitset'
in the sandbox at

https://svn.boost.org/svn/boost/sandbox/itl/boost/itl_xt/interval_bitset.hpp

I'd like to ask Zachary Turner to comment on the
use case. Can this be useful in your field?

Moreover I'd like to encourage everyone to give feedback
on my interval container library, report problems, bugs
and share suggestions. If you find this stuff useful
please volunteer as review manager for the submitted
library Boost Interval Containers.

Note, that the new example on 'large_bitsets'
is not yet included in the library submitted for review
in the vault but it is contained in the sandbox.
I am currently preparing an update of the review version of
the itl, that is adjusted and corrected for the feedback,
suggestions and patches that I have received during the
last weeks. I will integrate further feedback and account
for bug reports if you have any.

Best regards
Joachim
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [interval container (itl)] A pretty large bitset

by OvermindDL1 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Mon, Nov 2, 2009 at 10:25 AM, Joachim Faulhaber
<afojgo@...> wrote:

> Dear list!
>
> In thread
>
> http://lists.boost.org/Archives/boost/2009/06/152786.php
>
> Zachary Turner posted an article about a compressed bitset,
> that uses run-length defined intervals to achieve a compact
> representation of large bitsets that are useful for
> managing huge file allocation tables.
>
> As I mentioned in a reply to this article, Zach's compressed
> bitsets, in their basic ideas, are quite similar to
> itl::interval_sets: They allow for compression of large
> quantities of elements as long as they occur in (large)
> uninterrupted chunks.
>
> As Zach pointed out, this nice behavior is completely
> spoiled, if we would want to handle a set like
>
> {1,3,5,7,9,...,2(k-1)}   or as bitset
> '0101010101...01'
>
> Here, the capability of usual bitsets to compress elements
> that are distributed over only small ranges can neither
> be applied to Zach's compressed bitset nor to an
> itl::interval_set of the Interval Template Library.
>
> As a new use case of interval containers I have developed
> a *large_bitset* class template on top of itl::interval_map
> that is able to combine interval compression *and* bitset
> compression. It is, for instance, capable of representing
> the bitset mentioned above
>
> '01010101...010'
>  a            b
>
>  a: bit 0
>  b: bit 2^64-1
>
> containing 2^63 bits (or elements) with just *one* internal
> interval associated to only *one* value of 64bit length ;)
>
> Curious??
>
> Find the documented example code in my updated library
> docs at
>
> http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/projects.html
>
> While the documented class template 'large_bitset', for
> reasons of a short presentation, is only a minimal one,
> you can find a more elaborated version as 'interval_bitset'
> in the sandbox at
>
> https://svn.boost.org/svn/boost/sandbox/itl/boost/itl_xt/interval_bitset.hpp
>
> I'd like to ask Zachary Turner to comment on the
> use case. Can this be useful in your field?
>
> Moreover I'd like to encourage everyone to give feedback
> on my interval container library, report problems, bugs
> and share suggestions. If you find this stuff useful
> please volunteer as review manager for the submitted
> library Boost Interval Containers.
>
> Note, that the new example on 'large_bitsets'
> is not yet included in the library submitted for review
> in the vault but it is contained in the sandbox.
> I am currently preparing an update of the review version of
> the itl, that is adjusted and corrected for the feedback,
> suggestions and patches that I have received during the
> last weeks. I will integrate further feedback and account
> for bug reports if you have any.

Hmm, this looks quite fascinating, I have very random bitset and
normal sparse methods do not work, might see if this does.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [interval container (itl)] A pretty large bitset

by Scott McMurray-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

2009/11/2 Joachim Faulhaber <afojgo@...>:

>
> As a new use case of interval containers I have developed
> a *large_bitset* class template on top of itl::interval_map
> that is able to combine interval compression *and* bitset
> compression. It is, for instance, capable of representing
> the bitset mentioned above
>
> '01010101...010'
>  a            b
>
>  a: bit 0
>  b: bit 2^64-1
>
> containing 2^63 bits (or elements) with just *one* internal
> interval associated to only *one* value of 64bit length ;)
>
> Curious??
>

Quite.

How does it handle a repeat of 128 1s then 128 0s then 128 1s etc?
(Or any repeating signal with a period longer than one storage unit?)

~ Scott
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [interval container (itl)] A pretty large bitset

by Zachary Turner-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Mon, Nov 2, 2009 at 11:25 AM, Joachim Faulhaber <afojgo@...>wrote:

>
> I'd like to ask Zachary Turner to comment on the
> use case. Can this be useful in your field?
>
> Moreover I'd like to encourage everyone to give feedback
> on my interval container library, report problems, bugs
> and share suggestions. If you find this stuff useful
> please volunteer as review manager for the submitted
> library Boost Interval Containers.
>
> Note, that the new example on 'large_bitsets'
> is not yet included in the library submitted for review
> in the vault but it is contained in the sandbox.
> I am currently preparing an update of the review version of
> the itl, that is adjusted and corrected for the feedback,
> suggestions and patches that I have received during the
> last weeks. I will integrate further feedback and account
> for bug reports if you have any.
>

This is interesting and at some point I might download the library and play
with it.  Is there any easy way for me to determine the memory usage of an
interval_map / interval_set?

It's hard to say whether this will offer an advantage in my use case because
being that I'm interested in primarily a) which blocks on a filesystem are
allocated, and b) which blocks on a filesystem have changed since the last
time I checked, it depends heavily on the usage patterns of the user.

The which-blocks-are-allocated bitset would be more likely to *not* benefit
from this type of additional compression, or at least not significantly,
because it would be very rare on a disk to have a  significantly high
concentration of alternating bits, or extremely short runs.

The which-blocks-have-changed-since-last-time bitset is a little more
uncertain.  Obviously file access patterns aren't truly random, but let's
just say in the worst case scenario of completely random disk access, and
hence completely random bits being flipped from 0 to 1, how would this
affect the memory usage of this template?  I feel like this is kind of a
difficult situation to analyze so I'm not looking for an exact O() storage
complexity or anything, but just any insight you have to offer.   My current
solution compresses only intervals and thus also suffers in the case of
random access, but I wonder if this solution would allow me to save space.

What kind of iteration support is provided?  I often need to iterate through
each bit that is set to 1, where the value_type of the iterator is the
0-based index of the bit in the bitset.  So for example if I have a bitset
representing the bit string 0110110101100 I would want, or at least I would
want to be able to easily create, an iterator that returns 1, 2, 4, 5, 7, 9,
10.  The other main important thing would be the ability to do in-place
bitwise |, &, ~.   Advancing an iterator also needs to have constant-time
complexity for me, and the bitwise operators should probably be no greater
than n log(n).  O(n) would be ideal on the bitwiswe operators, but I don't
think my current implementation supports that due to the need to do interval
merging and balancing.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [interval container (itl)] A pretty large bitset

by Joachim Faulhaber :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

2009/11/3 OvermindDL1 <overminddl1@...>:

> On Mon, Nov 2, 2009 at 10:25 AM, Joachim Faulhaber
> <afojgo@...> wrote:
>>
>> Find the documented example code in my updated library
>> docs at
>>
>> http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/projects.html
>>
>
> Hmm, this looks quite fascinating, I have very random bitset and
> normal sparse methods do not work, might see if this does.

I hope I does :)

But I have to point out, that a randomly distributed *and*
sparse set is the worst case scenario for the 'large_bitset'
as well.

(1) If you have long runs of bits set to '1' (or repeating bit patterns),
  interval compression kicks in.
(2) If you have regions with randomly but relatively densely distributed
  bits, bit compression will help.

For a randomly distributed and sparse set non of these
compression techniques can be applied. In the worst case
you are holding one tree node of the underlying rb-tree for
every bit.

Joachim
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [interval container (itl)] A pretty large bitset

by Joachim Faulhaber :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

2009/11/3 Scott McMurray <me22.ca+boost@...>:

> 2009/11/2 Joachim Faulhaber <afojgo@...>:
>>
>> As a new use case of interval containers I have developed
>> a *large_bitset* class template on top of itl::interval_map
>> that is able to combine interval compression *and* bitset
>> compression. It is, for instance, capable of representing
>> the bitset mentioned above
>>
>> '01010101...010'
>>  a            b
>>
>>  a: bit 0
>>  b: bit 2^64-1
>>
>> containing 2^63 bits (or elements) with just *one* internal
>> interval associated to only *one* value of 64bit length ;)
>>
>> Curious??
>>
>
> Quite.

The compact storage of repeated bit patterns is only the lead story
to baffle the reader a bit ;)

In the end I do not think that this is the most important property of
large_bitset. In fact it is just a byproduct of the basic idea to
combine interval and bitset compression via an itl::interval_map of
bitsets.

> How does it handle a repeat of 128 1s then 128 0s then 128 1s etc?
> (Or any repeating signal with a period longer than one storage unit?)

So a large_bitset<uint64_t, mini::bits<uint64_t> > would handle
them this way:

[0,2)->"111...1"  (all 64 bits set)
[2,4)->"111...1"
...
[k,k+2)->"111...1"

which is obviously much less compact than the initial example
but still a good compression.

But, to be honest, the "periodic bit compression magic" has
more limitations than that. For instance it does not manage
to represent a bit pattern
"100100100...100"
with a single node like this can be done for patterns with
an even period length <= 64 (or word_length).

As I said, it's just a nice byproduct of the design. More
important IMO is the combination of interval and bit
compression as such and also the fact that large_bitset
can be implemented with *very little* code

http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/projects.html

on top of itl::interval_map, which demonstrates the potential
of interval containers.

Cheers
Joachim
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [interval container (itl)] A pretty large bitset

by Joachim Faulhaber :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

2009/11/3 Joachim Faulhaber <afojgo@...>:

> 2009/11/3 Scott McMurray <me22.ca+boost@...>:
>> 2009/11/2 Joachim Faulhaber <afojgo@...>:
>>>
>
>> How does it handle a repeat of 128 1s then 128 0s then 128 1s etc?
>> (Or any repeating signal with a period longer than one storage unit?)
>
> So a large_bitset<uint64_t, mini::bits<uint64_t> > would handle
> them this way:
>
> [0,2)->"111...1"  (all 64 bits set)
> [2,4)->"111...1"
> ...
> [k,k+2)->"111...1"
>
> which is obviously much less compact than the initial example
> but still a good compression.
>

Correction:

[0,2)->"111...1"  (all 128 bits set from bit 0 to bit 127)
                        (none of 128 bits set from bit 128 to bit 255)
[4,6)->"111...1"  (all 128 bits set from bit 256 to bit 383) and so on
...
[4*k, 4*k+2)->"111...1"

--
Joachim
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [interval container (itl)] A pretty large bitset

by Joachim Faulhaber :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Zach,

2009/11/3 Zachary Turner <divisortheory@...>:
> On Mon, Nov 2, 2009 at 11:25 AM, Joachim Faulhaber <afojgo@...>wrote:
>
>> I'd like to ask Zachary Turner to comment on the
>> use case. Can this be useful in your field?
>
> This is interesting and at some point I might download the library and play
> with it.  Is there any easy way for me to determine the memory usage of an
> interval_map / interval_set?

interval_map / interval_set are implemented via std::map / std::set.
So for every node carrying an interval / interval-value-pair
you have three pointers (left,right,parent) and a color bit.
The interval itself needs 2*sizeof(domain_type) plus
one byte for the border information. Finally a map node
has to provide space for the value of it's codomain type.

The number of interval or interval-value nodes is
returned by the function interval_count() or iterative_size()
on all interval containers.

> It's hard to say whether this will offer an advantage in my use case because
> being that I'm interested in primarily a) which blocks on a filesystem are
> allocated, and b) which blocks on a filesystem have changed since the last
> time I checked, it depends heavily on the usage patterns of the user.
>
> The which-blocks-are-allocated bitset would be more likely to *not* benefit
> from this type of additional compression, or at least not significantly,
> because it would be very rare on a disk to have a  significantly high
> concentration of alternating bits,

as I have already mentioned in this thread, the periodic bit compression
is a nice byproduct but should not be overemphasized. Clearly we won't
find periodic patterns very often in file allocation tables ;)

> or extremely short runs.

... hmm, I thought, due to fragmentation there could be more
narrow sequences of changing block patterns that could profit
from bitcompression ...

> The which-blocks-have-changed-since-last-time bitset is a little more
> uncertain.  Obviously file access patterns aren't truly random, but let's
> just say in the worst case scenario of completely random disk access, and
> hence completely random bits being flipped from 0 to 1, how would this
> affect the memory usage of this template?  I feel like this is kind of a
> difficult situation to analyze so I'm not looking for an exact O() storage
> complexity or anything, but just any insight you have to offer.   My current
> solution compresses only intervals and thus also suffers in the case of
> random access, but I wonder if this solution would allow me to save space.

It would allow to save space only, if in addition to interval compression
your blocks-have-changed-patterns allow for additional bit compression:
Clusters of changed-blocks within not too big ranges, that
can be compressed into 'local' bitsets.

In the worst case scenario of randomly and sparsely distributed bits
'large_bitset' has no advantage over an itl::interval_set or your
compressed_bitset implementation (if I understood it right).

> What kind of iteration support is provided?  I often need to iterate through
> each bit that is set to 1, where the value_type of the iterator is the
> 0-based index of the bit in the bitset.  So for example if I have a bitset
> representing the bit string 0110110101100 I would want, or at least I would
> want to be able to easily create, an iterator that returns 1, 2, 4, 5, 7, 9,
> 10.

mini::large_bitset started as a non trivial example on the usefulness
of itl::interval_maps. So this implementation
http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/projects.html
aims to be very *mini*mal. Also the more elaborated version 'interval_bitset'
https://svn.boost.org/svn/boost/sandbox/itl/boost/itl_xt/interval_bitset.hpp
is a first shot and not yet very much field tested (therefore not part
of the core library submitted for review).

With itl::interval_bitset you can currently iterate over intervals
only. I am generally reluctant to implement element iterators
for interval containers, because this would be IMO a permanent
invitation for inefficient usage. Apart from that, it wouldn't
be difficult to provide such an iterator for itl::interval_bitset.

> The other main important thing would be the ability to do in-place
> bitwise |, &, ~.

There are, even in mini::large_bitset, three different overloads for
bitwise operators (except for ~) that are performing inplace,
if the inserted range is already in the container.

As a useful addition I think about adding operators o= to
combine whole (unit-) bitsets at a given offset

interval_bitset& interval_bitset::operator o=
    (const std::pair<domain_type, bitset_type>& bits_at);

> Advancing an iterator also needs to have constant-time
> complexity for me, ...

that would be no problem (apart from the reluctance problem ;)

> ... and the bitwise operators should probably be no greater
> than n log(n).

currently
O(log(n)) for bits (elements)
O(log(n)) for unit bitsets
O(n) for intervals of elements (interval defined bitset)
O(m log(n+m)) for interval_bitsets of iterative_size n and m

Thank you for all your comments and suggestions!

Cheers
Joachim
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost