merging Parallel indexes (can indexWriter.addIndexesNoOptimize be used?)


by Britske

Given two parallel indexes which contain the same products but different fields, one with slowly changing fields and one with fields which are updated regularly:

Is it possible to periodically merge these to form a single index?  (thereby representing a frozen snapshot in time)

For example: Can indexWriter.addIndexesNoOptimize handle this, or was it (only) designed for merging shards?
If not, is there another option (3rd party or not) to use, or would I have to resort to low-level hacking?

Thanks,
Geert-Jan

Re: merging Parallel indexes (can indexWriter.addIndexesNoOptimize be used?)

by Michael McCandless-2

addIndexesNoOptimize is only for shards.

But this [pending patch/contribution] is similar to what you're seeking, I think:

  https://issues.apache.org/jira/browse/LUCENE-1879

It does not actually merge the indexes, but rather keeps 2 parallel
indexes in sync so you can use ParallelReader to search them
coherently.
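
To make the "parallel indexes" idea concrete, here is a minimal sketch in plain Java. It is not Lucene code and does not use ParallelReader; the class name ParallelViewSketch and the method combine are hypothetical. It only assumes what the thread states: both indexes hold the same documents in the same order, each with a different set of fields, so the document at position i in one index lines up with position i in the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of parallel indexes: entry i in each list describes the same product. */
class ParallelViewSketch {

    /** Builds one logical document per doc ID by joining the fields of both indexes. */
    static List<Map<String, String>> combine(List<Map<String, String>> slow,
                                             List<Map<String, String>> fast) {
        // Parallel indexes only line up if they contain the same number of documents.
        if (slow.size() != fast.size()) {
            throw new IllegalArgumentException("parallel indexes must have the same document count");
        }
        List<Map<String, String>> merged = new ArrayList<>();
        for (int docId = 0; docId < slow.size(); docId++) {
            Map<String, String> doc = new LinkedHashMap<>(slow.get(docId));
            doc.putAll(fast.get(docId)); // assumption: field names do not overlap
            merged.add(doc);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, String>> slow = Arrays.asList(
                Collections.singletonMap("title", "lamp"),
                Collections.singletonMap("title", "desk"));
        List<Map<String, String>> fast = Arrays.asList(
                Collections.singletonMap("price", "10"),
                Collections.singletonMap("price", "99"));
        System.out.println(combine(slow, fast));
    }
}
```

The size check mirrors why keeping the two indexes in sync matters: once the document numbering diverges, joining by position pairs fields from different products.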

Mike


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@...
For additional commands, e-mail: java-user-help@...


Re: merging Parallel indexes (can indexWriter.addIndexesNoOptimize be used?)

by Britske

Thanks, but it's already guaranteed that the indexes are in sync. So I could (and do) use ParallelReader to search them both at the same time. This is what my running index looks like.

However, at certain points I was considering storing a frozen index built from the parallel index for backup and other purposes. I figured having it merged would shave off some complexity.

Perhaps this should go to the dev list, but anyway: before giving this a rest, would it be hard to write some low-level code to merge the two? I've never touched the low-level classes and methods (inside documentWriter), but I'm looking for a way to directly change the segment files: .frq, .tis, and .tii in particular. The rest can remain untouched for my setup, if I'm correct. In other words, I would like to bypass writer.addDocument(), because I believe being able to change the index files directly would be far more efficient for my situation.

What classes, methods would I need to look into for changing / writing these files directly?

Some background info on why I think a merge could be done rather efficiently in this particular situation, if I had access to these low-level methods:

- we have index A with stored and indexed fields, index B with only indexed fields (with omitTF / omitNorms = true)
- merge index B into index A.
--> probably no need to change Field Index (.fdx) and Field Data (.fdt) (because nothing is stored in B and the docids are in order)
--> no change to Positions (.prx) and Normalization Factors (.nrm) (because omitTF / omitNorms = true)

- all fields in index B are prefixed with a particular sequence
--> this means I could drop in the terms sequentially in Term Infos (.tis) and Term Info Index (.tii) (because of the lexicographical ordering of these files and the prefixed fields in index B)
--> similarly, because Frequencies (.frq) depends on the ordering of .tis, I could drop the .frq of index B into the correct position of the .frq of index A and be done with it (again because of the same prefix used on all fields of index B)

Would this work? And where to start looking?

Thanks in advance,
Geert-Jan




Re: merging Parallel indexes (can indexWriter.addIndexesNoOptimize be used?)

by Michael McCandless-2

Roughly, your approach sounds correct.  You essentially need to
concatenate tis, tii, frq, prx, while adjusting all absolute pointers
accordingly.  If you look at how SegmentMerger, in its
mergeTerms/mergeTermInfos/appendPostings, makes use of the
FormatPostingsFields/Terms/Docs/PositionsConsumer, that's probably a
good start.  But my guess is there are devils in the details....
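
The "concatenate, but adjust all absolute pointers" step can be illustrated with a toy model in plain Java. This is only a sketch under simplifying assumptions, not Lucene's file format or SegmentMerger logic: the class ConcatPostingsSketch and its methods are hypothetical, a term dictionary is modelled as a sorted list of (term, absolute offset into a postings byte array), and all terms of index B are assumed to sort after all terms of index A (the simplest case of the field-prefix trick).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model: append B's postings after A's and shift B's offsets by A's length. */
class ConcatPostingsSketch {

    /** One dictionary entry: the term and the absolute offset of its postings. */
    static final class TermEntry {
        final String term;
        final long postingsOffset;

        TermEntry(String term, long postingsOffset) {
            this.term = term;
            this.postingsOffset = postingsOffset;
        }
    }

    /** Appends B's entries to A's, rebasing every B offset onto the combined postings. */
    static List<TermEntry> concatTerms(List<TermEntry> a, long aPostingsLength, List<TermEntry> b) {
        // Simplifying assumption: B's terms all sort after A's, so plain appending keeps order.
        if (!a.isEmpty() && !b.isEmpty()
                && a.get(a.size() - 1).term.compareTo(b.get(0).term) >= 0) {
            throw new IllegalArgumentException("terms of B must sort after terms of A");
        }
        List<TermEntry> merged = new ArrayList<>(a);
        for (TermEntry entry : b) {
            merged.add(new TermEntry(entry.term, entry.postingsOffset + aPostingsLength));
        }
        return merged;
    }

    /** Byte-level concatenation of the two postings arrays. */
    static byte[] concatPostings(byte[] a, byte[] b) {
        byte[] merged = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, merged, a.length, b.length);
        return merged;
    }

    public static void main(String[] args) {
        List<TermEntry> a = Arrays.asList(new TermEntry("name:desk", 0), new TermEntry("name:lamp", 3));
        List<TermEntry> b = Arrays.asList(new TermEntry("zz_price:10", 0), new TermEntry("zz_price:99", 2));
        byte[] postingsA = {1, 2, 3, 4, 5};
        byte[] postingsB = {6, 7, 8};
        for (TermEntry e : concatTerms(a, postingsA.length, b)) {
            System.out.println(e.term + " -> " + e.postingsOffset);
        }
        System.out.println(concatPostings(postingsA, postingsB).length);
    }
}
```

In a real segment there are more pointers than this (for example into .prx and the skip data), and B's terms may have to be inserted in the middle rather than appended, which is presumably where the details get hard.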

Oh, actually another option is to just use addIndexes(IndexReader[]),
passing in your ParallelReader?

Mike



Re: merging Parallel indexes (can indexWriter.addIndexesNoOptimize be used?)

by Jérôme Thièvre

Hello Geert-Jan,

it's possible to merge several parallel physical indexes (viewed as one
logical index through a ParallelReader).
Just use the method IndexWriter.addIndexes(IndexReader[] readers):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ParallelReader;

IndexReader[] physicalReaders = ...; // Your readers here

IndexWriter iw = new IndexWriter(...); // Writer for the merged target index
// true = closing the ParallelReader also closes its subreaders
ParallelReader parallelReader = new ParallelReader(true);

for (IndexReader reader : physicalReaders) {
    parallelReader.add(reader);
}

// Merge the single logical view into the target index
IndexReader[] readersToMerge = { parallelReader };
iw.addIndexes(readersToMerge);
// Optional optimize
iw.optimize();
iw.close();
parallelReader.close();


I used this trick to deploy an index made of 4 parallel indexes. Using
ParallelReader on large indexes is not very efficient for searching,
but I need to use parallel indexes because one of them is huge but
evolves very slowly compared to the others.

Jérôme



Re: merging Parallel indexes (can indexWriter.addIndexesNoOptimize be used?)

by Britske

Yeah, passing in the ParallelReader should work. I'll try that, thanks!

I'm probably still going to look into this low-level stuff though, also because some indexes I use (index B in my example) are pretty costly to index at the moment. At the same time, the indexing client has a lot of knowledge about the documents it's indexing: with a little extra coding it could very efficiently know the inverted indexes of all indexed fields for the docs it's processing.

So basically it very efficiently knows the contents of .frq. It just needs to get them into the correct format. This would save tremendous amounts of indexing time. (I'm ignoring .tii and .tis for the moment, but these should be relatively easy and efficient to figure out from this data as well, if I'm correct.)
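
As a rough illustration of what "getting it into the correct format" involves, the sketch below encodes one term's ascending doc IDs as gaps written with a variable-length byte encoding in the style of Lucene's VInt (7 data bits per byte, low-order bits first, high bit set when more bytes follow). It assumes term frequencies are omitted, so only doc gaps are written; the real .frq file also carries skip data and, when frequencies are kept, packs them in with the gaps. The class DocDeltaSketch and its methods are hypothetical names, not Lucene API.

```java
import java.io.ByteArrayOutputStream;

/** Toy encoder: ascending doc IDs -> gaps -> variable-length bytes. */
class DocDeltaSketch {

    /** Writes a non-negative int using 7 bits per byte, low-order bits first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes the doc IDs of one term as gaps from the previous doc ID. */
    static byte[] encodeDocIds(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            if (docId < previous) {
                throw new IllegalArgumentException("doc IDs must be ascending");
            }
            writeVInt(out, docId - previous);
            previous = docId;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Gaps for 5, 7, 300 are 5, 2 and 293; the last gap needs two bytes.
        byte[] encoded = encodeDocIds(new int[] {5, 7, 300});
        System.out.println(encoded.length + " bytes");
    }
}
```

Small gaps fit in a single byte, which is why postings for frequent terms stay compact when doc IDs are stored as differences rather than absolute values.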

I figure Flexible Indexing would give me a far better interface to plug this into, right?

Anyway, thanks for the tip. And yes there will be dragons I'm sure ;-)

Best,
Geert-Jan



Re: merging Parallel indexes (can indexWriter.addIndexesNoOptimize be used?)

by Britske

Yeah excellent! This should indeed work!

Thanks,
Geert-Jan
