[jira] Created: (LUCENE-1879) Parallel incremental indexing

Parallel incremental indexing
-----------------------------

Key: LUCENE-1879
URL: https://issues.apache.org/jira/browse/LUCENE-1879
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Fix For: 3.1

A new feature that allows building parallel indexes and keeping them in sync on a docID level, independent of the choice of the MergePolicy/MergeScheduler.

Find details on the wiki page for this feature:
http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing

Discussion on java-dev:
http://markmail.org/thread/ql3oxzkob7aqf3jd

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...
|
|
[jira] Commented: (LUCENE-1879) Parallel incremental indexing

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749419#action_12749419 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

I have a prototype that I implemented at IBM; it works on Lucene 2.4.1. I'm not planning on committing it as is, because it is implemented on top of Lucene's APIs without any core change and is therefore not as efficient as it could be.

The software grant I have lists these files. Shall I attach the tar + md5 here and send the signed software grant to you, Grant?
|
|
[jira] Commented: (LUCENE-1879) Parallel incremental indexing

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751077#action_12751077 ]

Grant Ingersoll commented on LUCENE-1879:
-----------------------------------------

Yes on the soft. grant.
|
|
[jira] Updated: (LUCENE-1879) Parallel incremental indexing

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1879:
----------------------------------

Attachment: parallel_incremental_indexing.tar

MD5 (parallel_incremental_indexing.tar) = b9a92850ad83c4de2dd2f64db2dcceab
md5 computed on Mac OS 10.5.7

This tar file contains all files listed in the software grant. It is a prototype that works with Lucene 2.4.x only, not with current trunk. It also has some limitations mentioned before, which are not limitations of the design, but rather arise because it runs on top of Lucene's APIs (I wanted the code to run with an unmodified Lucene jar).

Next I'll work on a patch that runs with current trunk.
|
|
[jira] Commented: (LUCENE-1879) Parallel incremental indexing

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773466#action_12773466 ]

Michael McCandless commented on LUCENE-1879:
--------------------------------------------

I wonder if we could change Lucene's index format to make this feature simpler to implement...

Ie, you're having to go to great lengths (since this is built "outside" of Lucene's core) to force multiple separate indexes to share everything but the postings files (merge choices, flush, deletions files, segments files, turning off the stores, etc.).

What if we could invert this approach, so that we use only a single index/IndexWriter, but we allow "partitioned postings", where sets of fields are mapped to different postings files in the segment?

Whenever a doc is indexed, postings from the fields are then written according to this partition. Eg if I map "body" to partition 1, and "title" to partition 2, then I'd have two sets of postings files for each segment.

Could something like this work?
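
The "partitioned postings" idea in the comment above can be illustrated with a minimal sketch. It is plain Java and does not use any Lucene API; the class `PartitionedPostingsSketch`, its methods, and the in-memory maps standing in for per-partition postings files are all hypothetical names invented for illustration. The only point it tries to show is that a single writer assigns one docID per document while routing each field's postings to the partition the field is mapped to.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch (not Lucene code): one writer, postings routed per field to a partition. */
class PartitionedPostingsSketch {
    // field name -> partition number (assumed configuration, e.g. body=1, title=2)
    private final Map<String, Integer> fieldToPartition = new HashMap<>();
    // partition -> ("field:term" -> docIDs); stands in for one set of postings files
    private final Map<Integer, Map<String, List<Integer>>> partitions = new TreeMap<>();
    private int nextDocId = 0;

    void mapField(String field, int partition) {
        fieldToPartition.put(field, partition);
    }

    /** Indexes one document; every partition sees the same docID. */
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            Integer partition = fieldToPartition.get(e.getKey());
            if (partition == null) {
                throw new IllegalArgumentException("unmapped field: " + e.getKey());
            }
            Map<String, List<Integer>> postings =
                    partitions.computeIfAbsent(partition, p -> new TreeMap<>());
            for (String term : e.getValue().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> docs =
                        postings.computeIfAbsent(e.getKey() + ":" + term, k -> new ArrayList<>());
                // record each docID once, even if the term repeats within the field
                if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                    docs.add(docId);
                }
            }
        }
        return docId;
    }

    /** Returns the docIDs for a term of a field within one partition (empty if absent). */
    List<Integer> postings(int partition, String field, String term) {
        Map<String, List<Integer>> p = partitions.get(partition);
        if (p == null) {
            return new ArrayList<>();
        }
        return p.getOrDefault(field + ":" + term, new ArrayList<>());
    }

    int partitionCount() {
        return partitions.size();
    }
}
```

With "body" mapped to partition 1 and "title" to partition 2, a lookup for a title term only ever touches partition 2, and the docIDs it returns are valid in partition 1 as well because both were assigned by the same writer.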
|
|
[jira] Commented: (LUCENE-1879) Parallel incremental indexing

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773663#action_12773663 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

I realize the current implementation that's attached here is quite complicated, because it works on top of Lucene's APIs. However, I really like its flexibility. Right now you can easily rewrite certain parallel indexes without touching others. I use it in quite different ways. E.g., you can easily load one parallel index into a RAMDirectory or onto an SSD and leave the other ones on the conventional disk.

LUCENE-2025 only optimizes a certain use case of parallel indexing, where you want to (re)write a parallel index containing *only* posting lists. This will especially improve scenarios like the one Yonik pointed out a while ago on java-dev, where you want to update only a few documents, not e.g. a certain field for all documents.

In other use cases it is certainly desirable to have a parallel index that contains a store. It really depends on what data you want to update individually.

I envision the version of parallel indexing that goes into Lucene's core quite differently from the current patch here. That's why I'd like to refactor the IndexWriter (LUCENE-2026) into a SegmentWriter and, let's call it, an IndexManager (the component that controls flushing, merging, etc.). You can then have a ParallelSegmentWriter, which partitions the data into parallel segments, and the IndexManager can behave the same way as before.

You can keep thinking about the whole index as a collection of segments; it will just be a matrix of segments now instead of a one-dimensional list.

E.g., the norms could in the future be a parallel segment with a single column-stride field that you can update by writing a new generation of the parallel segment.

Things like two-dimensional merge policies will fit nicely into this model.

Different SegmentWriter implementations will allow you to write single segments in different ways, e.g. doc-at-a-time (the default one with addDocument()) or term-at-a-time (like addIndexes*() works).

So I agree we can achieve updating posting lists the way you describe, but it will be limited to posting lists then. If we allow (re)writing *segments* in both dimensions, I think we will create a more flexible approach which is independent of what data structures we add to Lucene, as long as they are not global to the index but per-segment, as most of Lucene's structures are today.

What do you think? Of course I don't want to over-complicate all this, but if we can get LUCENE-2026 right, I think we can implement parallel indexing in this segment-oriented way nicely.
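
A rough way to picture the "matrix of segments" described above is the following plain-Java sketch. It is not part of the attached prototype or of Lucene; `SegmentMatrixSketch`, `Cell`, and every method name are hypothetical. It assumes that rows correspond to segments in the document direction, columns to parallel indexes, that rewriting one cell produces a new generation of that cell only, and that a merge always combines whole rows so that docIDs stay aligned across columns.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not Lucene code): an index viewed as a matrix of segments. */
class SegmentMatrixSketch {
    /** One cell: the part of one parallel index that covers one segment's documents. */
    static final class Cell {
        final int docCount;
        final int generation;

        Cell(int docCount, int generation) {
            this.docCount = docCount;
            this.generation = generation;
        }
    }

    private final int columns;                          // number of parallel indexes
    private final List<Cell[]> rows = new ArrayList<>(); // one row per segment

    SegmentMatrixSketch(int columns) {
        this.columns = columns;
    }

    /** Flush: adds a row; every column gets a cell covering the same documents. */
    void flush(int docCount) {
        Cell[] row = new Cell[columns];
        for (int c = 0; c < columns; c++) {
            row[c] = new Cell(docCount, 0);
        }
        rows.add(row);
    }

    /** Rewrites a single cell as a new generation; the other columns are untouched. */
    void rewrite(int row, int column) {
        Cell old = rows.get(row)[column];
        rows.get(row)[column] = new Cell(old.docCount, old.generation + 1);
    }

    /** Merges rows [from, to) in all columns at once, so docIDs stay in sync. */
    void mergeRows(int from, int to) {
        int total = 0;
        for (int r = from; r < to; r++) {
            total += rows.get(r)[0].docCount;
        }
        Cell[] merged = new Cell[columns];
        for (int c = 0; c < columns; c++) {
            merged[c] = new Cell(total, 0);
        }
        rows.subList(from, to).clear();
        rows.add(from, merged);
    }

    /** True if, in every row, all columns cover the same number of documents. */
    boolean inSync() {
        for (Cell[] row : rows) {
            for (Cell cell : row) {
                if (cell.docCount != row[0].docCount) {
                    return false;
                }
            }
        }
        return true;
    }

    int rowCount() {
        return rows.size();
    }

    Cell cell(int row, int column) {
        return rows.get(row)[column];
    }
}
```

In this picture, updating something like the norms would be a `rewrite` of one column's cell, while flushing and merging remain decisions made per row, independent of how many columns exist.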
|
|
[jira] Commented: (LUCENE-1879) Parallel incremental indexing

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774265#action_12774265 ]

Michael McCandless commented on LUCENE-1879:
--------------------------------------------

This sounds great! In fact your proposal for a ParallelSegmentWriter is just like what I'm picturing -- making the switching "down low" instead of "up high" (above Lucene). This'd be more generic than just the postings files, since all index files can be separately written.

It'd then be a low-level question of whether ParallelSegmentWriter stores its files in different Directories, or, a single directory with different file names (or maybe sub-directories within a directory, or, something else). It could even use FileSwitchDirectory, eg to direct certain segment files to an SSD (another way to achieve your example).

This should also fit well into LUCENE-1458 (flexible indexing) -- one of the added test cases there creates a per-field codec wrapper that lets you use a different codec per field. Right now, this means separate file names in the same Directory for that segment, but we could allow the codecs to use different Directories (or, FSD as well) if they wanted to.

{quote}
Different SegmentWriter implementations will allow you to write single segments in different ways, e.g. doc-at-a-time (the default one with addDocument()) or term-at-a-time (like addIndexes*() works).
{quote}

Can you elaborate on this? How is addIndexes* term-at-a-time?

{quote}
If we allow (re)writing segments in both dimensions I think we will create a more flexible approach which is independent of what data structures we add to Lucene
{quote}

Dimension 1 is the docs, and dimension 2 is the assignment of fields into separate partitions?
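
The suggestion above of sending certain segment files to an SSD comes down to routing files by name. The sketch below shows that routing idea only, in plain Java; it does not call Lucene's FileSwitchDirectory, and `ExtensionRouterSketch`, its fields, and the example locations are hypothetical. The extensions used in the usage note are just examples of file types one might treat as postings files.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch (not Lucene code): choose a storage location by file extension. */
class ExtensionRouterSketch {
    private final Set<String> primaryExtensions;
    private final String primaryLocation;
    private final String secondaryLocation;

    ExtensionRouterSketch(Set<String> primaryExtensions,
                          String primaryLocation,
                          String secondaryLocation) {
        this.primaryExtensions = new HashSet<>(primaryExtensions);
        this.primaryLocation = primaryLocation;
        this.secondaryLocation = secondaryLocation;
    }

    /** Returns the part after the last dot, or an empty string if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    /** Files with a primary extension go to the primary location, all others to the secondary. */
    String locationFor(String fileName) {
        return primaryExtensions.contains(extensionOf(fileName))
                ? primaryLocation
                : secondaryLocation;
    }
}
```

For example, a router built with the extensions `frq` and `prx`, a primary location of `/mnt/ssd`, and a secondary location of `/mnt/disk` would place `_0.frq` on the SSD and leave `_0.fdt` and `segments_2` on the conventional disk.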
|
|
[jira] Commented: (LUCENE-1879) Parallel incremental indexing

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774329#action_12774329 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

{quote}
This sounds great! In fact your proposal for a ParallelSegmentWriter is just like what I'm picturing - making the switching "down low" instead of "up high" (above Lucene). This'd be more generic than just the postings files, since all index files can be separately written.
{quote}

Right. The goal should be to be able to use this for updating Lucene-internal things (like norms, column-stride fields), but also to give advanced users APIs, so that they can partition their data into parallel indexes according to their update requirements (which the current "above Lucene" approach allows).

{quote}
It'd then be a low-level question of whether ParallelSegmentWriter stores its files in different Directories, or, a single directory with different file names (or maybe sub-directories within a directory, or, something else). It could even use FileSwitchDirectory, eg to direct certain segment files to an SSD (another way to achieve your example).
{quote}

Exactly! We should also keep the distributed indexing use case in mind here. It could make sense for systems like Katta to not only shard in the document direction.

{quote}
This should also fit well into LUCENE-1458
{quote}

Sounds great!
|
|
[jira] Commented: (LUCENE-1879) Parallel incremental indexing

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774338#action_12774338 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

{quote}
Can you elaborate on this? How is addIndexes* term-at-a-time?
{quote}

Let's say we have an index 1 with two fields a and b, and you want to create a new parallel index 2 into which you copy all posting lists of field b. You can achieve this with addDocument() if you iterate over all posting lists in 1b in parallel and, for each document in 1, create a corresponding document in 2 that contains the terms of those posting lists from 1b that have a posting for the current document. This is what I called the "document-at-a-time approach".

However, this is terribly slow (I tried it out), because of all the posting lists you perform I/O on in parallel. It's far more efficient to copy an entire posting list over from 1b to 2, because then you only perform sequential I/O. And if you use 2.addIndexes(IndexReader(1b)), then exactly this happens, because addIndexes(IndexReader) uses the SegmentMerger to add the index. The SegmentMerger iterates the dictionary and consumes the posting lists sequentially. That's why I called this the "term-at-a-time approach". In my experience, for a use case similar to the one I described here, this is orders of magnitude more efficient. My doc-at-a-time algorithm ran ~20 hours, the term-at-a-time one 8 minutes! The resulting indexes were identical.
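
The difference between the two copy strategies described above can be made concrete with a small in-memory model. This is a sketch under stated assumptions, not the prototype's code and not Lucene's SegmentMerger: the source "index" is a sorted map from term to a sorted array of docIDs, and the hypothetical `SeekCounter` counts how often the reader switches to a different posting list, which is used here as a crude stand-in for non-sequential I/O. All class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch (not Lucene code): doc-at-a-time vs. term-at-a-time posting copy. */
class PostingCopySketch {
    /** Counts switches between posting lists, a rough proxy for disk seeks. */
    static final class SeekCounter {
        private String current;
        int seeks;

        void touch(String term) {
            if (!term.equals(current)) {
                seeks++;
                current = term;
            }
        }
    }

    /** Doc-at-a-time: advance all posting lists in parallel, one document after another. */
    static Map<String, List<Integer>> copyDocAtATime(
            Map<String, int[]> source, int maxDoc, SeekCounter counter) {
        Map<String, Integer> cursors = new TreeMap<>();
        for (String term : source.keySet()) {
            cursors.put(term, 0);
        }
        Map<String, List<Integer>> target = new TreeMap<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            for (Map.Entry<String, int[]> e : source.entrySet()) {
                int pos = cursors.get(e.getKey());
                int[] list = e.getValue();
                if (pos < list.length && list[pos] == doc) {
                    counter.touch(e.getKey()); // reading this posting visits this list
                    target.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(doc);
                    cursors.put(e.getKey(), pos + 1);
                }
            }
        }
        return target;
    }

    /** Term-at-a-time: copy each posting list in one sequential pass. */
    static Map<String, List<Integer>> copyTermAtATime(
            Map<String, int[]> source, SeekCounter counter) {
        Map<String, List<Integer>> target = new TreeMap<>();
        for (Map.Entry<String, int[]> e : source.entrySet()) {
            if (e.getValue().length == 0) {
                continue; // nothing to copy for an empty list
            }
            counter.touch(e.getKey());
            List<Integer> copy = new ArrayList<>();
            for (int doc : e.getValue()) {
                copy.add(doc);
            }
            target.put(e.getKey(), copy);
        }
        return target;
    }
}
```

Both methods produce the same target map. In this model the term-at-a-time copy switches lists once per term, whereas the doc-at-a-time copy can switch lists up to once per posting, which is the access pattern the comment above blames for the slowdown. The model says nothing about actual running times.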
|
|
[jira] Commented: (LUCENE-1879) Parallel incremental indexing

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774340#action_12774340 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

{quote}
Dimension 1 is the docs, and dimension 2 is the assignment of fields into separate partitions?
{quote}

Yes, dimension 1 is unambiguously the docs. Dimension 2 can be the assignment of fields to separate parallel indexes, or also what we today call generations, e.g. for the norms files.