[jira] Created: (LUCENE-2026) Refactoring of IndexWriter


[jira] Created: (LUCENE-2026) Refactoring of IndexWriter

by JIRA jira@apache.org

Refactoring of IndexWriter
--------------------------

                 Key: LUCENE-2026
                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael Busch
            Assignee: Michael Busch
            Priority: Minor
             Fix For: 3.1


I've been thinking for a while about refactoring the IndexWriter into
two main components.

One could be called a SegmentWriter and, as the
name says, its job would be to write one particular index segment. The
default one, just as today, would provide methods to add documents and
would flush when its buffer is full.
Other SegmentWriter implementations would do things like appending or
copying external segments [what addIndexes*() currently does].

The second component's job would be to manage writing the segments
file and merging/deleting segments. It would know about
DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
provide hooks that allow users to manage external data structures and
keep them in sync with Lucene's data during segment merges.

API wise there are things we have to figure out, such as where the
updateDocument() method would fit in, because its deletion part
affects all segments, whereas the new document is only being added to
the new segment.

Of course these should be lower level APIs for things like parallel
indexing and related use cases. That's why we should still provide
easy to use APIs like today for people who don't need to care about
per-segment ops during indexing. So the current IndexWriter could
probably keep most of its APIs and delegate to the new classes.
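
A minimal sketch of how such a split might look, assuming nothing about the eventual design. Every name here (SegmentWriter as an interface, SegmentManager, BufferingSegmentWriter, SimpleIndexWriter, MergeListener) is hypothetical, a String stands in for a Lucene Document, and no real index files are written. It only illustrates the delegation: a facade keeps the simple addDocument() API, a per-segment writer buffers documents, and a second component owns the segment list and offers a merge hook for external data structures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; none of them exist in Lucene.

/** Writes exactly one segment. */
interface SegmentWriter {
    void addDocument(String doc);   // a String stands in for a Document
    String finish();                // returns the name of the finished segment
}

/** Second component: owns the segment list and notifies merge listeners. */
class SegmentManager {
    interface MergeListener {
        void onMerge(List<String> merged, String result);
    }

    private final List<String> segments = new ArrayList<>();
    private final List<MergeListener> listeners = new ArrayList<>();
    private int counter = 0;

    String nextSegmentName() { return "_" + (counter++); }
    void register(String segment) { segments.add(segment); }
    void addMergeListener(MergeListener l) { listeners.add(l); }
    List<String> segments() { return new ArrayList<>(segments); }

    /** Replaces all current segments with one merged segment and fires the hook. */
    String mergeAll() {
        List<String> merged = new ArrayList<>(segments);
        String result = nextSegmentName();
        segments.clear();
        segments.add(result);
        for (MergeListener l : listeners) {
            l.onMerge(merged, result);
        }
        return result;
    }
}

/** Default writer: buffers documents until the buffer is full. */
class BufferingSegmentWriter implements SegmentWriter {
    private final String name;
    private final int maxBufferedDocs;
    private final List<String> buffer = new ArrayList<>();

    BufferingSegmentWriter(String name, int maxBufferedDocs) {
        this.name = name;
        this.maxBufferedDocs = maxBufferedDocs;
    }

    @Override public void addDocument(String doc) { buffer.add(doc); }
    boolean isFull() { return buffer.size() >= maxBufferedDocs; }
    @Override public String finish() { buffer.clear(); return name; }
}

/** Facade with the familiar easy API, delegating to the two components. */
class SimpleIndexWriter {
    private final SegmentManager manager = new SegmentManager();
    private final int maxBufferedDocs;
    private BufferingSegmentWriter current;

    SimpleIndexWriter(int maxBufferedDocs) { this.maxBufferedDocs = maxBufferedDocs; }

    void addDocument(String doc) {
        if (current == null) {
            current = new BufferingSegmentWriter(manager.nextSegmentName(), maxBufferedDocs);
        }
        current.addDocument(doc);
        if (current.isFull()) {
            flush();
        }
    }

    /** Finishes the in-progress segment, if any, and registers it. */
    void flush() {
        if (current != null) {
            manager.register(current.finish());
            current = null;
        }
    }

    SegmentManager manager() { return manager; }
}
```

The open question about updateDocument() is deliberately left out of the sketch: its delete half would have to go through the segment-managing component, while its add half would go to the current SegmentWriter.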

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...


[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773329#action_12773329 ]

John Wang commented on LUCENE-2026:
-----------------------------------

+1



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773429#action_12773429 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

+1!  IndexWriter has become immense.

I think we should also pull out ReaderPool?



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773432#action_12773432 ]

Michael Busch commented on LUCENE-2026:
---------------------------------------

{quote}
I think we should also pull out ReaderPool?
{quote}

+1!



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788838#action_12788838 ]

Earwin Burrfoot commented on LUCENE-2026:
-----------------------------------------

We need an ability to see segment write (and probably deleted doc list write) as a discernible atomic operation. Right now it looks like several file writes, and we can't, say - redirect all files belonging to a certain segment to another Directory (well, in a simple manner). 'Something' should sit between a Directory (or several Directories) and IndexWriter.

If we could do this, the current NRT search implementation will be largely obsoleted, innit? Just override the default impl of 'something' and send smaller segments to ram, bigger to disk, copy ram segments to disk asynchronously if we want to. Then we can use your grandma's IndexReader and IndexWriter, totally decoupled from each other, and have blazing fast addDocument-commit-reopen turnaround.
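
A rough sketch of that 'something', under loud assumptions: SegmentStore and SizeRoutingStore are made-up names, not Lucene classes, and two plain maps stand in for a RAM Directory and an on-disk Directory. The point is only that once a segment arrives as one unit, routing it by size becomes a single decision.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these types are not part of Lucene, and the maps
// merely stand in for a RAM-based and a disk-based Directory.

/** Receives a whole segment as one unit instead of a series of file writes. */
interface SegmentStore {
    void writeSegment(String name, byte[] data);
}

/** Sends small segments to RAM and larger ones to disk. */
class SizeRoutingStore implements SegmentStore {
    final Map<String, byte[]> ram = new HashMap<>();
    final Map<String, byte[]> disk = new HashMap<>();
    private final int ramThresholdBytes;

    SizeRoutingStore(int ramThresholdBytes) {
        this.ramThresholdBytes = ramThresholdBytes;
    }

    @Override
    public void writeSegment(String name, byte[] data) {
        if (data.length <= ramThresholdBytes) {
            ram.put(name, data);
        } else {
            disk.put(name, data);
        }
    }

    /** Moves every RAM-resident segment to disk; could be run from a background thread. */
    void spillToDisk() {
        disk.putAll(ram);
        ram.clear();
    }
}
```

Whether this actually beats the OS write cache is a separate, empirical question that the sketch does not address.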



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788840#action_12788840 ]

Earwin Burrfoot commented on LUCENE-2026:
-----------------------------------------

Oh, forgive me if I just said something stupid :)



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788856#action_12788856 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

I think what you're describing is in fact the approach that LUCENE-1313 is taking; it's doing the switching internally between the main Dir & a private RAM Dir.

But in my testing so far (LUCENE-2061), it doesn't seem like it'll help performance much.  I.e., the OS generally seems to do a fine job of keeping those segments in RAM itself, by maintaining a write cache.  The weirdness is: that only holds true if you flush the segments when they are tiny (once per second, every 100 docs, in my test) -- not yet sure why that's the case.  I'm going to re-run perf tests on a more mainstream OS (my tests are all OpenSolaris) and see if that strangeness still happens.

But I think you still need to not do commit() during the reopen.

I do think refactoring IW so that there is a separate component that keeps track of segments in the index, may simplify NRT, in that you can go to that source for your current "segments file" even if that segments file is uncommitted.  In such a world you could do something like IndexReader.open(SegmentState) and it would be able to open (and, reopen) the real-time reader.  It's just that it's seeing changes to the SegmentState done by the writer, even if they're not yet committed.
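
To make the idea concrete, here is a small sketch, assuming a hypothetical SegmentState class and a stand-in SnapshotReader (neither exists in Lucene, and IndexReader.open(SegmentState) above is itself only a proposal). The writer mutates the shared state; a reader opened from it takes a point-in-time snapshot that includes uncommitted segments, and reopen() takes a fresh snapshot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; SegmentState and SnapshotReader are not Lucene classes.

/** Tracks the writer's current segment list and the last committed one. */
class SegmentState {
    private final List<String> current = new ArrayList<>();
    private List<String> committed = Collections.emptyList();

    synchronized void addSegment(String name) { current.add(name); }

    /** Records the current segment list as the committed one. */
    synchronized void commit() {
        committed = Collections.unmodifiableList(new ArrayList<>(current));
    }

    /** Point-in-time copy that includes uncommitted segments. */
    synchronized List<String> currentSnapshot() {
        return Collections.unmodifiableList(new ArrayList<>(current));
    }

    synchronized List<String> committedSnapshot() { return committed; }
}

/** Stand-in for a real-time reader opened from the shared state. */
class SnapshotReader {
    private final SegmentState state;
    private final List<String> segments;

    private SnapshotReader(SegmentState state, List<String> segments) {
        this.state = state;
        this.segments = segments;
    }

    static SnapshotReader open(SegmentState state) {
        return new SnapshotReader(state, state.currentSnapshot());
    }

    /** Returns a new reader that sees whatever the writer has added since. */
    SnapshotReader reopen() { return open(state); }

    List<String> segments() { return segments; }
}
```

An already opened reader never changes; only a reopen picks up new segments, committed or not.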



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789473#action_12789473 ]

Earwin Burrfoot commented on LUCENE-2026:
-----------------------------------------

If I understand everything right, with current uberfast reopens (thanks per-segment search), the only thing that makes index/commit/reopen cycle slow is the 'sync' call. That sync call on a memory-based Directory is a no-op.

And no, you really should commit() to be able to see stuff on reopen() :) My god, seeing changes that aren't yet committed - that violates the meaning of 'commit'.

The original purpose of current NRT code was.. well.. let me remember.. NRT search! :) With per-segment caches and sync lag defeated you get the delay between doc being indexed and becoming searchable under tens of milliseconds. Is that not NRT enough to introduce tight coupling between classes that have absolutely no other reason to be coupled??
Lucene 4.0. Simplicity is our candidate! Vote for Simplicity!

*: Okay, there remains an issue of merges that piggyback on commits, so writing and committing one smallish segment suddenly becomes a time-consuming operation. But that's a completely separate issue. Go, fix your mergepolicies and have a thread that merges asynchronously.



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789555#action_12789555 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

bq. If I understand everything right, with current uberfast reopens (thanks per-segment search), the only thing that makes index/commit/reopen cycle slow is the 'sync' call.

I agree, per-segment searching was the most important step towards
NRT.  It's a great step forward...

But the fsync call is a killer, so avoiding it in the NRT path is
necessary.  It's also very OS/FS dependent.

bq. That sync call on memory-based Directory is noop.

Until you need to spillover to disk because your RAM buffer is full?

Also, if IW.commit() is called, I would expect any changes in RAM
should be committed to the real dir (stable storage)?

And, going through RAM first will necessarily be a hit on indexing
throughput (Jake estimates 10% hit in Zoie's case).  Really, our
current approach goes through RAM as well, in that OS's write cache
(if the machine has spare RAM) will quickly accept the small index
files & write them in the BG.  It's not clear we can do better than
the OS here...

bq. And no, you really should commit() to be able to see stuff on reopen()  My god, seeing changes that aren't yet committed - that violates the meaning of 'commit'.

Uh, this is an API that clearly states that its purpose is to search
the uncommitted changes.  If you really want to be "pure"
transactional, don't use this API ;)

bq. The original purpose of current NRT code was.. well.. let me remember.. NRT search!  With per-segment caches and sync lag defeated you get the delay between doc being indexed and becoming searchable under tens of milliseconds. Is that not NRT enough to introduce tight coupling between classes that have absolutely no other reason to be coupled?? Lucene 4.0. Simplicity is our candidate! Vote for Simplicity!

In fact I favor our current approach because of its simplicity.

Have a look at LUCENE-1313 (adds RAMDir as you're discussing), or,
Zoie, which also adds the RAMDir and backgrounds resolving deleted
docs -- they add complexity to Lucene that I don't think is warranted.

My general feeling at this point is with per-segment searching, and
fsync avoided, NRT performance is excellent.

We've explored a number of possible tweaks to improve it --
writing first to RAMDir (LUCENE-1313), resolving deletes in the
foreground (LUCENE-2047), using paged BitVector for deletions
(LUCENE-1526), Zoie (buffering segments in RAM & backgrounds resolving
deletes), etc., but, based on testing so far, I don't see the
justification for the added complexity.

bq. *: Okay, there remains an issue of merges that piggyback on commits, so writing and committing one smallish segment suddenly becomes a time-consuming operation. But that's a completely separate issue. Go, fix your mergepolicies and have a thread that merges asynchronously.

This already runs in the BG by default.  But warming the reader on the
merged segment (before lighting it) is important (IW does this today).




[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789604#action_12789604 ]

Earwin Burrfoot commented on LUCENE-2026:
-----------------------------------------

bq. Until you need to spillover to disk because your RAM buffer is full?
No, buffer is there only to decouple indexing from writing. Can be spilt over asynchronously without waiting for it to be filled up.

Okay, we agree on a zillion things, except simplicity of the current NRT, and approach to commit().

Good commit() behaviour consists of two parts:
1. Everything commit()ed is guaranteed to be on disk.
2. Until commit() is called, reading threads don't see new/updated records.

Now we want more speed, and are ready to sacrifice something if needed.
You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers to see its hot, fresh innards.

I say it's better to sacrifice write guarantee. In the rare case the process/machine crashes, you can reindex last few minutes' worth of docs. Now you don't have to hack into IW and write specialized readers. Hence, simplicity. You have only one straightforward writer, you have only one straightforward reader (which is nicely immutable and doesn't need any synchronization code).

In fact you don't even need to sacrifice write guarantee. What was the reason for it? The only one I can come up with is - the thread that does writes and sync() is different from the thread that calls commit(). But, commit() can return a Future.
So the process goes as:
- You index docs, nobody sees them, nor deletions.
- You call commit(), the docs/deletes are written down to memory (NRT case)/disk (non-NRT case). Right after calling commit() every newly reopened Reader is guaranteed to see your docs/deletes.
- Background thread does write-to-disk+sync (NRT case)/just sync (non-NRT case), and completes the Future returned from commit(). At this point all data is guaranteed to be written and braced for a crash, ram cache or not, OS/raid controller cache or not.

For back-compat purposes we can use another name for that Future-returning-commit(), and current commit() will just call this new method and wait on future returned.
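
The process above can be sketched roughly as follows. This is a toy under stated assumptions, not Lucene code: FutureCommitWriter, commitAsync(), openReader() and durableDocs() are hypothetical names, in-memory lists stand in for the index and for stable storage, and CompletableFuture (from a much later Java than this thread) plays the role of the Future.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; none of these names are Lucene API.
class FutureCommitWriter implements AutoCloseable {
    private final List<String> pending = new ArrayList<>();  // indexed, not yet visible
    private final List<String> visible = new ArrayList<>();  // seen by newly opened readers
    private final List<String> durable = new ArrayList<>();  // stands in for synced storage
    private final ExecutorService syncThread = Executors.newSingleThreadExecutor();

    synchronized void addDocument(String doc) { pending.add(doc); }

    /**
     * Makes pending docs visible immediately. The returned future completes
     * once the background thread has finished the (simulated) sync.
     */
    synchronized CompletableFuture<Void> commitAsync() {
        visible.addAll(pending);
        pending.clear();
        List<String> toSync = new ArrayList<>(visible);
        return CompletableFuture.runAsync(() -> sync(toSync), syncThread);
    }

    /** Back-compat style commit(): calls the new method and waits on the future. */
    void commit() { commitAsync().join(); }

    private synchronized void sync(List<String> docs) {
        durable.clear();
        durable.addAll(docs);
    }

    /** A "reader" here is just a snapshot of the visible docs. */
    synchronized List<String> openReader() { return new ArrayList<>(visible); }

    synchronized List<String> durableDocs() { return new ArrayList<>(durable); }

    @Override public void close() { syncThread.shutdown(); }
}
```

Note that commit() itself is not synchronized: it must release the writer's lock before waiting, otherwise the background sync could never acquire it.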

Okay, with that I'm probably shutting up on the topic until I can back myself up with code. Sadly, my current employer is happy with update lag in tens of seconds :)



[jira] Issue Comment Edited: (LUCENE-2026) Refactoring of IndexWriter



    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789604#action_12789604 ]

Earwin Burrfoot edited comment on LUCENE-2026 at 12/11/09 11:19 PM:
--------------------------------------------------------------------

bq. Until you need to spillover to disk because your RAM buffer is full?
No, the buffer is there only to decouple indexing from writing. It can be spilled over asynchronously without waiting for it to fill up.

Okay, we agree on a zillion things, except the simplicity of the current NRT and the approach to commit().

Good commit() behaviour consists of two parts:
1. Everything commit()ed is guaranteed to be on disk.
2. Until commit() is called, reading threads don't see new/updated records.

Now we want more speed, and are ready to sacrifice something if needed.
You decide to sacrifice new record (in)visibility. No choice but to hack into IW to allow readers to see its hot, fresh innards.

I say it's better to sacrifice the write guarantee. In the rare case the process/machine crashes, you can reindex the last few minutes' worth of docs. Now you don't have to hack into IW and write specialized readers. Hence, simplicity. You have only one straightforward writer, and you have only one straightforward reader (which is nicely immutable and doesn't need any synchronization code).

In fact you don't even need to sacrifice write guarantee. What was the reason for it? The only one I can come up with is - the thread that does writes and sync() is different from the thread that calls commit(). But, commit() can return a Future.
So the process goes as:
- You index docs, nobody sees them, nor deletions.
- You call commit(), the docs/deletes are written down to memory (NRT case)/disk (non-NRT case). Right after calling commit() every newly reopened Reader is guaranteed to see your docs/deletes.
- A background thread does write-to-disk + sync (NRT case) or just sync (non-NRT case), and completes the Future returned from commit(). At this point all data is guaranteed to be written and braced for a crash, RAM cache or not, OS/RAID controller cache or not.

For back-compat purposes we can use another name for that Future-returning commit(), and the current commit() will just call this new method and wait on the returned Future.

Okay, with that I'm probably shutting up on the topic until I can back myself up with code. Sadly, my current employer is happy with update lag in tens of seconds :)



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter



    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789614#action_12789614 ]

Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

> I say it's better to sacrifice write guarantee.

I don't grok why sync is the default, especially given how sketchy hardware
drivers are about obeying fsync:    

{panel}
  But, beware: some hardware devices may in fact cache writes even during
  fsync,  and return before the bits are actually on stable storage, to give the    
  appearance of faster performance.
{panel}

IMO, it should have been an option which defaults to false, to be enabled only by
users who have the expertise to ensure that fsync() is actually doing what
it advertises. But what's done is done (and Lucy will probably just do something
different.)

With regard to Lucene NRT, though, turning sync() off would really help.  If and
when some sort of settings class comes about, an enableSync(boolean enabled)
method seems like it would come in handy.


[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter



    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789618#action_12789618 ]

Jake Mannix commented on LUCENE-2026:
-------------------------------------

bq. Now we want more speed, and are ready to sacrifice something if needed.
bq. You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers see its hot, fresh innards.

Chiming in here that of course, you don't *need* (i.e., there is a choice) to hack into the IW to do this.  Zoie is a completely user-land solution which modifies no IW/IR internals and yet achieves millisecond index-to-query-visibility turnaround while keeping speedy indexing and query performance.  It just keeps the RAMDir outside, encapsulated in an object (an IndexingSystem) which has IndexReaders built off of both the RAMDir and the FSDir, and hides the implementation details (in fact the IW itself) from the user.

The API for this kind of thing doesn't *have* to be tightly coupled, and I would agree with you that it shouldn't be.


[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter



    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789708#action_12789708 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

{quote}
bq. Until you need to spillover to disk because your RAM buffer is full?

No, buffer is there only to decouple indexing from writing. Can be spilt over asynchronously without waiting for it to be filled up.
{quote}

But this is where things start to get complex... the devil is in the
details here.  How do you carry over your deletes?  This spillover
will take time -- do you block all indexing while that's happening
(not great)?  Do you do it gradually (start spillover when half full,
but still accept indexing)?  Do you throttle things if index rate
exceeds flush rate?  How do you recover on exception?

NRT today lets the OS's write cache decide how to use RAM to speed up
writing of these small files, which keeps things a lot simpler for us.
I don't see why we should add complexity to Lucene to replicate what
the OS is doing for us (NOTE: I don't really trust the OS in the
reverse case... I do think Lucene should read into RAM the data
structures that are important).

bq. You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers see its hot, fresh innards.

bq. Now you don't have to hack into IW and write specialized readers.

Probably we'll just have to disagree here... NRT isn't a hack ;)

IW is already hanging onto completely normal segments.  Ie, the index
has been updated with these segments, just not yet published so
outside readers can see it.  All NRT does is let a reader see this
private view.

The readers that an NRT reader exposes are normal SegmentReaders --
it's just that rather than consulting a segments_N on disk to get the
segment metadata, they pull it from IW's uncommitted in-memory
SegmentInfos instance.

Yes we've talked about the "hot innards" solution -- an IndexReader
impl that can directly search DW's ram buffer -- but that doesn't look
necessary today, because performance of NRT is good with the simple
solution we have now.

NRT reader also gains performance by carrying over deletes in RAM.  We
should eventually do the same thing with norms & field cache.  No
reason to write to disk, then right away read again.

{quote}
* You index docs, nobody sees them, nor deletions.
* You call commit(), the docs/deletes are written down to memory (NRT case)/disk (non-NRT case). Right after calling commit() every newly reopened Reader is guaranteed to see your docs/deletes.
* Background thread does write-to-disk+sync(NRT case)/just sync (non-NRT case), and fires up the Future returned from commit(). At this point all data is guaranteed to be written and braced for a crash, ram cache or not, OS/raid controller cache or not.
{quote}

But this is not a commit, if docs/deletes are written down into RAM?
Ie, commit could return, then the machine could crash, and you've lost
changes?  Commit should go through to stable storage before returning?
Maybe I'm just missing the big picture of what you're proposing
here...

Also, you can build all this out on top of Lucene today?  Zoie is a
proof point of this.  (Actually: how does your proposal differ from
Zoie?  Maybe that'd help shed light...).

bq. I say it's better to sacrifice write guarantee. In the rare case the process/machine crashes, you can reindex last few minutes' worth of docs.

It is not that simple -- if you skip the fsync, and OS crashes/you
lose power, your index can easily become corrupt.  The resulting
CheckIndex -fix can easily need to remove large segments.

The OS's write cache makes no guarantees on the order in which the
files you've written find their way to disk.
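
To make the ordering problem concrete, here is a generic sketch of the usual defence: force each data file to stable storage first, and only then write and force the small file that references them. This is an illustration of the general technique using `java.nio`, not Lucene's actual commit code; `OrderedCommit`, the file names and the snapshot format are all made up for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

// Generic illustration of sync ordering; file names and layout are made up.
class OrderedCommit {
    // Writes one file and forces it to stable storage before returning.
    static void writeAndSync(Path file, String content) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
            channel.force(true);
        }
    }

    // Data files are synced first; the snapshot that lists them is written
    // last, so a snapshot is only created after the data it names was forced.
    static void commit(Path dir, Map<String, String> dataFiles, String snapshotName)
            throws IOException {
        for (Map.Entry<String, String> e : dataFiles.entrySet()) {
            writeAndSync(dir.resolve(e.getKey()), e.getValue());
        }
        writeAndSync(dir.resolve(snapshotName), String.join("\n", dataFiles.keySet()));
    }
}
```

Skipping the `force(true)` calls leaves the write order up to the OS, which is exactly the case where the snapshot can reach disk before the data it names.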

Another option (we've discussed this) would be journal file approach
(ie transaction log, like most DBs use).  You only have one file to
fsync, and you replay to recover.  But that'd be a big change for
Lucene, would add complexity, and can be accomplished outside of
Lucene if an app really wants to...
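
For an app that wants to try the journal approach outside of Lucene, a minimal sketch might look like the following. It assumes a simple length-prefixed record format; `Journal`, `append()` and `replay()` are hypothetical names, and what a record contains (a serialized add or delete) is left to the application.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only journal; illustrative only, not part of Lucene.
class Journal implements AutoCloseable {
    private final FileChannel channel;

    Journal(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Appends one length-prefixed record and fsyncs the single journal file.
    void append(String record) throws IOException {
        byte[] payload = record.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length);
        buf.put(payload);
        buf.flip();
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(true); // the only sync needed per logged operation
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    // Replays every complete record; a truncated record at the tail is dropped.
    static List<String> replay(Path file) throws IOException {
        List<String> records = new ArrayList<>();
        try (InputStream raw = Files.newInputStream(file);
             DataInputStream in = new DataInputStream(raw)) {
            while (true) {
                int length = in.readInt();
                byte[] payload = new byte[length];
                in.readFully(payload);
                records.add(new String(payload, StandardCharsets.UTF_8));
            }
        } catch (EOFException endOfLog) {
            // Reached the end of the journal (or a partially written record).
        }
        return records;
    }
}
```

On restart the app would call replay() and re-apply the returned operations to the index before accepting new ones; a production version would also want a checksum per record, which this sketch omits.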

Let me try turning this around: in your componentization of
SegmentReader, why does it matter who's tracking which components are
needed to make up a given SR?  In the IndexReader.open case, it's a
SegmentInfos instance (obtained by loading the segments_N file from disk).
In the NRT case, it's also a SegmentInfos instance (the one IW is
privately keeping track of and only publishing on commit).  At the
component level, creating the SegmentReader should be no different?
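
A toy model of that argument, with made-up classes (`ToySegmentInfos` and `ToyReaderFactory` are not the real Lucene types): the code that opens per-segment readers takes the segment metadata as input and does not care whether it was loaded from disk or handed over by the writer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the argument; these are not the real Lucene classes.
class ToySegmentInfos {
    final List<String> segmentNames;

    ToySegmentInfos(String... names) {
        segmentNames = Arrays.asList(names);
    }
}

class ToyReaderFactory {
    // Opening per-segment readers depends only on the segment metadata,
    // not on where that metadata came from.
    static List<String> openSegmentReaders(ToySegmentInfos infos) {
        List<String> readers = new ArrayList<>();
        for (String name : infos.segmentNames) {
            readers.add("SegmentReader(" + name + ")");
        }
        return readers;
    }
}
```

Here a `ToySegmentInfos("_0", "_1")` would play the role of the committed segments_N, and a `ToySegmentInfos("_0", "_1", "_2")` the writer's private in-memory view; both go through the same `openSegmentReaders` call.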



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter



    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789714#action_12789714 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

{quote}
> I say it's better to sacrifice write guarantee.

I don't grok why sync is the default, especially given how sketchy hardware
drivers are about obeying fsync:

{panel}
But, beware: some hardware devices may in fact cache writes even during
fsync, and return before the bits are actually on stable storage, to give the
appearance of faster performance.
{panel}
{quote}

It's unclear how often this scare-warning is true in practice (scare
warnings tend to spread very easily without concrete data); it's in
the javadocs for completeness' sake.  I expect (though have no data to
back this up...) that most OS/IO systems "out there" do properly
implement fsync.

{quote}
IMO, it should have been an option which defaults to false, to be enabled only by
users who have the expertise to ensure that fsync() is actually doing what
it advertises. But what's done is done (and Lucy will probably just do something
different.)
{quote}

I think that's a poor default (trades safety for performance), unless
Lucy eg uses a transaction log so you can concretely bound what's lost
on crash/power loss.  Or, if you go back to autocommitting I guess...

If we did this in Lucene, you can have unbounded corruption.  It's not
just the last few minutes of updates...

So, I don't think we should even offer the option to turn it off.  You
can easily subclass your FSDir impl and make sync() a no-op if you
really want to...
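
The shape of that subclass, sketched against a stand-in base class rather than the real FSDirectory (the exact sync() signature differs between Lucene versions, so check the javadocs for the release you use; `SketchDirectory` and `NoSyncDirectory` are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a directory implementation; NOT the real Lucene FSDirectory.
class SketchDirectory {
    final List<String> synced = new ArrayList<>();

    // Stand-in for the call that forces a file to stable storage.
    void sync(String fileName) {
        synced.add(fileName);
    }
}

// Subclass that trades crash safety for speed by skipping the sync.
class NoSyncDirectory extends SketchDirectory {
    @Override
    void sync(String fileName) {
        // Intentionally a no-op: after an OS crash or power loss the index
        // may be corrupt, as discussed above.
    }
}
```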

{quote}
With regard to Lucene NRT, though, turning sync() off would really help. If and
when some sort of settings class comes about, an enableSync(boolean enabled)
method seems like it would come in handy.
{quote}

You don't need to turn off sync for NRT -- that's the whole point.  It
gives you a reader without syncing the files.  Really, this is your
safety tradeoff -- it means you can commit less frequently, since the
NRT reader can search the latest updates.  But, your app has
complete control over how it wants to trade safety for performance.



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter



    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789716#action_12789716 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

bq. Zoie is a completely user-land solution which modifies no IW/IR internals and yet achieves millisecond index-to-query-visibility turnaround while keeping speedy indexing and query performance. It just keeps the RAMDir outside encapsulated in an object (an IndexingSystem) which has IndexReaders built off of both the RAMDir and the FSDir, and hides the implementation details (in fact the IW itself) from the user.

Right, one can always not use NRT and build their own layers on top.

But, Zoie has *a lot* of code to accomplish this -- the devil really is
in the details to "simply write first to a RAMDir".  This is why I'd
like Earwin to look @ Zoie and clarify his proposed approach, in
contrast...

Actually, here's a question: how quickly can Zoie turn around a
commit()?  Seems like it must take more time than Lucene, since it does
extra stuff (flush RAM buffers to disk, materialize deletes) before
even calling IW.commit.

At the end of the day, any NRT system has to trade safety for
performance (bypass the sync call in the NRT reader)....

bq. The API for this kind of thing doesn't have to be tightly coupled, and I would agree with you that it shouldn't be.

I don't consider NRT today to be a tight coupling (eg, the pending
refactoring of IW would nicely separate it out).  If we implement the
IR that searches DW's RAM buffer, then I'd agree ;)



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter



    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789905#action_12789905 ]

Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

> I think that's a poor default (trades safety for performance), unless
> Lucy eg uses a transaction log so you can concretely bound what's lost
> on crash/power loss. Or, if you go back to autocommitting I guess...

Search indexes should not be used for canonical data storage -- they should be
built *on top of* canonical data storage.  Guarding against power failure
induced corruption in a database is an imperative.  Guarding against power
failure induced corruption in a search index is a feature, not an imperative.

Users have many options for dealing with the potential for such corruption.
You can go back to your canonical data store and rebuild your index from
scratch when it happens.  In a search cluster environment, you can rsync a
known-good copy from another node.  Potentially, you might enable
fsync-before-commit and keep your own transaction log.  However, if the time
it takes to rebuild or recover an index from scratch would have caused you
unacceptable downtime, you can't possibly be operating in a
single-point-of-failure environment where a power failure could take you down
anyway -- so other recovery options are available to you.

Turning on fsync is only one step towards ensuring index integrity; other
steps involve making decisions about hard drives, RAID arrays, failover
strategies, network and off-site backups, etc, and are outside of our domain
as library authors.  We cannot meet the needs of users who need guaranteed
index integrity on our own.

For everybody else, what turning on fsync by default achieves is to make an
exceedingly rare event rarer.  That's valuable, but not essential.  My
argument is that since the search indexes should not be used for canonical
storage, and since fsync is not testably reliable and not sufficient on its
own, it's a good engineering compromise to prioritize performance.  

> If we did this in Lucene, you can have unbounded corruption. It's not
> just the last few minutes of updates...

Wasn't that a possibility under autocommit as well?   All it takes is for the
OS to finish flushing the new snapshot file to persistent storage before it
finishes flushing a segment data file needed by that snapshot, and for the
power failure to squeeze in between.

In practice, locality of reference is going to make the window very very
small, since those two pieces of data will usually get written very close to
each other on the persistent media.

I've seen a lot more messages to our user lists over the years about data
corruption caused by bugs and misconfigurations than by power failures.

But really, that's as it should be.  Ensuring data integrity to the degree
required by a database is costly -- it requires far more rigorous testing, and
far more conservative development practices.  If we accept that our indexes
must *never* go corrupt, it will retard innovation.

Of course we should work very hard to prevent index corruption.  However, I'm
much more concerned about stuff like silent omission of search results due to
overzealous, overly complex optimizations than I am about problems arising
from power failures.  When a power failure occurs, you know it -- so you get
the opportunity to fsck the disk, run checkIndex(), perform data integrity
reconciliation tests against canonical storage, and if anything fails, take
whatever recovery actions you deem necessary.

> You don't need to turn off sync for NRT - that's the whole point. It
> gives you a reader without syncing the files.

I suppose this is where Lucy and Lucene differ.  Thanks to mmap and the
near-instantaneous reader opens it has enabled, we don't need to keep a
special reader alive.  Since there's no special reader, the only way to get
data to a search process is to go through a commit.  But if we fsync on every
commit, we'll drag down indexing responsiveness.  Finishing the commit and
returning control to client code as quickly as possible is a high priority for
us.

Furthermore, I don't want us to have to write the code to support a
near-real-time reader hanging off of IndexWriter a la Lucene.  The
architectural discussions have made for very interesting reading, but the
design seems to be tricky to pull off, and implementation simplicity in core
search code is a high priority for Lucy.  It's better for Lucy to kill two
birds with one stone and concentrate on making *all* index opens fast.

> Really, this is your safety tradeoff - it means you can commit less
> frequently, since the NRT reader can search the latest updates. But, your
> app has complete control over how it wants to trade safety for
> performance.

So long as fsync is an option, the app always has complete control, regardless
of whether the default setting is fsync or no fsync.

If a Lucene app wanted to increase NRT responsiveness and throughput, and if
absolute index integrity wasn't a concern because it had been addressed
through other means (e.g. multi-node search cluster), would turning off fsync
speed things up under any of the proposed designs?


[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789971#action_12789971 ]

Jason Rutherglen commented on LUCENE-2026:
------------------------------------------

I think large scale NRT installations may eventually require a
distributed transaction log. The implementation details have yet
to be determined; however, it could potentially solve the issue of
data loss being discussed. One candidate is a combination of ZooKeeper
+ BookKeeper. I would venture to guess this could be implemented
as a part of Solr; however, we've got a lot of work to do for
Solr to be reasonably NRT efficient (see the tracking issue
SOLR-1606), and we're just starting on the ZooKeeper
implementation SOLR-1277...


[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790988#action_12790988 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

{quote}
bq. I think that's a poor default (trades safety for performance), unless Lucy eg uses a transaction log so you can concretely bound what's lost on crash/power loss. Or, if you go back to autocommitting I guess...

Search indexes should not be used for canonical data storage - they should be
built on top of canonical data storage.
{quote}

I agree with that, in theory, but I think in practice it's too
idealistic to force/expect apps to meet that ideal.

I expect for many apps it's a major cost to unexpectedly lose the
search index on power loss / OS crash.

{quote}
Users have many options for dealing with the potential for such corruption.
You can go back to your canonical data store and rebuild your index from
scratch when it happens. In a search cluster environment, you can rsync a
known-good copy from another node. Potentially, you might enable
fsync-before-commit and keep your own transaction log. However, if the time
it takes to rebuild or recover an index from scratch would have caused you
unacceptable downtime, you can't possibly be operating in a
single-point-of-failure environment where a power failure could take you down
anyway - so other recovery options are available to you.

Turning on fsync is only one step towards ensuring index integrity; other
steps involve making decisions about hard drives, RAID arrays, failover
strategies, network and off-site backups, etc, and are outside of our domain
as library authors. We cannot meet the needs of users who need guaranteed
index integrity on our own.
{quote}

Yes, high availability apps will already take their own measures to
protect the search index / recovery process, going beyond fsync.
E.g., making a hot backup of a Lucene index is now straightforward.

{quote}
For everybody else, what turning on fsync by default achieves is to make an
exceedingly rare event rarer. That's valuable, but not essential. My
argument is that since the search indexes should not be used for canonical
storage, and since fsync is not testably reliable and not sufficient on its
own, it's a good engineering compromise to prioritize performance.
{quote}

Losing power to the machine, or OS crash, or the user doing a hard
power down because OS isn't responding, I think are not actually
*that* uncommon in an end user setting.  Think of a desktop app
embedding Lucene/Lucy...

{quote}
bq. If we did this in Lucene, you can have unbounded corruption. It's not just the last few minutes of updates...

Wasn't that a possibility under autocommit as well? All it takes is for the
OS to finish flushing the new snapshot file to persistent storage before it
finishes flushing a segment data file needed by that snapshot, and for the
power failure to squeeze in between.
{quote}

Not after LUCENE-1044... autoCommit simply called commit() at certain
opportune times (after finishing big merges), which does the right thing
(I hope!).  The segments file is not written until all files it
references are sync'd.
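
The ordering described here (sync every referenced file first, only then
write the segments file) can be sketched with plain java.nio.  This is a
minimal illustration under stated assumptions, not Lucene's actual commit
code: the class name SyncThenPublish, the "pending_segments_N" temporary
name and the newline-separated manifest format are all hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Illustrative only: hypothetical names, not Lucene's commit implementation. */
final class SyncThenPublish {

    /** Writes data to a file and forces content and metadata to the storage device. */
    static void writeAndSync(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true);
        }
    }

    /** Forces an already-written file to the storage device. */
    static void sync(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.force(true);
        }
    }

    /**
     * Publishes a new generation. The manifest only becomes visible under its
     * final name after every file it references has been synced.
     */
    static Path commit(Path dir, List<String> segmentFiles, long generation)
            throws IOException {
        // 1. Make every referenced data file durable first.
        for (String name : segmentFiles) {
            sync(dir.resolve(name));
        }
        // 2. Write the manifest under a temporary name and sync it too.
        Path pending = dir.resolve("pending_segments_" + generation);
        writeAndSync(pending,
                String.join("\n", segmentFiles).getBytes(StandardCharsets.UTF_8));
        // 3. Publish with an atomic rename. Making the rename itself durable
        //    (e.g. syncing the directory) is platform-specific and omitted here.
        Path published = dir.resolve("segments_" + generation);
        Files.move(pending, published, StandardCopyOption.ATOMIC_MOVE);
        return published;
    }
}
```

With this ordering, a crash before step 3 leaves the previous segments
file as the newest complete one, so a reader never sees a manifest that
points at unsynced data.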

{quote}
In practice, locality of reference is going to make the window very very
small, since those two pieces of data will usually get written very close to
each other on the persistent media.
{quote}

Not sure about that -- it depends on how effectively the OS's write cache
"preserves" that locality.

{quote}
I've seen a lot more messages to our user lists over the years about data
corruption caused by bugs and misconfigurations than by power failures.
{quote}

I would agree, though I think it may be a sampling problem... i.e.,
people whose machines crashed and who lost the search index would
often not raise it on the list (vs. say a persistent config issue that keeps
leading to corruption).

{quote}
But really, that's as it should be. Ensuring data integrity to the degree
required by a database is costly - it requires far more rigorous testing, and
far more conservative development practices. If we accept that our indexes
must never go corrupt, it will retard innovation.
{quote}

It's not really that costly, with NRT -- you can get a searcher on the
index without paying the commit cost.  And now you can call commit
however frequently you need to.  Quickly turning around a new
searcher, and how frequently you commit, are now independent.

Also, having the app explicitly decouple these two notions keeps the
door open for future improvements.  If we force absolutely all sharing
to go through the filesystem then that limits the improvements we can
make to NRT.
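
To make the decoupling concrete, here is a toy in-memory model.  It is a
sketch only: ToyWriter and its methods are hypothetical names, not
Lucene's API, and commit() merely stands in for the expensive
sync-and-write-segments-file step.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model with hypothetical names; this is not Lucene's IndexWriter. */
final class ToyWriter {
    // Documents that would survive a crash (covered by the last commit).
    private final List<String> durable = new ArrayList<>();
    // Documents added since the last commit; lost if the machine crashes.
    private final List<String> buffered = new ArrayList<>();

    void add(String doc) {
        buffered.add(doc);
    }

    /** Near-real-time view: sees uncommitted documents without any sync. */
    List<String> openNearRealTimeView() {
        List<String> view = new ArrayList<>(durable);
        view.addAll(buffered);
        return Collections.unmodifiableList(view);
    }

    /** Stands in for the costly durable commit. */
    void commit() {
        durable.addAll(buffered);
        buffered.clear();
    }

    /** Simulates power loss: only committed documents remain. */
    void crash() {
        buffered.clear();
    }
}
```

In this model, how often the app calls openNearRealTimeView() controls
search freshness, while how often it calls commit() controls how much
can be lost in a crash; the two can be tuned separately.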

{quote}
Of course we should work very hard to prevent index corruption. However, I'm
much more concerned about stuff like silent omission of search results due to
overzealous, overly complex optimizations than I am about problems arising
from power failures. When a power failure occurs, you know it - so you get
the opportunity to fsck the disk, run checkIndex(), perform data integrity
reconciliation tests against canonical storage, and if anything fails, take
whatever recovery actions you deem necessary.
{quote}

Well... I think search performance is important, and we should pursue it
even if we risk bugs.

{quote}
bq. You don't need to turn off sync for NRT - that's the whole point. It gives you a reader without syncing the files.

I suppose this is where Lucy and Lucene differ. Thanks to mmap and the
near-instantaneous reader opens it has enabled, we don't need to keep a
special reader alive. Since there's no special reader, the only way to get
data to a search process is to go through a commit. But if we fsync on every
commit, we'll drag down indexing responsiveness. Finishing the commit and
returning control to client code as quickly as possible is a high priority for
us.
{quote}

The NRT reader isn't that special -- the only differences are that 1) it
loaded the segments_N "file" from IW instead of the filesystem, and 2)
it uses a reader pool to "share" the underlying SegmentReaders with
other places that have loaded them.  I guess, if Lucy won't allow
this, then, yes, forcing a commit in order to reopen is very costly,
and so sacrificing safety is a tradeoff you have to make.

Alternatively, you could keep the notion of "flush" (an unsafe commit)
alive?  You write the segments file, but make no effort to ensure its
durability (and also preserve the last "true" commit).  Then a normal
IR.reopen suffices...

{quote}
Furthermore, I don't want us to have to write the code to support a
near-real-time reader hanging off of IndexWriter a la Lucene. The
architectural discussions have made for very interesting reading, but the
design seems to be tricky to pull off, and implementation simplicity in core
search code is a high priority for Lucy. It's better for Lucy to kill two
birds with one stone and concentrate on making all index opens fast.
{quote}

But shouldn't you at least give an option for index durability?  Even
if we disagree about the default?

{quote}
bq. Really, this is your safety tradeoff - it means you can commit less frequently, since the NRT reader can search the latest updates. But, your app has complete control over how it wants to trade safety for performance.

So long as fsync is an option, the app always has complete control,
regardless of whether the default setting is fsync or no fsync.
{quote}

Well it is an "option" in Lucene -- "it's just software"  ;)  I don't
want to make it easy to be unsafe.  Lucene shouldn't sacrifice safety
of the index... and with NRT there's no need to make that tradeoff.

{quote}
If a Lucene app wanted to increase NRT responsiveness and throughput, and if
absolute index integrity wasn't a concern because it had been addressed
through other means (e.g. multi-node search cluster), would turning off fsync
speed things up under any of the proposed designs?
{quote}

Yes, turning off fsync would speed things up -- you could fall back to
simple reopen and get good performance (NRT should still be faster
since the readers are pooled).  The "use RAMDir on top of Lucene"
designs would be helped less since fsync is a noop in RAMDir.



[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter


    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791549#action_12791549 ]

Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

>> Wasn't that a possibility under autocommit as well? All it takes is for the
>> OS to finish flushing the new snapshot file to persistent storage before it
>> finishes flushing a segment data file needed by that snapshot, and for the
>> power failure to squeeze in between.
>
> Not after LUCENE-1044... autoCommit simply called commit() at certain
> opportune times (after finishing big merges), which does the right thing (I
> hope!). The segments file is not written until all files it references are
> sync'd.

FWIW, autoCommit doesn't really have a place in Lucy's
one-segment-per-indexing-session model.

Revisiting the LUCENE-1044 threads, one passage stood out:

{panel}
    http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321

    This is why in a db system, the only file that is sync'd is the log
    file - all other files can be made "in sync" from the log file - and
    this file is normally striped for optimum write performance. Some
    systems have special "log file drives" (some even solid state, or
    battery backed ram) to aid the performance.
{panel}

The fact that we have to sync all files instead of just one seems sub-optimal.

Yet Lucene is not well set up to maintain a transaction log.  The very act of
adding a document to Lucene is inherently lossy even if all fields are stored,
because doc boost is not preserved.

> Also, having the app explicitly decouple these two notions keeps the
> door open for future improvements. If we force absolutely all sharing
> to go through the filesystem then that limits the improvements we can
> make to NRT.

However, Lucy has much more to gain going through the file system than Lucene
does, because we don't necessarily incur JVM startup costs when launching a
new process.  The Lucene approach to NRT -- specialized reader hanging off of
writer -- is constrained to a single process.  The Lucy approach -- fast index
opens enabled by mmap-friendly index formats -- is not.

The two approaches aren't mutually exclusive.  It will be possible to augment
Lucy with a specialized index reader within a single process.  However, A)
there seems to be a lot of disagreement about just how to integrate that
reader, and B) there seem to be ways to bolt that functionality on top of the
existing classes.  Under those circumstances, I think it makes more sense to
keep that feature external for now.

> Alternatively, you could keep the notion "flush" (an unsafe commit)
> alive? You write the segments file, but make no effort to ensure its
> durability (and also preserve the last "true" commit). Then a normal
> IR.reopen suffices...

That sounds promising.  The semantics would differ from those of Lucene's
flush(), which doesn't make changes visible.

We could implement this by somehow marking a "committed" snapshot and a
"flushed" snapshot differently, either by adding an "fsync" property to the
snapshot file that would be false after a flush() but true after a commit(),
or by encoding the property within the snapshot filename.  The file purger
would have to ensure that all index files referenced by either the last
committed snapshot or the last flushed snapshot were off limits.  A rollback()
would zap all changes since the last commit().  

Such a scheme allows the top level app to avoid the costs of fsync while
maintaining its own transaction log -- perhaps with the optimizations
suggested above (separate disk, SSD, etc).
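
The bookkeeping for such a scheme might look roughly like the following
sketch.  It assumes snapshots can be modelled as sets of file names; the
class SnapshotTracker and its methods are hypothetical and are not taken
from Lucy or Lucene.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of flushed vs. committed snapshots; not Lucy code. */
final class SnapshotTracker {
    // Files referenced by the last durable (fsync'd) snapshot.
    private Set<String> committed = new HashSet<>();
    // Files referenced by the last visible snapshot, which may not be durable.
    private Set<String> flushed = new HashSet<>();

    /** Makes a new snapshot visible to readers without any durability promise. */
    void flush(Set<String> files) {
        flushed = new HashSet<>(files);
    }

    /** Records a durable snapshot; a real system would fsync before this point. */
    void commit(Set<String> files) {
        committed = new HashSet<>(files);
        flushed = new HashSet<>(files);
    }

    /** Discards all changes since the last commit. */
    void rollback() {
        flushed = new HashSet<>(committed);
    }

    /** Files the purger must leave alone: those in either snapshot. */
    Set<String> protectedFiles() {
        Set<String> keep = new HashSet<>(committed);
        keep.addAll(flushed);
        return keep;
    }

    /** Files present in the index directory that the purger may delete. */
    Set<String> purgeable(Set<String> allFiles) {
        Set<String> doomed = new HashSet<>(allFiles);
        doomed.removeAll(protectedFiles());
        return doomed;
    }
}
```

Until rollback() is called, files belonging only to the flushed snapshot
stay protected; after a rollback they become purgeable, which matches the
"zap all changes since the last commit()" behaviour described above.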
