[jira] Created: (LUCENE-1458) Further steps towards flexible indexing
---------------------------------------

Key: LUCENE-1458
URL: https://issues.apache.org/jira/browse/LUCENE-1458
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1458.patch

I attached a very rough checkpoint of my current patch, to get early
feedback. All tests pass, though back compat tests don't pass due to
changes to package-private APIs plus certain bugs in tests that
happened to work (e.g. calling TermPositions.nextPosition() too many
times, which the new API asserts against).

[Aside: I think, when we commit changes to package-private APIs such
that back-compat tests don't pass, we could go back, make a branch on
the back-compat tag, commit changes to the tests to use the new
package-private APIs on that branch, then fix the nightly build to use
the tip of that branch?]

There's still plenty to do before this is committable! This is a
rather large change:

* Switches to a new, more efficient terms dict format. This still
  uses tii/tis files, but the tii only stores term & long offset
  (not a TermInfo). At seek points, tis encodes term & freq/prox
  offsets absolutely instead of with deltas. Also, tis/tii
  are structured by field, so we don't have to record the field
  number in every term.

  On the first 1M docs of Wikipedia, the tii file is 36% smaller
  (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB ->
  68.5 MB).

  RAM usage when loading the terms dict index is significantly less
  since we only load an array of offsets and an array of String (no
  more TermInfo array). It should be faster to init too.

  This part is basically done.

* Introduces a modular reader codec that strongly decouples the terms
  dict from the docs/positions readers. E.g. there is no more TermInfo
  used when reading the new format.

  There's nice symmetry now between reading & writing in the codec
  chain -- the current docs/prox format is captured in:
  {code}
  FormatPostingsTermsDictWriter/Reader
  FormatPostingsDocsWriter/Reader (.frq file) and
  FormatPostingsPositionsWriter/Reader (.prx file).
  {code}
  This part is basically done.

* Introduces a new "flex" API for iterating through the fields,
  terms, docs and positions:
  {code}
  FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
  {code}
  This replaces TermEnum/Docs/Positions. SegmentReader emulates the
  old API on top of the new API to keep back-compat.

Next steps:

* Plug in new codecs (pulsing, pfor) to exercise the modularity /
  fix any hidden assumptions.

* Expose the new API out of IndexReader, deprecate the old API but
  emulate it on top of the new one, and switch all core/contrib users
  to the new API.

* Maybe switch to AttributeSources as the base class for TermsEnum,
  DocsEnum, PostingsEnum -- this would give readers API flexibility
  (not just index-file-format flexibility). E.g. if someone wanted
  to store payload at the term-doc level instead of the
  term-doc-position level, you could just add a new attribute.

* Test performance & iterate.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...
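[Editorial sketch] To make the "absolute at seek points, deltas elsewhere" idea concrete, here is a minimal Java sketch. It is not taken from the patch: the class `SeekPointCodecSketch`, its `Entry` record layout, and the use of shared-prefix suffixes for term text are all assumptions made for illustration, and everything is held in memory rather than written to tis/tii files.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; hypothetical names, not the patch's actual on-disk format.
class SeekPointCodecSketch {

    /** One terms-dict record; at a seek point prefixLen is 0 and offset is absolute. */
    static final class Entry {
        final boolean seekPoint;
        final int prefixLen;  // chars shared with the previous term
        final String suffix;  // remaining chars of this term
        final long offset;    // absolute at seek points, otherwise a delta

        Entry(boolean seekPoint, int prefixLen, String suffix, long offset) {
            this.seekPoint = seekPoint;
            this.prefixLen = prefixLen;
            this.suffix = suffix;
            this.offset = offset;
        }
    }

    /** Encodes sorted terms and ascending offsets, with a seek point every `interval` terms. */
    static List<Entry> encode(String[] terms, long[] offsets, int interval) {
        List<Entry> out = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            if (i % interval == 0) {
                // Seek point: full term text and absolute offset, decodable on its own.
                out.add(new Entry(true, 0, terms[i], offsets[i]));
            } else {
                int shared = sharedPrefix(terms[i - 1], terms[i]);
                out.add(new Entry(false, shared, terms[i].substring(shared),
                                  offsets[i] - offsets[i - 1]));
            }
        }
        return out;
    }

    /** Decodes the term texts of records [start, end); start must be a seek point. */
    static List<String> decodeTerms(List<Entry> entries, int start, int end) {
        if (!entries.get(start).seekPoint) {
            throw new IllegalArgumentException("decoding must begin at a seek point");
        }
        List<String> terms = new ArrayList<>();
        String prev = "";
        for (int i = start; i < end; i++) {
            Entry e = entries.get(i);
            prev = prev.substring(0, e.prefixLen) + e.suffix;
            terms.add(prev);
        }
        return terms;
    }

    /** Rebuilds the absolute offset of record `index`, starting from seek point `start`. */
    static long decodeOffset(List<Entry> entries, int start, int index) {
        long offset = entries.get(start).offset;
        for (int i = start + 1; i <= index; i++) {
            Entry e = entries.get(i);
            offset = e.seekPoint ? e.offset : offset + e.offset;
        }
        return offset;
    }

    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}
```

Because each seek-point record carries everything needed to start decoding, an index only has to remember the term and the position of that record, which is the property the description relies on for the smaller tii file.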
|
|
[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1458:
---------------------------------------

Attachment: LUCENE-1458.patch
|
|
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648613#action_12648613 ]

Mark Miller commented on LUCENE-1458:
-------------------------------------

Hmmm...I think something is missing - FormatPostingsPositionsReader?
|
|
[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1458:
---------------------------------------

Attachment: LUCENE-1458.patch

Woops, sorry... I was missing a bunch of files. Try this one?
|
|
Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

Michael,

Can you describe a bit more about why the term dictionary index is no longer required?

Jason

On Tue, Nov 18, 2008 at 7:41 AM, Michael McCandless (JIRA) <jira@...> wrote:
|
|
|
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648727#action_12648727 ]

Marvin Humphrey commented on LUCENE-1458:
-----------------------------------------

The work on streamlining the term dictionary is excellent, but perhaps we can do better still. Can we design a format that allows us to rely upon the operating system's virtual memory and avoid caching in process memory altogether?

Say that we break up the index file into fixed-width blocks of 1024 bytes. Most blocks would start with a complete term/pointer pairing, though at the top of each block, we'd need a status byte indicating whether the block contains a continuation from the previous block in order to handle cases where term length exceeds the block size.

For Lucy/KinoSearch our plan would be to mmap() on the file, but accessing it as a stream would work, too. Seeking around the index term dictionary would involve seeking the stream to multiples of the block size and performing binary search, rather than performing binary search on an array of cached terms. There would be increased processor overhead; my guess is that since the second stage of a term dictionary seek -- scanning through the primary term dictionary -- involves comparatively more processor power than this, the increased costs would be acceptable.

Advantages:

* Multiple forks can all share the same system buffer, reducing per-process memory footprint.
* The cost to read in the index term dictionary during IndexReader startup drops to zero.
* The OS caches for the index term dictionaries can either be allowed to warm naturally, or can be nudged into virtual memory via e.g. "cat /path/to/index/*.tis > /dev/null".
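[Editorial sketch] The block-based lookup proposed above might look roughly like the following in Java. This is only a sketch under stated assumptions: the record layout (status byte, 2-byte term length, UTF-8 bytes, 8-byte pointer) and the names `BlockIndexSketch`, `putBlockHead`, and `findBlock` are hypothetical, continuation blocks are detected but not handled, and terms are compared with `String.compareTo`. The `ByteBuffer` could equally come from `FileChannel.map`, which is how an mmap()-style reader would obtain it in Java.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; hypothetical layout and names.
class BlockIndexSketch {
    static final byte COMPLETE = 0;      // block starts with a whole term/pointer pair
    static final byte CONTINUATION = 1;  // block continues an oversized term (not handled here)

    /** Writes the leading entry of block `block`: status byte, term length, term bytes, pointer. */
    static void putBlockHead(ByteBuffer buf, int blockSize, int block, String term, long pointer) {
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        int pos = block * blockSize;
        buf.put(pos, COMPLETE);
        buf.putShort(pos + 1, (short) utf8.length);
        for (int i = 0; i < utf8.length; i++) {
            buf.put(pos + 3 + i, utf8[i]);
        }
        buf.putLong(pos + 3 + utf8.length, pointer);
    }

    /** Reads the term that starts block `block`. */
    static String firstTerm(ByteBuffer buf, int blockSize, int block) {
        int pos = block * blockSize;
        if (buf.get(pos) != COMPLETE) {
            throw new UnsupportedOperationException("continuation blocks are not handled in this sketch");
        }
        int len = buf.getShort(pos + 1);
        byte[] utf8 = new byte[len];
        for (int i = 0; i < len; i++) {
            utf8[i] = buf.get(pos + 3 + i);
        }
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Reads the pointer stored right after the term that starts block `block`. */
    static long firstPointer(ByteBuffer buf, int blockSize, int block) {
        int pos = block * blockSize;
        int len = buf.getShort(pos + 1);
        return buf.getLong(pos + 3 + len);
    }

    /** Returns the last block whose first term is <= target, or -1 if target sorts before block 0. */
    static int findBlock(ByteBuffer buf, int blockSize, String target) {
        int lo = 0;
        int hi = buf.capacity() / blockSize - 1;
        int found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            // Only the head of each probed block is touched, at a multiple of the block size.
            if (firstTerm(buf, blockSize, mid).compareTo(target) <= 0) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found;
    }
}
```

Nothing is loaded up front here: each probe reads a few bytes at a block boundary, so whether the data is resident is left to the OS page cache, which is the trade-off the comment describes.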
|
|
Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

It's not that it isn't required -- it's just that it stores less info than before.

I changed the _X.tis format such that at each seekable point (every 128 terms by default), everything is written as absolutes (term text, freq & prox offset). This means the _X.tii file only has to store the indexed term & offset into the _X.tis file. Then all we need to load into RAM are two column-stride arrays: the long offset (into the _X.tis file) and the terms.

Also, in RAM I store the terms as String[] within a per-field class, instead of Term[], which saves the object & 2 pointer overhead.

It's similar to how video muxers store their index into key frames, where a key frame is an "absolute" frame that can be decoded without seeing prior frames.

I think RAM savings should be at least 50% for "typical" terms (avg 10 chars say). Longer avg term length will see less savings. But, this savings is only your term index, so if your tii file is smallish net/net it won't reduce RAM usage that much.

When seeking is done, we look in the index to find the nearest spot in _X.tis before the term we are looking for, jump there, read the absolutes for that next() term, and then read deltas to continue scanning.

This is coded up in the FormatPostingsTermsDictWriter/Reader classes.

Mike

Jason Rutherglen wrote:

> Michael,
>
> Can you describe a bit more about why the term dictionary index is
> no longer required?
>
> Jason
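[Editorial sketch] The seek procedure described in this reply (binary search over the in-RAM index, jump into the tis, then scan forward) can be sketched as below. This is an illustration, not the code in FormatPostingsTermsDictWriter/Reader: the class `TermsIndexSketch` is hypothetical, the tis file is stood in for by a sorted in-memory `String[]`, and the "offset" is simply a position in that array.

```java
// Illustrative sketch only; hypothetical names, with an in-memory array standing in for _X.tis.
class TermsIndexSketch {
    final String[] indexTerms;  // every interval-th term, as loaded from the (simulated) tii
    final long[] tisOffsets;    // where each indexed term's absolute record starts in the tis
    final String[] tis;         // stand-in for the full sorted terms dict on disk

    TermsIndexSketch(String[] sortedTerms, int interval) {
        tis = sortedTerms;
        int n = (sortedTerms.length + interval - 1) / interval;
        indexTerms = new String[n];
        tisOffsets = new long[n];
        for (int i = 0; i < n; i++) {
            indexTerms[i] = sortedTerms[i * interval];
            tisOffsets[i] = (long) i * interval;
        }
    }

    /** Returns the position of `target` in the terms dict, or -1 if it is absent. */
    long seek(String target) {
        // Step 1: binary search the two parallel arrays for the last indexed term <= target.
        int lo = 0;
        int hi = indexTerms.length - 1;
        int slot = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexTerms[mid].compareTo(target) <= 0) {
                slot = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (slot < 0) {
            return -1;  // target sorts before the first term
        }
        // Step 2: jump to that seek point and scan forward until we hit or pass the target.
        for (long pos = tisOffsets[slot]; pos < tis.length; pos++) {
            int cmp = tis[(int) pos].compareTo(target);
            if (cmp == 0) return pos;
            if (cmp > 0) break;
        }
        return -1;
    }
}
```

With an interval of 128, the scan in step 2 touches at most 128 records, while the RAM cost is one String and one long per 128 terms.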
|
|
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648739#action_12648739 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

bq. Can we design a format that allows us rely upon the operating system's virtual memory and avoid caching in process memory altogether?

Interesting! I've been wondering what you're up to over on KS, Marvin :)

I'm not sure it'll be a win in practice: I'm not sure I'd trust the OS's IO cache to "make the right decisions" about what to cache. Plus during that binary search the IO system is loading whole pages into the IO cache, even though you'll only peek at the first few bytes of each.

We could also explore something in-between, e.g. it'd be nice to genericize MultiLevelSkipListWriter so that it could index arbitrary files, then we could use that to index the terms dict. You could choose to spend dedicated process RAM on the higher levels of the skip tree, and then tentatively trust IO cache for the lower levels.

I'd like to eventually make the TermsDict index pluggable so one could swap in different indexers like this (it's not now).
|
|
[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1458:
---------------------------------------

Attachment: LUCENE-1458.patch

[Attached patch]

To test whether the new pluggable codec approach is flexible enough, I
coded up "pulsing" (described in detail in
http://citeseer.ist.psu.edu/cutting90optimizations.html), where
freq/prox info is inlined into the terms dict if the term freq is < N.

It was wonderfully simple :) I just had to create a reader & a writer,
and then switch the places that read (SegmentReader) and write
(SegmentMerger, FreqProxTermsWriter) to use the new pulsing codec
instead of the default one.

The pulsing codec can "wrap" any other codec, ie, when a term is
written, if the term's freq is < N, then it's inlined into the terms
dict with the pulsing writer, else it's fed to the other codec for it
to do whatever it normally would. The two codecs are strongly
decoupled, so we can mix & match pulsing with other codecs like pfor.

All tests pass with this pulsing codec.

As a quick test I indexed the first 1M docs from Wikipedia, with N=2
(ie terms that occur in only one document are inlined into the terms
dict). 5.4M terms get inlined (only 1 doc) and 2.2M terms are not
(> 1 doc). The final size of the index (after optimizing) was a bit
smaller with pulsing (1120 MB vs 1131 MB).
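The wrapping behaviour described in the message above (inline the postings when the term's freq is < N, otherwise hand the term to the other codec) can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `PostingsSink`, `RecordingSink` and `PulsingSink` are hypothetical names that do not come from the patch, the "terms dict" is modelled as an in-memory map, and "freq" is taken to mean the number of documents containing the term, as in the N=2 test.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical postings sink; NOT the writer interface from the patch. */
interface PostingsSink {
    void writeTerm(String term, int[] docIds);
}

/** Stand-in for the wrapped codec (e.g. the default one): records what it receives. */
class RecordingSink implements PostingsSink {
    final List<String> terms = new ArrayList<>();

    @Override
    public void writeTerm(String term, int[] docIds) {
        terms.add(term);
    }
}

/** Inlines postings for rare terms; hands everything else to the wrapped sink. */
class PulsingSink implements PostingsSink {
    private final int cutoff;           // the "N" from the message
    private final PostingsSink wrapped; // any other codec
    // Stands in for the terms dict: term -> postings stored inline with the term.
    final Map<String, int[]> inlined = new LinkedHashMap<>();

    PulsingSink(int cutoff, PostingsSink wrapped) {
        this.cutoff = cutoff;
        this.wrapped = wrapped;
    }

    @Override
    public void writeTerm(String term, int[] docIds) {
        if (docIds.length < cutoff) {
            inlined.put(term, docIds.clone()); // freq < N: inline into the terms dict
        } else {
            wrapped.writeTerm(term, docIds);   // otherwise the wrapped codec decides
        }
    }
}
```

With `new PulsingSink(2, new RecordingSink())`, a term seen in one document ends up in `inlined`, while a term seen in several documents reaches the wrapped sink untouched, which is the decoupling the message relies on for mixing pulsing with other codecs.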
|
|
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648781#action_12648781 ]

Michael Busch commented on LUCENE-1458:
---------------------------------------

I'll look into this patch soon. Just wanted to say: I'm really excited about the progress here, this is cool stuff! Great job...
|
|
Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

Nice! I'm looking at using PForDelta in creating the tag index type of system. Do you think there is an elegant way to add realtime updates to individual fields using the current (or future) flexible indexing API?

On Tue, Nov 18, 2008 at 2:11 PM, Michael McCandless (JIRA) <jira@...> wrote:
|
|
|
Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

On a side note (I have not looked at the flexible indexing API enough to know whether there is an equivalent): are we moving to something like MG4J's MutableString (http://mg4j.dsi.unimi.it/docs/it/unimi/dsi/mg4j/util/MutableString.html) instead of java.lang.String objects?

On Tue, Nov 18, 2008 at 2:33 AM, Michael McCandless (JIRA) <jira@...> wrote:
|
|
|
Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

> it'd be nice to genericize MultiLevelSkipListWriter so that it could index arbitrary files

+1 on this idea. Using skip lists for the term index would be an improvement.

On Tue, Nov 18, 2008 at 12:27 PM, Michael McCandless (JIRA) <jira@...> wrote:
|
|
|
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648835#action_12648835 ]

Marvin Humphrey commented on LUCENE-1458:
-----------------------------------------

> I'm not sure I'd trust the OS's IO cache to "make the right decisions" about what to cache.

In KS and Lucy, at least, we're focused on optimizing for the use case of dedicated search clusters where each box has enough RAM to fit the entire index/shard -- in which case we won't have to worry about the OS swapping out those pages.

I suspect that in many circumstances the term dictionary would be a hot file even if RAM were running short, but I don't think it's important to worry about maxing out performance on such systems -- if the term dictionary isn't hot the posting list files are definitely not hot and search-time responsiveness is already compromised.

In other words...

* I trust the OS to do a decent enough job on underpowered systems.
* High-powered systems should strive to avoid swapping entirely. To aid in that endeavor, we minimize per-process RAM consumption by maximizing our use of mmap and treating the system IO cache backing buffers as interprocess shared memory.

More on designing with modern virtual memory in mind at <http://varnish.projects.linpro.no/wiki/ArchitectNotes>.

> Plus during that binary search the IO system is loading whole pages into
> the IO cache, even though you'll only peek at the first few bytes of each.

I'd originally been thinking of mapping only the term dictionary index files. Those are pretty small, and the file itself occupies fewer bytes than the decompressed array of term/pointer pairs. Even better if you have several search app forks and they're all sharing the same memory mapped system IO buffer.

But hey, we can simplify even further! How about dispensing with the index file? We can just divide the main dictionary file into blocks and binary search on that. Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case.

> We could also explore something in-between, eg it'd be nice to
> genericize MultiLevelSkipListWriter so that it could index arbitrary
> files, then we could use that to index the terms dict. You could
> choose to spend dedicated process RAM on the higher levels of the skip
> tree, and then tentatively trust IO cache for the lower levels.

That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks. It's also very complicated, which of course bothers me more than it bothers you. ;) So I imagine we'll choose different paths.

> I'd like to eventually make the TermsDict index pluggable so one could
> swap in different indexers like this (it's not now).

If we treat the term dictionary as a black box, it has to accept a term and return... a blob, I guess. Whatever calls the lookup needs to know how to handle that blob.
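The "divide the main dictionary file into blocks and binary search on that" idea from the comment above can be illustrated with a small Java sketch. Everything concrete here is an assumption made for the example, not the KS/Lucy file format: blocks are a fixed `BLOCK_SIZE` bytes, each block starts with a 2-byte length followed by the UTF-8 bytes of its first term, and a heap `ByteBuffer` stands in for a memory-mapped file. Terms are compared with `String.compareTo`, which is fine for the ASCII demo data but is not a statement about the real sort order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Binary search over fixed-size blocks, with no separate index file. */
class BlockedTermDict {
    // Hypothetical block size; a real format would more likely use a page-sized block.
    static final int BLOCK_SIZE = 32;

    /** Demo builder: one block per term, each block holding only its first term. */
    static ByteBuffer build(String[] sortedFirstTerms) {
        ByteBuffer buf = ByteBuffer.allocate(sortedFirstTerms.length * BLOCK_SIZE);
        for (int i = 0; i < sortedFirstTerms.length; i++) {
            byte[] bytes = sortedFirstTerms[i].getBytes(StandardCharsets.UTF_8);
            buf.position(i * BLOCK_SIZE);      // terms must fit in BLOCK_SIZE - 2 bytes
            buf.putShort((short) bytes.length);
            buf.put(bytes);
        }
        return buf;
    }

    /** Reads the first term of a block using absolute gets only. */
    static String firstTerm(ByteBuffer buf, int block) {
        int base = block * BLOCK_SIZE;
        int len = buf.getShort(base);
        byte[] bytes = new byte[len];
        for (int i = 0; i < len; i++) {
            bytes[i] = buf.get(base + 2 + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Returns the last block whose first term is <= target, or -1 if there is none. */
    static int findBlock(ByteBuffer buf, String target) {
        int lo = 0;
        int hi = buf.capacity() / BLOCK_SIZE - 1;
        int answer = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (firstTerm(buf, mid).compareTo(target) <= 0) {
                answer = mid;   // candidate; a later block may also qualify
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return answer;
    }
}
```

A caller would then scan inside the returned block for the exact term. Note that every probe of the search touches a different block, so each check may pull a separate page into the IO cache; that is the per-check cost questioned in the quoted text.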
|
|
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648839#action_12648839 ]

Michael Busch commented on LUCENE-1458:
---------------------------------------

{quote}
We could also explore something in-between, eg it'd be nice to genericize MultiLevelSkipListWriter so that it could index arbitrary files, then we could use that to index the terms dict.
{quote}

Hmm, +1 for generalizing the MultiLevelSkipListWriter/Reader so that we can re-use it for different (custom) posting-list formats easily.

However, I'm not so sure if it's the right approach for a dictionary. A skip list is optimized for skipping forward (as the name says), so it is excellent for posting lists, which are always read from "left to right". However, in the term dictionary you do a binary search for the lookup term. So something like a B+Tree would probably work better. Then you can decide, similar to the MultiLevelSkipListWriter, how many of the upper levels you want to keep in memory and control memory consumption.
|
|
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648971#action_12648971 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

bq. So something like a B+Tree would probably work better.

I agree, btree is a better fit, though we don't need insertion & deletion operations since each segment is write once.
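Because each segment is write once, a B+Tree-like index over the terms never needs insertion or deletion and can be built bottom-up in a single pass. The Java sketch below is one way to picture that, under assumptions of my own: `StaticTermIndex` is a hypothetical name, each upper level simply samples every `fanout`-th entry of the level below, and all levels live in arrays. In a real reader only the upper levels would be held in RAM and the lower ones read from disk, which the sketch does not model.

```java
import java.util.ArrayList;
import java.util.List;

/** Write-once multi-level index over sorted terms (a static, B+Tree-like structure). */
class StaticTermIndex {
    // levels.get(0) is the leaf level (all terms); higher indexes are sparser.
    private final List<String[]> levels = new ArrayList<>();
    private final int fanout;

    StaticTermIndex(String[] sortedTerms, int fanout) {
        if (fanout < 2) {
            throw new IllegalArgumentException("fanout must be at least 2");
        }
        this.fanout = fanout;
        String[] level = sortedTerms;
        levels.add(level);
        // Build parents by sampling until the top level fits in one node.
        while (level.length > fanout) {
            String[] parent = new String[(level.length + fanout - 1) / fanout];
            for (int i = 0; i < parent.length; i++) {
                parent[i] = level[i * fanout];
            }
            levels.add(parent);
            level = parent;
        }
    }

    int depth() {
        return levels.size();
    }

    /** Returns the leaf position of the last term <= target, or -1 if there is none. */
    int floor(String target) {
        int start = 0;
        for (int k = levels.size() - 1; k >= 0; k--) {
            String[] level = levels.get(k);
            int end = Math.min(level.length, start + fanout);
            int pos = -1;
            // Scan one node: at most `fanout` entries per level.
            for (int i = start; i < end && level[i].compareTo(target) <= 0; i++) {
                pos = i;
            }
            if (pos < 0) {
                return -1; // target sorts before the smallest term
            }
            // Entry pos at level k covers entries [pos*fanout, (pos+1)*fanout) one level down.
            start = (k == 0) ? pos : pos * fanout;
        }
        return start;
    }
}
```

Choosing how many of the entries in `levels` to keep resident is the memory knob mentioned in the discussion: more resident levels mean fewer disk reads per lookup.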
|
|
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649028#action_12649028 ] Michael McCandless commented on LUCENE-1458: -------------------------------------------- {quote} In KS and Lucy, at least, we're focused on optimizing for the use case of dedicated search clusters where each box has enough RAM to fit the entire index/shard - in which case we won't have to worry about the OS swapping out those pages. I suspect that in many circumstances the term dictionary would be a hot file even if RAM were running short, but I don't think it's important to worry about maxing out performance on such systems - if the term dictionary isn't hot the posting list files are definitely not hot and search-time responsiveness is already compromised. In other words... * I trust the OS to do a decent enough job on underpowered systems. * High-powered systems should strive to avoid swapping entirely. To aid in that endeavor, we minimize per-process RAM consumption by maximizing our use of mmap and treating the system IO cache backing buffers as interprocess shared memory. {quote} These are the two extremes, but, I think most common are all the apps in between. Take a large Jira instance, where the app itself is also consuming alot of RAM, doing alot of its own IO, etc., where perhaps searching is done infrequently enough relative to other operations that the OS may no longer think the pages you hit for the terms index are hot enough to keep around. bq. More on designing with modern virtual memory in mind at <http://varnish.projects.linpro.no/wiki/ArchitectNotes>. This is a good read, but I find it overly trusting of VM. How can the VM system possibly make good decisions about what to swap out? It can't know if a page is being used for terms dict index, terms dict, norms, stored fields, postings. 
LRU is not a good policy, because some pages (terms index) are far far more costly to miss than others. From Java we have even more ridiculous problems: sometimes the OS swaps out garbage... and then massive swapping takes place when GC runs, swapping back in the garbage only to then throw it away. Ugh! I think we need to aim for *consistency*: a given search should not suddenly take 10 seconds because the OS decided to swap out a few critical structures (like the term index). Unfortunately we can't really achieve that today, especially from Java. I've seen my desktop OS (Mac OS X 10.5.5, based on FreeBSD) make stupid VM decisions: if I run something that does a single-pass through many GB of on-disk data (eg re-encoding a video), it then swaps out the vast majority of my apps even though I have 6 GB RAM. I hit tons (many seconds) of swapping just switching back to my mail client. It's infuriating. I've seen Linux do the same thing, but at least Linux let's you tune this behavior ("swappiness"); I had to disable swapping entirely on my desktop. Similarly, when a BG merge is burning through data, or say backup kicks off and moves many GB, or the simple act of iterating through a big postings list, the OS will gleefully evict my terms index or norms in order to populate its IO cache with data it will need again for a very long time. I bet the VM system fails to show graceful degradation: if I don't have enough RAM to hold my entire index, then walking through postings lists will evict my terms index and norms, making all searches slower. In the ideal world, an IndexReader would be told how much RAM to use. It would spend that RAM wisely, eg first on the terms index, second on norms, third maybe on select column-stride fields, etc. It would pin these pages so the OS couldn't swap them out (can't do this from java... though as a workaround we could use a silly thread). 
Or, if the OS found itself tight on RAM, it would ask the app to free things up instead of blindly picking pages to swap out, which does not happen today. From Java we could try using WeakReference but I fear the communication from the OS -> JRE is too weak. IE I'd want my WeakReference cleared only when the OS is threatening to swap out my data structure. {quote} > Plus during that binary search the IO system is loading whole pages into > the IO cache, even though you'll only peak at the first few bytes of each. I'd originally been thinking of mapping only the term dictionary index files. Those are pretty small, and the file itself occupies fewer bytes than the decompressed array of term/pointer pairs. Even better if you have several search app forks and they're all sharing the same memory mapped system IO buffer. But hey, we can simplify even further! How about dispensing with the index file? We can just divide the main dictionary file into blocks and binary search on that. {quote} I'm not convinced this'll be a win in practice. You are now paying an even higher overhead cost for each "check" of your binary search, especially with something like pulsing which inlines more stuff into the terms dict. I agree it's simpler, but I think that's trumped by the performance hit. In Lucene java, the concurrency model we are aiming for is a single JVM sharing a single instance of IndexReader. I do agree, if fork() is the basis of your concurrency model then sharing pages becomes critical. However, modern OSs implement copy-on-write sharing of VM pages after a fork, so that's another good path to sharing? bq. Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case. Have you tried any actual tests swapping these approaches in as your terms index impl? 
Tests of fully hot and fully cold ends of the spectrum would be interesting, but also tests where a big segment merge or a backup is running in the background...

bq. That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks.

That's a nice goal. Our biggest cost in Lucene is warming the FieldCache, used for sorting, function queries, etc. Column-stride fields should go a ways towards improving this.

bq. It's also very complicated, which of course bothers me more than it bothers you. So I imagine we'll choose different paths.

I think if we make the pluggable API simple, and capture the complexity inside each impl, such that it can be well tested in isolation, it's acceptable.

bq. If we treat the term dictionary as a black box, it has to accept a term and return... a blob, I guess. Whatever calls the lookup needs to know how to handle that blob.

In my approach here, the blob is opaque to the terms dict reader: it simply seeks to the right spot in the tis file, and then asks the codec to decode the entry. TermsDictReader is entirely unaware of what/how is stored there.
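The "opaque blob" arrangement described in the reply above can be illustrated with a small, self-contained Java sketch. This is not code from the patch: `EntryDecoder` and `OpaqueTermsDictSketch` are hypothetical names, a `ByteBuffer` stands in for the tis file, and a `TreeMap` holding every term stands in for the tii index (the real index is sparse). The only point shown is that the reader seeks to an offset and hands the bytes to the codec without interpreting them.

```java
import java.nio.ByteBuffer;
import java.util.TreeMap;

/** Hypothetical codec hook: decodes whatever the writer stored for one term. */
interface EntryDecoder<T> {
  T decode(ByteBuffer in);
}

/** Sketch of a terms dict reader that never interprets the per-term payload. */
final class OpaqueTermsDictSketch<T> {
  private final ByteBuffer tis;                    // stands in for the tis file
  private final TreeMap<String, Integer> offsets;  // stands in for the tii index (assumed dense here)
  private final EntryDecoder<T> decoder;           // the pluggable codec

  OpaqueTermsDictSketch(ByteBuffer tis, TreeMap<String, Integer> offsets, EntryDecoder<T> decoder) {
    this.tis = tis;
    this.offsets = offsets;
    this.decoder = decoder;
  }

  /** Returns the decoded entry for the term, or null if the term is absent. */
  T lookup(String term) {
    Integer offset = offsets.get(term);
    if (offset == null) {
      return null;
    }
    ByteBuffer view = tis.duplicate();  // independent position, shared bytes
    view.position(offset);              // "seek to the right spot"
    return decoder.decode(view);        // the codec alone knows the entry layout
  }
}
```

With this shape, changing what is stored per term only means supplying a different `EntryDecoder`; the lookup code stays the same.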
|
|
Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

Michael B: Are you interested in making column stride fields realtime and using the btree for the terms? This is an idea I started on that I called tag index, where the postings are divided into blocks. The blocks can then be replaced in memory, with a periodic flush to disk as the in-RAM postings grow.

Michael M: How would the term compression be handled in a btree model?

On Wed, Nov 19, 2008 at 2:29 AM, Michael McCandless (JIRA) <jira@...> wrote:
|
|
|
Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

Flexible indexing doesn't try to address the real-time updates; it only tries to make index writing & reading modular so that you can plug in your own codecs for your own format. Also, so far, I've only worked on the postings lists.

I think for column-stride fields it should be easier to implement updates.

Mike

Jason Rutherglen wrote:
> Nice! I'm looking at using PForDelta in creating the tag index type
> of system. Do you think there is an elegant way to add realtime
> updates to individual fields using the current (or future) flexible
> indexing API?
>
> On Tue, Nov 18, 2008 at 2:11 PM, Michael McCandless (JIRA) <jira@...> wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Michael McCandless updated LUCENE-1458:
> ---------------------------------------
>
> Attachment: LUCENE-1458.patch
>
> [Attached patch]
>
> To test whether the new pluggable codec approach is flexible enough, I
> coded up "pulsing" (described in detail in
> http://citeseer.ist.psu.edu/cutting90optimizations.html), where
> freq/prox info is inlined into the terms dict if the term freq is < N.
>
> It was wonderfully simple :) I just had to create a reader & a writer,
> and then switch the places that read (SegmentReader) and write
> (SegmentMerger, FreqProxTermsWriter) to use the new pulsing codec
> instead of the default one.
>
> The pulsing codec can "wrap" any other codec, ie, when a term is
> written, if the term's freq is < N, then it's inlined into the terms
> dict with the pulsing writer, else it's fed to the other codec for it
> to do whatever it normally would. The two codecs are strongly
> decoupled, so we can mix & match pulsing with other codecs like pfor.
>
> All tests pass with this pulsing codec.
>
> As a quick test I indexed first 1M docs from Wikipedia, with N=2 (ie
> terms that occur only in one document are inlined into the terms
> dict). 5.4M terms get inlined (only 1 doc) and 2.2M terms are not
> (> 1 doc). The final size of the index (after optimizing) was a bit
> smaller with pulsing (1120 MB vs 1131 MB).
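The "wrapping" behaviour of the pulsing codec quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch's API: `PostingsSink`, `RecordingSink` and `PulsingSinkSketch` are hypothetical names, postings are reduced to a plain array of doc IDs, and "term freq" is read as the number of documents containing the term, which matches the N=2 example in the message.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal writer interface; not the API from the patch. */
interface PostingsSink {
  void writeTerm(String term, int[] docIDs);
}

/** Records the terms it receives; stands in for the default frq/prx codec. */
final class RecordingSink implements PostingsSink {
  final List<String> terms = new ArrayList<>();

  @Override
  public void writeTerm(String term, int[] docIDs) {
    terms.add(term);
  }
}

/** Inlines rare terms and delegates all other terms to the wrapped sink. */
final class PulsingSinkSketch implements PostingsSink {
  private final int cutoff;            // N in the message: inline when doc count < N
  private final PostingsSink wrapped;  // any other codec, e.g. the default one
  final List<String> inlinedTerms = new ArrayList<>();

  PulsingSinkSketch(int cutoff, PostingsSink wrapped) {
    this.cutoff = cutoff;
    this.wrapped = wrapped;
  }

  @Override
  public void writeTerm(String term, int[] docIDs) {
    if (docIDs.length < cutoff) {
      // A real writer would serialize the postings into the terms dict entry;
      // here we only remember which terms took the inline path.
      inlinedTerms.add(term);
    } else {
      wrapped.writeTerm(term, docIDs);
    }
  }
}
```

Because the wrapper only depends on the small interface, the wrapped sink can be swapped for any other implementation without touching the cutoff logic, which is the decoupling the message describes.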
|
|
Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

MutableString looks cool but totally different from flexible indexing.

Mike

Jason Rutherglen wrote:
> On a side note, and I have not looked at the flexible indexing API
> enough to know if there is some equivalent but are we moving to
> something like MG4J's MutableString
> http://mg4j.dsi.unimi.it/docs/it/unimi/dsi/mg4j/util/MutableString.html
> instead of java.lang.String objects?
>
> On Tue, Nov 18, 2008 at 2:33 AM, Michael McCandless (JIRA) <jira@...> wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Michael McCandless updated LUCENE-1458:
> ---------------------------------------
>
> Attachment: LUCENE-1458.patch