On Mon, Jun 29, 2009 at 6:28 AM, Michael McCandless <lucene@...> wrote:
> On Sun, Jun 28, 2009 at 9:08 PM, Nigel <nigelspleen@...> wrote:
> >> Unfortunately the TermInfos must still be hit to look up the
> >> freq/proxOffset in the postings files.
> >
> > But for that data you only have to hit the TermInfos for the terms you're
> > searching, correct? So, assuming that there are vastly more terms in the
> > index than you ever actually use in a query, we could rely on the LRU cache
> > to keep the queried TermInfos around, rather than loading all of them
> > up-front. This was a hypothesis based on some tracing through the code but
> > not a lot of knowledge of Lucene internals, so please steer me back to
> > reality if necessary. (-:
>
> Right, it's only the terms in your query that need to be looked up.
>
> There's already an LRU cache for the Term -> TermInfos lookup, in
> TermInfosReader (hard wired to size 1024). It was created as a "short
> term" cache so that queries that look up the same term twice (eg once
> for idf and once to get freq/prox postings position), would result in
> only one real lookup. You could try privately experimenting w/ that
> value to see if you can get a reasonable hit rate?

Exactly, the TermInfosReader cache is what I was thinking of. Thanks for
the confirmation; I'll give that a try.
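
Before touching the hard-wired 1024, it may help to see how hit rate responds to cache size on a replay of real query terms. Below is a minimal standalone sketch of an LRU cache with hit/miss counters, built on java.util.LinkedHashMap. It is not Lucene's cache and does not touch TermInfosReader; the class name CountingLruCache and its methods are hypothetical, chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a Term -> TermInfo cache; not Lucene code. */
class CountingLruCache<K, V> {
    private final Map<K, V> map;
    private long hits;
    private long misses;

    CountingLruCache(final int maxSize) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once the cap is exceeded.
                return size() > maxSize;
            }
        };
    }

    /** Returns the cached value or null, counting a hit or a miss. */
    V get(K key) {
        V value = map.get(key);
        if (value != null) {
            hits++;
        } else {
            misses++;
        }
        return value;
    }

    void put(K key, V value) {
        map.put(key, value);
    }

    /** Fraction of get() calls that found a cached value. */
    double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

Replaying a log of query terms through this at a few sizes (1024, 16K, 256K, say) should show roughly where the hit rate levels off, assuming the log is representative of production traffic.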

> Lucene only loads the "indexed" terms (= every 128th) into RAM, by default.

Ah, I was confused by the index divisor being 1 by default: I thought it
meant that all terms were being loaded. I see now in SegmentTermEnum that
the every-128th behavior is implemented at a lower level.

But I'm even more confused about why we have so many terms in memory. A
heap dump shows over 270 million TermInfos, so if that's only 1/128th of the
total then we REALLY have a lot of terms. (-: We do have a lot of docs
(about 250 million), and we do have a couple unique per-document values, but
even so I can't see how we could get to 270 million x 128 terms. (The heap
dump numbers are stable across the index close-and-reopen cycle, so I don't
think we're leaking.)
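
To spell out the arithmetic, here is a rough sketch. It assumes every TermInfo in the heap dump is an indexed term and that the interval is the default 128; the class and method names are hypothetical.

```java
/** Back-of-the-envelope check; names are hypothetical, numbers are from the heap dump. */
class TermCountEstimate {
    /** Total terms implied if only every indexInterval-th term is held in RAM. */
    static long impliedTotalTerms(long termInfosInRam, int indexInterval) {
        return termInfosInRam * indexInterval;
    }

    /** Average number of unique terms each document would have to contribute. */
    static double uniqueTermsPerDoc(long totalTerms, long docCount) {
        return (double) totalTerms / docCount;
    }

    public static void main(String[] args) {
        long total = impliedTotalTerms(270000000L, 128);
        System.out.println("implied total terms: " + total); // 34560000000
        System.out.println("unique terms per doc: "
                + uniqueTermsPerDoc(total, 250000000L));
    }
}
```

That works out to about 34.6 billion terms, or roughly 138 unique terms per document, far more than a couple of unique per-document values would explain.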

Thanks,
Chris