Omit positions but not TF

Hi,

During one of the discussions at ApacheCon it occurred to me that it would be useful to have an option to discard positional information but still keep the term frequency. Even though position-dependent queries wouldn't work then, any other queries would still work fine and we would get the right scoring.

I believe it should be possible to do this without changing the file format, if we used a negative term frequency for terms without postings - we would have to check for that condition in SegmentTermDocs, change the flags there and flip the sign of docFreq. And eventually we may want to add a separate flag for this and bump the format version.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...
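The sign-flip idea in this message can be sketched roughly as follows. This is not Lucene code: the names `SignedFreqSketch`, `encodeFreq`, `decodeFreq` and `Decoded` are hypothetical, and the sketch only shows how a reader could tell the two cases apart from the sign of the stored value.

```java
// Illustrative sketch only; none of these names exist in Lucene.
final class SignedFreqSketch {

    /** Hypothetical result of decoding one stored frequency value. */
    static final class Decoded {
        final int freq;
        final boolean hasPositions;

        Decoded(int freq, boolean hasPositions) {
            this.freq = freq;
            this.hasPositions = hasPositions;
        }
    }

    /** Stores the frequency negated when positions are omitted for the term. */
    static int encodeFreq(int freq, boolean omitPositions) {
        if (freq <= 0) {
            throw new IllegalArgumentException("freq must be positive: " + freq);
        }
        return omitPositions ? -freq : freq;
    }

    /** Checks the sign, flips it back if needed, and reports whether positions exist. */
    static Decoded decodeFreq(int stored) {
        if (stored == 0) {
            throw new IllegalArgumentException("stored freq must be non-zero");
        }
        return stored < 0 ? new Decoded(-stored, false) : new Decoded(stored, true);
    }

    public static void main(String[] args) {
        Decoded d = decodeFreq(encodeFreq(3, true));
        System.out.println("freq=" + d.freq + " hasPositions=" + d.hasPositions);
    }
}
```

The appeal of this approach is that no new flag is needed in the file format; the cost is that the stored value becomes a negative int.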
Re: Omit positions but not TF

+1

I guess we'd add a Fieldable.setOmitPositions? And then save that in FieldInfos, and fix the postings writing/reading to respect it? Ie, we can just change the index format.

Encoding as negative numbers isn't great because the termFreq is written as a vInt, which consumes 5 bytes to encode any negative number.

Wanna cough up a patch? Probably this should wait until 3.1.

Mike

On Sat, Nov 7, 2009 at 7:47 PM, Andrzej Bialecki <ab@...> wrote:
> Hi,
>
> During one of discussions at ApacheCon it occurred to me that it would be
> useful to have an option to discard positional information but still keep
> the term frequency. Even though position-dependent queries wouldn't work
> then, still any other queries would work fine and we would get the right
> scoring.
>
> I believe it should be possible to do this without changing the file format,
> if we used a negative term frequency for terms without postings - we would
> have to check for that condition in SegmentTermDocs, change the flags there
> and flip the sign of docFreq. And eventually we may want to add a separate
> flag for this and bump the format version.
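The vInt size objection can be checked with a small standalone sketch. It assumes the usual variable-length scheme of 7 payload bits per byte, with the high bit marking that more bytes follow. The class and method names (`VIntSketch`, `encodeVInt`) are made up for illustration; this is not the actual Lucene writer.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only; not Lucene's own output classes.
final class VIntSketch {

    /** Encodes an int with 7 payload bits per byte; a set high bit means "more bytes follow". */
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so the loop also terminates for negative input
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] samples = {1, 127, 128, -1, -5};
        for (int s : samples) {
            System.out.println(s + " -> " + encodeVInt(s).length + " byte(s)");
        }
    }
}
```

A negative int has its top bit set, so all 32 bits have to be written out, which takes five 7-bit groups. Small positive frequencies fit in a single byte, so under this scheme negating them would make each stored frequency several times larger.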
Re: Omit positions but not TF

Michael McCandless wrote:
> +1
>
> I guess we'd add a Fieldable.setOmitPositions? And then save that in
> FieldInfos, and fix the postings writing/reading to respect it? Ie,
> we can just change the index format. Encoding as negative numbers

Yes, that's what I had in mind. I was a bit shy of bumping the format version, but likely there will be other changes that we can put under the same next version of the format.

> isn't great because the termFreq is written as a vInt, which consumes
> 5 bytes to encode any negative number. Wanna cough up a patch?

Heh .. that's the right term for it, I haven't looked at the details of oal.index.* since 2.4-ish or so ... we'll see ;)

> Probably this should wait until 3.1.

+1.

--
Best regards,
Andrzej Bialecki
Re: Omit positions but not TF

Andrzej Bialecki wrote:
> Michael McCandless wrote:
>> isn't great because the termFreq is written as a vInt, which consumes
>> 5 bytes to encode any negative number. Wanna cough up a patch?
>
> Heh .. that's the right term for it, I haven't looked at the details of
> oal.index.* since 2.4-ish or so ... we'll see ;)

Ehh, sorry - I think I'll give up for now, after looking at the combinatoric increase in the number of arguments to various indexing classes ...

--
Best regards,
Andrzej Bialecki
Re: Omit positions but not TF

How about opening an issue? This way someone else can come along and pick up the torch...

Mike

On Mon, Nov 9, 2009 at 11:26 AM, Andrzej Bialecki <ab@...> wrote:
> Ehh, sorry - I think I'll give up for now, after looking at the combinatoric
> increase in the number of arguments to various indexing classes ...
Re: Omit positions but not TF

On Mon, Nov 9, 2009 at 6:03 PM, Michael McCandless <lucene@...> wrote:
> How about opening an issue? This way someone else can come along and
> pick up the torch...

+1