Omit positions but not TF


by Andrzej Bialecki

Hi,

During one of the discussions at ApacheCon it occurred to me that it
would be useful to have an option to discard positional information but
still keep the term frequency. Position-dependent queries (phrase and
span queries) would no longer work, but all other queries would work
fine and we would still get the right scoring.

I believe it should be possible to do this without changing the file
format, if we used a negative term frequency for terms without positions
- we would have to check for that condition in SegmentTermDocs, change
the flags there and flip the sign of the term frequency back. And
eventually we may want to add a separate flag for this and bump the
format version.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...
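
The sign trick proposed above can be sketched in a few lines. This is a toy model only, assuming a flat list of ints as the "file"; the class and method names (SignedFreqSketch, Posting, encode, decode) are made up for illustration and are not Lucene's SegmentTermDocs or its actual postings format. It shows a reader checking the sign of the stored frequency, flipping it back, and skipping positions when the value was negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration only: NOT Lucene's SegmentTermDocs or its file format.
// Assumes freq >= 1 and, when positions are kept, positions.length == freq.
class SignedFreqSketch {

    /** One posting; positions is null when positional data was omitted. */
    static final class Posting {
        final int doc;
        final int freq;
        final int[] positions;

        Posting(int doc, int freq, int[] positions) {
            this.doc = doc;
            this.freq = freq;
            this.positions = positions;
        }
    }

    /** Flattens postings to ints: doc, signed freq, then positions if present. */
    static List<Integer> encode(List<Posting> postings) {
        List<Integer> out = new ArrayList<>();
        for (Posting p : postings) {
            out.add(p.doc);
            if (p.positions == null) {
                out.add(-p.freq);              // negative freq marks "no positions follow"
            } else {
                out.add(p.freq);
                for (int pos : p.positions) {
                    out.add(pos);
                }
            }
        }
        return out;
    }

    /** Reverses encode(): checks the sign and flips it back to recover the real freq. */
    static List<Posting> decode(List<Integer> data) {
        List<Posting> postings = new ArrayList<>();
        int i = 0;
        while (i < data.size()) {
            int doc = data.get(i++);
            int stored = data.get(i++);
            if (stored < 0) {
                postings.add(new Posting(doc, -stored, null));
            } else {
                int[] positions = new int[stored];
                for (int k = 0; k < stored; k++) {
                    positions[k] = data.get(i++);
                }
                postings.add(new Posting(doc, stored, positions));
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<Posting> in = Arrays.asList(
            new Posting(3, 2, new int[] {5, 17}),  // positions kept
            new Posting(8, 4, null));              // positions omitted, freq kept
        for (Posting p : decode(encode(in))) {
            System.out.println("doc=" + p.doc + " freq=" + p.freq
                + " positions=" + Arrays.toString(p.positions));
        }
    }
}
```

The term frequency survives the round trip in both cases, so tf-based scoring would be unaffected; only consumers that need positions have to handle the null case.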


Re: Omit positions but not TF

by Michael McCandless-2

+1

I guess we'd add a Fieldable.setOmitPositions?  And then save that in
FieldInfos, and fix the postings writing/reading to respect it?  Ie,
we can just change the index format.  Encoding as negative numbers
isn't great because the termFreq is written as a vInt, which consumes
5 bytes to encode any negative number.  Wanna cough up a patch?
Probably this should wait until 3.1.

Mike
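
The 5-byte cost mentioned above follows from how a vInt is laid out: 7 data bits per byte, with the high bit of each byte signalling that more bytes follow. A negative 32-bit int always has its top bit set, so all 32 bits must be written, which takes five 7-bit groups. The snippet below is a minimal standalone sketch of that scheme for counting bytes; the class name VIntSketch and the helper vIntLength are made up for illustration, and this is not Lucene's own IndexOutput code.

```java
import java.io.ByteArrayOutputStream;

// Standalone sketch of the 7-bits-per-byte variable-length int encoding.
class VIntSketch {

    /** Writes i low-order bits first; every byte except the last has its high bit set. */
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);  // 7 payload bits plus continuation bit
            i >>>= 7;                      // unsigned shift, so negative values terminate
        }
        out.write(i);                      // final byte, continuation bit clear
    }

    /** Returns how many bytes the encoding of value occupies. */
    static int vIntLength(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, value);
        return out.size();
    }

    public static void main(String[] args) {
        int[] samples = {1, 127, 128, 16383, 16384, -1, -3};
        for (int v : samples) {
            System.out.println(v + " -> " + vIntLength(v) + " byte(s)");
        }
    }
}
```

Since most term frequencies are small, a positive freq usually fits in one byte, while its negated form would always take five. That is the overhead a separate per-field flag would avoid.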

On Sat, Nov 7, 2009 at 7:47 PM, Andrzej Bialecki <ab@...> wrote:

> Hi,
>
> During one of the discussions at ApacheCon it occurred to me that it would be
> useful to have an option to discard positional information but still keep
> the term frequency. Position-dependent queries (phrase and span queries)
> would no longer work, but all other queries would work fine and we would
> still get the right scoring.
>
> I believe it should be possible to do this without changing the file format,
> if we used a negative term frequency for terms without positions - we would
> have to check for that condition in SegmentTermDocs, change the flags there
> and flip the sign of the term frequency back. And eventually we may want to
> add a separate flag for this and bump the format version.


Re: Omit positions but not TF

by Andrzej Bialecki

Michael McCandless wrote:
> +1
>
> I guess we'd add a Fieldable.setOmitPositions?  And then save that in
> FieldInfos, and fix the postings writing/reading to respect it?  Ie,
> we can just change the index format.  Encoding as negative numbers

Yes, that's what I had in mind. I was a bit shy of bumping the format
version, but likely there will be other changes that we can put under
the same next version of the format.

> isn't great because the termFreq is written as a vInt, which consumes
> 5 bytes to encode any negative number.  Wanna cough up a patch?

Heh .. that's the right term for it, I haven't looked at the details of
oal.index.* since 2.4-ish or so ... we'll see ;)

> Probably this should wait until 3.1.

+1.



Re: Omit positions but not TF

by Andrzej Bialecki

Andrzej Bialecki wrote:

> Michael McCandless wrote:
>> +1
>>
>> I guess we'd add a Fieldable.setOmitPositions?  And then save that in
>> FieldInfos, and fix the postings writing/reading to respect it?  Ie,
>> we can just change the index format.  Encoding as negative numbers
>
> Yes, that's what I had in mind. I was a bit shy of bumping the format
> version, but likely there will be other changes that we can put under
> the same next version of the format.
>
>> isn't great because the termFreq is written as a vInt, which consumes
>> 5 bytes to encode any negative number.  Wanna cough up a patch?
>
> Heh .. that's the right term for it, I haven't looked at the details of
> oal.index.* since 2.4-ish or so ... we'll see ;)

Ehh, sorry - I think I'll give up for now, after looking at the
combinatoric increase in the number of arguments to various indexing
classes ...



Re: Omit positions but not TF

by Michael McCandless-2

How about opening an issue?  This way someone else can come along and
pick up the torch...

Mike

On Mon, Nov 9, 2009 at 11:26 AM, Andrzej Bialecki <ab@...> wrote:

> Ehh, sorry - I think I'll give up for now, after looking at the combinatoric
> increase in the number of arguments to various indexing classes ...


Re: Omit positions but not TF

by Simon Willnauer

On Mon, Nov 9, 2009 at 6:03 PM, Michael McCandless
<lucene@...> wrote:
> How about opening an issue?  This way someone else can come along and
> pick up the torch...
+1
