Re: Boolean retrieval


Re: Boolean retrieval

by markharw00d


Check out BooleanFilter in contrib/queries. It can be wrapped in a ConstantScoreQuery.



On 4 Jul 2009, at 17:37, Lukas Michelbacher <michells@...> wrote:


This is about an experiment comparing plain Boolean retrieval with
vector-space-based retrieval.

I would like to disable all of Lucene's scoring mechanisms and just
run a true Boolean query that returns exactly the documents that match a
query specified in Boolean syntax (OR, AND, NOT). No scoring or sorting
required.

As far as I can see, this is not supported out of the box.  Which classes
would I have to modify?

Would it be enough to create a subclass of Similarity and to ignore all terms but one (coord, say) and make this term return 1 if the query matches the document and 0 otherwise?

Lukas

--
Lukas Michelbacher
Institute for Natural Language Processing
Universität Stuttgart
email: michells@...

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@...
For additional commands, e-mail: java-user-help@...







Re: Boolean retrieval

by Paul Elschot

It is also possible to use the HitCollector api and simply ignore
the score values.

Regards,
Paul Elschot


On Saturday 04 July 2009 21:14:41 Mark Harwood wrote:

>
> Check out booleanfilter in contrib/queries. It can be wrapped in a constantScoreQuery


Re: Boolean retrieval

by Michael McCandless-2

As of 2.9 (not yet released) the new Collector API allows you to skip
scoring entirely and just collect the doc IDs matching the query.

Mike

On Sat, Jul 4, 2009 at 12:37 PM, Lukas Michelbacher <michells@...> wrote:

>
> This is about an experiment comparing plain Boolean retrieval with
> vector-space-based retrieval.
>
> I would like to disable all of Lucene's scoring mechanisms and just
> run a true Boolean query that returns exactly the documents that match a
> query specified in Boolean syntax (OR, AND, NOT). No scoring or sorting
> required.
>
> As far as I can see, this is not supported out of the box.  Which classes
> would I have to modify?
>
> Would it be enough to create a subclass of Similarity and to ignore all
> terms but one (coord, say) and make this term return 1 if the query matches
> the document and 0 otherwise?
>
> Lukas


Re: Boolean retrieval

by Lukas Michelbacher-2

To test my Boolean queries, I have a small test collection where each document
contains one of 1024 possible combinations of the strings "aaa", "bbb",
... "jjj".  I tried wrapping a Boolean query like this (it's based on an
older post to this list [1])


private static TermsFilter getTermsFilter(String field, String text) {
  TermsFilter tf = new TermsFilter();
  tf.addTerm(new Term(field, text));
  return tf;
}

Query q = new QueryParser("f1", new StandardAnalyzer()).parse("(aaa AND bbb) OR ccc");
IndexSearcher searcher = new IndexSearcher(indexDir);
TopDocCollector collector = new TopDocCollector(1024);

BooleanQuery bc = (BooleanQuery) q;
BooleanFilter finalFilter = new BooleanFilter();
BooleanFilter boolFilt = new BooleanFilter();

// add each clause of the original query to the filter
for (BooleanClause clause : bc.getClauses()) {
  boolFilt.add(new FilterClause(
      getTermsFilter("f1", clause.getQuery().toString()),
      clause.getOccur()));
  System.out.println(clause.getQuery().toString());
}

finalFilter.add(new FilterClause(boolFilt, BooleanClause.Occur.MUST));

ConstantScoreQuery csq = new ConstantScoreQuery(finalFilter);
searcher.search(csq, finalFilter, collector);

ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("Found " + collector.getTotalHits() + " hits");

The result is 0 hits (should be 640).
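
As a quick sanity check of the expected count (a sketch, not part of the original post): assuming the collection holds exactly one document per subset of the ten strings, every document containing "ccc" matches (512), plus those without "ccc" that contain both "aaa" and "bbb" (128), which gives 640. The class and method names below are made up for illustration, and the code uses plain Java rather than any Lucene classes.

```java
// Brute-force check of the expected hit count for "(aaa AND bbb) OR ccc".
// Assumption: each of the 1024 documents holds a distinct subset of the ten
// strings "aaa" ... "jjj"; a document is modeled as a 10-bit mask.
final class ExpectedHits {
    static final int TERMS = 10;      // "aaa" .. "jjj"
    static final int AAA = 1 << 0;    // bit for "aaa"
    static final int BBB = 1 << 1;    // bit for "bbb"
    static final int CCC = 1 << 2;    // bit for "ccc"

    // True if the document (bitmask of the terms it contains) satisfies the query.
    static boolean matches(int doc) {
        boolean aaa = (doc & AAA) != 0;
        boolean bbb = (doc & BBB) != 0;
        boolean ccc = (doc & CCC) != 0;
        return (aaa && bbb) || ccc;
    }

    // Counts matching documents over all 2^10 term combinations.
    static int countMatches() {
        int hits = 0;
        for (int doc = 0; doc < (1 << TERMS); doc++) {
            if (matches(doc)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("Expected hits: " + countMatches()); // prints "Expected hits: 640"
    }
}
```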

[1] tinyurl.com/ml52ye

2009/7/4 Mark Harwood <markharw00d@...>:

>
> Check out booleanfilter in contrib/queries. It can be wrapped in a constantScoreQuery


Re: Boolean retrieval

by markharw00d


Seems a long-winded way of producing a BooleanFilter but I guess you are trying to work with user input in the form of query strings.

The bug in your code is that clause.getQuery().toString() does not produce terms that are in your index: the first call to getTermsFilter passes the string "+f1:aaa +f1:bbb", which is not a term in the index.
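
To illustrate the point with a self-contained sketch: the parsed query is a tree, and the first top-level clause is itself a Boolean query, so it has to be walked recursively rather than flattened into a single term string. The model below uses plain Java BitSets instead of Lucene filters; Node, term, and, and or are hypothetical names, and the collection is assumed to hold one document per subset of the ten strings. In Lucene itself the analogous step would presumably be to recurse whenever a clause's query is a BooleanQuery and to read the term of a leaf TermQuery via getTerm().

```java
import java.util.BitSet;

// Toy model of Boolean retrieval over nested clauses (not Lucene code).
// Documents are the integers 0..1023; bit i of a document id says whether
// the document contains term i (0 = "aaa", 1 = "bbb", 2 = "ccc", ...).
final class NestedBooleanEval {
    static final int TERMS = 10;
    static final int DOCS = 1 << TERMS;

    // A query node evaluates to the set of matching document ids.
    interface Node {
        BitSet eval();
    }

    // Leaf clause: all documents that contain the given term.
    static Node term(final int termIndex) {
        return () -> {
            BitSet docs = new BitSet(DOCS);
            for (int doc = 0; doc < DOCS; doc++) {
                if ((doc & (1 << termIndex)) != 0) {
                    docs.set(doc);
                }
            }
            return docs;
        };
    }

    // Conjunction: intersect the results of all child clauses.
    static Node and(final Node... children) {
        return () -> {
            BitSet result = new BitSet(DOCS);
            result.set(0, DOCS); // start from "all documents"
            for (Node child : children) {
                result.and(child.eval()); // recursion handles nested clauses
            }
            return result;
        };
    }

    // Disjunction: union the results of all child clauses.
    static Node or(final Node... children) {
        return () -> {
            BitSet result = new BitSet(DOCS);
            for (Node child : children) {
                result.or(child.eval());
            }
            return result;
        };
    }

    public static void main(String[] args) {
        // (aaa AND bbb) OR ccc: the first clause is a nested node, not a term.
        Node query = or(and(term(0), term(1)), term(2));
        System.out.println("Found " + query.eval().cardinality() + " hits");
    }
}
```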

Given the requirement is to ignore scoring, I would recommend (as someone else suggested) looking at the IndexSearcher.search method that takes a HitCollector and simply accumulating all results, regardless of score.





----- Original Message ----
From: Lukas Michelbacher <mmmasterluke@...>
To: java-user@...
Sent: Tuesday, 7 July, 2009 9:53:24
Subject: Re: Boolean retrieval



Re: Boolean retrieval

by Michael McCandless-2

On Tue, Jul 7, 2009 at 5:39 AM, mark harwood<markharw00d@...> wrote:

> Given the requirement is to ignore scoring, I would recommend (as someone else suggested) looking at the IndexSearcher.search method that takes a HitCollector and simply accumulating all results, regardless of score.

Make that "Collector" (new as of 2.9).

HitCollector is the old (deprecated as of 2.9) way, which always
pre-computed the score of each hit and passed the score to the collect
method.

Collector, by contrast, makes scoring optional (your collect method must
actually request the score).

For example, sorting by field makes use of this (in 2.9) by making scoring optional.
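
The difference can be modeled in a few lines of plain Java. This is only a toy sketch: CountingScorer, EagerCollector and LazyCollector are hypothetical stand-ins, not Lucene's HitCollector or Collector classes, and Lucene's real Collector is handed its scorer through a separate callback rather than as an argument to collect. The sketch just counts how often a score is computed when the collector only wants doc IDs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy comparison of eager and on-demand scoring (not Lucene code).
final class LazyScoringSketch {
    // Hypothetical scorer that records how many times a score was computed.
    static final class CountingScorer {
        private int scoreCalls = 0;

        float score() {
            scoreCalls++;
            return 1.0f;
        }

        int getScoreCalls() {
            return scoreCalls;
        }
    }

    // Old style: the score is computed for every hit and passed to collect.
    interface EagerCollector {
        void collect(int doc, float score);
    }

    // New style: collect gets the doc id and must ask for the score itself.
    interface LazyCollector {
        void collect(int doc, CountingScorer scorer);
    }

    // Number of score computations when an id-only collector is driven eagerly.
    static int eagerScoreCalls(int[] matchingDocs) {
        CountingScorer scorer = new CountingScorer();
        List<Integer> ids = new ArrayList<>();
        EagerCollector collector = (doc, score) -> ids.add(doc);
        for (int doc : matchingDocs) {
            collector.collect(doc, scorer.score()); // score computed regardless
        }
        return scorer.getScoreCalls();
    }

    // Number of score computations when the same collector never asks for a score.
    static int lazyScoreCalls(int[] matchingDocs) {
        CountingScorer scorer = new CountingScorer();
        List<Integer> ids = new ArrayList<>();
        LazyCollector collector = (doc, s) -> ids.add(doc);
        for (int doc : matchingDocs) {
            collector.collect(doc, scorer); // scorer is available but unused
        }
        return scorer.getScoreCalls();
    }

    public static void main(String[] args) {
        int[] docs = {3, 7, 42};
        System.out.println("eager: " + eagerScoreCalls(docs) + ", lazy: " + lazyScoreCalls(docs));
    }
}
```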

Mike



Re: Boolean retrieval

by Lukas Michelbacher-2

> Seems a long-winded way of producing a BooleanFilter but I guess you are trying to work with user input in the form of query strings.

Yes I am. I had the same impression but I couldn't figure out a more
straightforward way.

> The bug in your code is that clause.getQuery().toString() does not produce terms that are in your index: the first call to getTermsFilter passes the string "+f1:aaa +f1:bbb", which is not a term in the index.

OK, thanks. I'll have a look at that.

> Given the requirement is to ignore scoring, I would recommend (as someone else suggested) looking at the IndexSearcher.search method that takes a HitCollector and simply accumulating all results, regardless of score.

I think that's what I'm going to end up doing. I was just curious to
know what was wrong with my initial approach.

Thanks for your comments.

Lukas



Re: Boolean retrieval

by tsuraan

> Make that "Collector" (new as of 2.9).
>
> HitCollector is the old (deprecated as of 2.9) way, which always
> pre-computed the score of each hit and passed the score to the collect
> method.

Where can I find docs for 2.9?  Do I just have to check out the lucene
trunk and run javadoc there?



Re: Boolean retrieval

by Koji Sekiguchi-2

tsuraan wrote:

>> Make that "Collector" (new as of 2.9).
>>
>> HitCollector is the old (deprecated as of 2.9) way, which always
>> pre-computed the score of each hit and passed the score to the collect
>> method.
>>    
>
> Where can I find docs for 2.9?  Do I just have to check out the lucene
> trunk and run javadoc there?
>
>  
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/index.html

Koji




Re: Boolean retrieval

by tsuraan

> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/index.html
>
> Koji

Thanks!
