On 1/4/07, Luis Neves <luis.neves@...> wrote:
> Yonik Seeley wrote:
> > Off the top of my head, one could use a priority queue that can change
> > its size dynamically. One could increment a group count for each hit
> > (like faceted search with the FieldCache) and if the group count
> > exceeds "n", then you increment the size of the priority queue to
> > allow an additional item to be collected to compensate.
> >
> > -Yonik
>
> You might as well say that I have to change the dilithium crystals in the flux
> capacitor :-)
Heh...
When someone asks for the top 10 documents, we create a priority queue
of size 10 and put all of the hits through it (with a performance
shortcut if the only sort is by score). After we are all done, the
queue contains the top 10 documents by the sort criteria.
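To make the plain top-N case concrete, here is a minimal sketch in Python rather than the actual Lucene/Solr code. The function name and the (score, doc_id) tuples are made up for illustration, and it sorts by score only.

```python
import heapq

def top_n(hits, n=10):
    """Return the n best (score, doc_id) pairs, best first.

    `hits` is any iterable of (score, doc_id) pairs. The names are
    illustrative and are not taken from Lucene or Solr.
    """
    heap = []  # min-heap: heap[0] is the weakest hit currently kept
    for hit in hits:
        if len(heap) < n:
            heapq.heappush(heap, hit)
        elif hit > heap[0]:
            # Better than the weakest kept hit, so it takes its place.
            heapq.heapreplace(heap, hit)
    # Only the survivors are sorted, never the full hit list.
    return sorted(heap, reverse=True)
```

Every hit goes through the queue once, and the queue never holds more than n entries.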
Now let's say we are limiting the number of results from any "site" to 2.
If we add another document to the priority queue and it will be the
3rd from a specific site, there are two things we could do:
1) remove the lowest ranking document from the 3 documents matching that site
2) increase the size of the priority queue to 11 since we will be
throwing one of the documents away later.
At first blush, option (2) seemed easier to me, with the added step of
discarding the extra documents as you pull them from the queue.
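A rough sketch of option (2), again in Python and not tied to any Solr internals. All names here are hypothetical. It assumes each hit carries its group value, counts every hit per group (as with the facet-style group count), and simply drops documents past the per-group limit when draining the queue. I have only checked it against small examples, so treat it as an illustration of the idea rather than a proven algorithm.

```python
import heapq
from collections import defaultdict

def top_k_collapsed(hits, k, per_group):
    """Collect the top k doc ids with at most `per_group` docs per group.

    `hits` is an iterable of (score, doc_id, group) tuples. This is an
    illustrative sketch, not Solr's implementation.
    """
    capacity = k               # the queue size, which may grow
    seen = defaultdict(int)    # hits seen so far per group
    heap = []                  # min-heap of (score, doc_id, group)
    for entry in hits:
        group = entry[2]
        seen[group] += 1
        if seen[group] > per_group:
            # One document of this group will be thrown away later,
            # so allow one additional item to be collected.
            capacity += 1
        if len(heap) < capacity:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)

    # Drain best first, discarding the extra documents of each group.
    kept = defaultdict(int)
    result = []
    for score, doc_id, group in sorted(heap, reverse=True):
        if kept[group] < per_group:
            kept[group] += 1
            result.append(doc_id)
            if len(result) == k:
                break
    return result
```

Note that this version keeps documents in plain score order and never moves a lower-ranking document up to sit next to its group.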
> One of the reasons I like Solr so much is because I get impressive results
> without having to know Lucene, which is something that will have to change
> because I also need this feature.
>
> Not knowing much about the internals of Solr/Lucene I had a look at the Facet
> code in search of ideas, but from what I could see the facet counts are
> calculated after the Documents are added to the response, it seems to me that
> any kind of grouping has to be done before that... right?
Right.
> Could you explain in more detail where should I look?
>
> Can the TopFieldDocCollector/TopFieldDocs classes be used to this end?
That's currently how the top docs are collected in Lucene (these
separate classes were added later, and Solr doesn't currently use
them).
SolrIndexSearcher.getDocListNC() is the lowest level of doc collection
that would need to be modified or duplicated.
> Side note: Over here, beside Solr, we also use the "FAST" search platform and
> they call this feature "Field collapsing":
> <http://www.fastsearch.com/glossary.aspx?m=48&amid=299>
> I like the syntax they use:
> "&collapseon=<fieldname>&collapsenum=N" -> Collapse, but keep N number of
> collapsed documents
> For some reason they can only collapse on numeric fields (int32).
Cool, thanks for the reference.
There are still some things underspecified though.
Let's take an example of collapseon=site, collapsenum=2
The list of un-collapsed matches and their relevancy scores (sort order) is:
doc=51, site=A, score=100
doc=52, site=B, score=90
doc=53, site=C, score=80
doc=54, site=B, score=70
doc=55, site=D, score=60
doc=56, site=E, score=50
doc=57, site=B, score=40
doc=58, site=A, score=30
1) If I ask for the top 4 docs, should I get [51,52,53,54] or
[51,52,54,53]? Are lower ranking docs moved up in the rankings to be
in their higher ranking "group"?
2) If I ask for the top 3 docs, should I get [51,52,53] because those
are the top 3 scoring docs, or should I get [51,58,52] because
documents were first grouped and then ranked (and 51 and 58 go
together)? Another way of asking this is related to (1): should docs
outside the "window" be moved up in the rankings to be in their higher
ranking "group"?
3) Should the number of documents in a "group" change the relevancy?
Should site=B rank higher than site=A?
4) Is the collapsing applied across the whole result set, or just within a
page of results? If I ask for docs 4 through 7, should doc 57 be in
that list or not?
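To make the alternatives in (1) and (2) concrete, here is a small Python sketch that reproduces the orderings above from the example data. The function names are invented for this illustration, and none of this reflects how FAST or Solr actually behave.

```python
from collections import defaultdict

# (doc, site, score) for the un-collapsed matches listed above.
MATCHES = [
    (51, "A", 100), (52, "B", 90), (53, "C", 80), (54, "B", 70),
    (55, "D", 60), (56, "E", 50), (57, "B", 40), (58, "A", 30),
]

def collapse(matches, per_group):
    """Keep at most `per_group` docs per site, in score order."""
    kept = defaultdict(int)
    out = []
    for doc, site, score in sorted(matches, key=lambda m: -m[2]):
        if kept[site] < per_group:
            kept[site] += 1
            out.append((doc, site, score))
    return out

def _pull_up(matches):
    """Move each doc up next to the best-ranked doc of its site."""
    groups = {}  # dict order follows each site's first (best) doc
    for doc, site, _ in matches:
        groups.setdefault(site, []).append(doc)
    return [doc for members in groups.values() for doc in members]

def flat(matches, per_group, rows):
    """Collapse, keep pure score order, then cut the window."""
    return [doc for doc, _, _ in collapse(matches, per_group)[:rows]]

def grouped_in_window(matches, per_group, rows):
    """Cut the window first, then group only the docs inside it."""
    return _pull_up(collapse(matches, per_group)[:rows])

def grouped_globally(matches, per_group, rows):
    """Group all collapsed docs first, then cut the window."""
    return _pull_up(collapse(matches, per_group))[:rows]
```

With collapsenum=2, flat(MATCHES, 2, 4) gives [51, 52, 53, 54] and grouped_in_window(MATCHES, 2, 4) gives [51, 52, 54, 53], which are the two answers in (1). For the top 3, flat gives [51, 52, 53] and grouped_globally gives [51, 58, 52], the two answers in (2).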
Defining things to make sense while retaining the ability to page
through the results seems to be the challenge.
-Yonik