result grouping?

Is it possible to group the results from a Solr query? I have indexed the content from many web pages on many sites. I'd like to return only two results from each site.

schema.xml:

<field name="uri" type="string" indexed="true" stored="true"/>
<field name="site" type="string" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>

For example:

uri: http://en.wikipedia.org/wiki/James_Madison
site: wikipedia.org

How do I get results grouped by site?

Is this possible with the standard query? The website lists: "Support for Dynamic Result Grouping and Filtering." Is it referring to faceted browsing or this?

If it's not supported off the shelf, what is the best way to implement result grouping?

thanks
ryan
|
|
Re: result grouping?

Hi,

I don't know if Solr can manage grouping, but you can do it using an XSLT stylesheet:

http://www.jenitennison.com/xslt/grouping/muenchian.html

Hope it helps :)

On 1/2/07, Ryan McKinley <ryantxu@...> wrote:
> Is it possible to group the results from a solr query? I have indexed
> the content from many web pages on many sites. I'd like to return
> only two results from each site.
>
> schema.xml:
>
> <field name="uri" type="string" indexed="true" stored="true"/>
> <field name="site" type="string" indexed="true" stored="true"/>
> <field name="content" type="text" indexed="true" stored="true"/>
>
> for example
> uri: http://en.wikipedia.org/wiki/James_Madison
> site: wikipedia.org
>
> How do i get results grouped by site?
>
> Is this possible with the standard query? The website lists: "Support
> for Dynamic Result Grouping and Filtering." Is it referring to
> faceted browsing or this?
>
> If its not supported off the shelf, what is the best way to implement
> result grouping?
>
> thanks
> ryan

--
Salut,
====================================
Ricardo Borillo Domenech
Analista/Programador - Servei d'Informàtica
Universitat Jaume I
http://xml-utils.com
|
|
Re: result grouping?

Thanks. Yes, the presentation layer could group results, but that is not practical if I want to show the first 20 results out of 200,000 matches.

Nutch groups the results by site. Any idea how they do it?

thanks
ryan

On 1/3/07, Ricardo Borillo <borillo@...> wrote:
> Hi,
>
> I don't know if solr can manage grouping. But you can do it using an XSLT
> stylesheet:
>
> http://www.jenitennison.com/xslt/grouping/muenchian.html
>
> Hope it helps :)
|
|
Re: result grouping?

On 1/3/07, Ryan McKinley <ryantxu@...> wrote:
> thanks. Yes, the presentation layer could group results, but that is
> not practical if i want to show the first 20 results out of 200,000
> matches.
>
> Nutch groups the results by site. Any idea how they do it?

Good question.

Off the top of my head, one could use a priority queue that can change its size dynamically. One could increment a group count for each hit (like faceted search with the FieldCache), and if the group count exceeds "n", increment the size of the priority queue so that an additional item is collected to compensate.

-Yonik
|
|
Re: result grouping?

Yonik Seeley wrote:
> On 1/3/07, Ryan McKinley <ryantxu@...> wrote:
>> thanks. Yes, the presentation layer could group results, but that is
>> not practical if i want to show the first 20 results out of 200,000
>> matches.
>>
>> Nutch groups the results by site. Any idea how they do it?
>
> Good question.
> Off the top of my head, one could use a priority queue that can change
> its size dynamically. One could increment a group count for each hit
> (like faceted search with the FieldCache) and if the group count
> exceeds "n", then you increment the size of the priority queue to
> allow an additional item to be collected to compensate.
>
> -Yonik

You might as well say that I have to change the dilithium crystals in the flux capacitor :-)

One of the reasons I like Solr so much is that I get impressive results without having to know Lucene. That will have to change, because I also need this feature.

Not knowing much about the internals of Solr/Lucene, I had a look at the facet code in search of ideas. From what I could see, the facet counts are calculated after the Documents are added to the response, so it seems to me that any kind of grouping has to be done before that... right?

Could you explain in more detail where I should look?

Can the TopFieldDocCollector/TopFieldDocs classes be used to this end?

I'm immersing myself in Lucene, but it will take some time.

Side note: over here, besides Solr, we also use the "FAST" search platform, and they call this feature "Field collapsing":
<http://www.fastsearch.com/glossary.aspx?m=48&amid=299>

I like the syntax they use:
"&collapseon=<fieldname>&collapsenum=N" -> Collapse, but keep N number of collapsed documents

For some reason they can only collapse on numeric fields (int32).

Regards,
Luis Neves
|
|
Re: result grouping?

On 1/4/07, Luis Neves <luis.neves@...> wrote:
> Yonik Seeley wrote:
> > Off the top of my head, one could use a priority queue that can change
> > its size dynamically. One could increment a group count for each hit
> > (like faceted search with the FieldCache) and if the group count
> > exceeds "n", then you increment the size of the priority queue to
> > allow an additional item to be collected to compensate.
> >
> > -Yonik
>
> You might as well say that I have to change the dilithium crystals in the flux
> capacitor :-)

Heh...

When someone asks for the top 10 documents, we create a priority queue of size 10 and put all of the hits through it (with a performance shortcut if the only sort is by score). After we are all done, the queue contains the top 10 documents by the sort criteria.

Now let's say we are limiting the number of results from any "site" to 2. If we add another document to the priority queue and it will be the 3rd from a specific site, there are two things we could do:

1) remove the lowest ranking document from the 3 documents matching that site
2) increase the size of the priority queue to 11, since we will be throwing one of the documents away later.

At first blush, option (2) seemed easier to me, with the added step of discarding the extra documents as you pull them from the queue.

> One of the reasons I like Solr so much is because I get impressive results
> without having to know Lucene, which is something that will have to change
> because I also need this feature.
>
> Not knowing much about the internal of Solr/Lucene I had a look at the Facet
> code in search of ideas, but from what I could see the facet counts are
> calculated after the Documents are added to the response, it seems to me that
> any kind of grouping has to be done before that... right?

Right.

> Could you explain in more detail where should I look?
>
> Can the TopFieldDocCollector/TopFieldDocs classes be used to this end?

That's currently how the top docs are collected in Lucene (these separate classes were added later, and Solr doesn't currently use them). SolrIndexSearcher.getDocListNC() is the lowest level of doc collection that would need to be modified or duplicated.

> Side note: Over here, beside Solr, we also use the "FAST" search platform and
> they call this feature "Field collapsing":
> <http://www.fastsearch.com/glossary.aspx?m=48&amid=299>
> I like the syntax they use:
> "&collapseon=<fieldname>&collapsenum=N" -> Collapse, but keep N number of
> collapsed documents
> For some reason they can only collapse on numeric fields (int32).

Cool, thanks for the reference.

There are still some things underspecified though.

Let's take an example of collapseon=site, collapsenum=2

The list of un-collapsed matches and their relevancy scores (sort order) is:

doc=51, site=A, score=100
doc=52, site=B, score=90
doc=53, site=C, score=80
doc=54, site=B, score=70
doc=55, site=D, score=60
doc=56, site=E, score=50
doc=57, site=B, score=40
doc=58, site=A, score=30

1) If I ask for the top 4 docs, should I get [51,52,53,54] or [51,52,54,53]? Are lower ranking docs moved up in the rankings to be in their higher ranking "group"?

2) If I ask for the top 3 docs, should I get [51,52,53] because those are the top 3 scoring docs, or should I get [51,58,52] because documents were first grouped and then ranked (and 51 and 58 go together)? Another way of asking this is related to (1): should docs outside the "window" be moved up in the rankings to be in their higher ranking "group"?

3) Should the number of documents in a "group" change the relevancy? Should site=B rank higher than site=A?

4) Is the collapsing applied across the entire result list, or just within a page of results? If I ask for docs 4 through 7, should doc 57 be in that list or not?

Defining things to make sense while retaining the ability to page through the results seems to be the challenge.

-Yonik
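As a rough illustration of option (2) from the message above, here is a minimal Python sketch (Solr itself is Java, and nothing here is Solr or Lucene code). The names `collapse_top_k`, `HITS`, `k` and `per_group` are made up for the example, and scores are assumed to be distinct. The queue capacity grows by one each time a hit pushes its site's count past the limit, and the surplus documents are discarded while draining the queue. It implements the "keep the original rank order" reading of question 1, so it is only one of the possible behaviours discussed in the thread.

```python
import heapq
from collections import Counter

# The example list from the message above: (doc id, site, score).
HITS = [
    (51, "A", 100), (52, "B", 90), (53, "C", 80), (54, "B", 70),
    (55, "D", 60), (56, "E", 50), (57, "B", 40), (58, "A", 30),
]


def collapse_top_k(hits, k, per_group):
    """Return the ids of the top k docs, keeping at most per_group per site."""
    capacity = k          # current size limit of the priority queue
    seen = Counter()      # hits seen so far for each site
    heap = []             # min-heap keyed on score; lowest score on top

    for doc_id, site, score in hits:
        seen[site] += 1
        if seen[site] > per_group:
            # One doc of this site will be thrown away later,
            # so allow one extra slot to compensate.
            capacity += 1
        heapq.heappush(heap, (score, doc_id, site))
        if len(heap) > capacity:
            heapq.heappop(heap)  # drop the lowest-scoring doc

    kept = Counter()
    result = []
    # Drain from the highest score down, discarding the surplus per site.
    for score, doc_id, site in sorted(heap, reverse=True):
        if kept[site] < per_group:
            kept[site] += 1
            result.append(doc_id)
            if len(result) == k:
                break
    return result


print(collapse_top_k(HITS, k=4, per_group=2))
print(collapse_top_k(HITS, k=7, per_group=2))
```

With the example data, asking for 7 docs skips doc 57 (the third hit from site B) and lets doc 58 in. The capacity grows on every surplus hit, including low-scoring ones, so the queue can become much larger than `k` when one site dominates the matches.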
|
|
Re: result grouping?

Yonik Seeley wrote:
> There are still some things underspecified though.
>
> Let's take an example of collapseon=site, collapsenum=2
>
> The list of un-collapsed matches and their relevancy scores (sort order)
> is:
> doc=51, site=A, score=100
> doc=52, site=B, score=90
> doc=53, site=C, score=80
> doc=54, site=B, score=70
> doc=55, site=D, score=60
> doc=56, site=E, score=50
> doc=57, site=B, score=40
> doc=58, site=A, score=30
>
> 1) If I ask for the top 4 docs, should I get [51,52,53,54] or
> [51,52,54,53]. Are lower ranking docs moved up in the rankings to be
> in their higher ranking "group"?

The docs move up the ranking. You should get [51,58,52,54] ... or one could make the case that you should get [51,58,52,54,53,55], to get the somewhat equivalent behaviour of a SQL "quota-query". In that case the "top 4" would refer not to the number of documents but to the number of distinct values of the field you are collapsing on.

> 2) If I ask for the top 3 docs, should I get [51,52,53] because those
> are the top 3 scoring docs, or should I get [51,58,52] because
> documents were first groups and then ranked (and 51 and 58 go
> together)? Another way of asking this is related to (1): should docs
> outside the "window" be moved up in the rankings to be in their higher
> ranking "group"?

See above.

> 3) Should the number of documents in a "group" change the relevancy?
> Should site=B rank higher than site=A?

I don't think so... I don't know if that is what *should* be done, but that's not what FAST does.

> 4) Is the collapsing only in the returned results, or just within a
> page of results. If I ask for docs 4 through 7, should doc 57 be in
> that list or not?

With "FAST" that is an option. The default behaviour is to remove the documents from the result set, so 57 would not be on the list, but you can choose not to remove them, and in that case they are presented last.

> Defining things to make sense while retaining the ability to page
> through the results seems to be the challenge.

I'm beginning to think that this is a little too complex for a first project with Lucene. In my particular case all I want is to group results by category (from a predetermined - and small - category list). I think I will just make a request per category and accept the latency.

--
Luis Neves
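The two answers given to question 1 above can be written down as a small Python sketch. This is only an illustration of the semantics described in the message, not FAST's or Solr's implementation. The function names and `HITS` are invented for the example, and everything is done in memory with no attempt at efficiency.

```python
from collections import defaultdict

# The example list from the thread: (doc id, site, score).
HITS = [
    (51, "A", 100), (52, "B", 90), (53, "C", 80), (54, "B", 70),
    (55, "D", 60), (56, "E", 50), (57, "B", 40), (58, "A", 30),
]


def ranked_groups(hits, per_group):
    """Group docs by site, keep the best per_group docs of each site,
    and order the groups by their best-scoring doc."""
    by_site = defaultdict(list)
    for doc_id, site, score in hits:
        by_site[site].append((score, doc_id))
    groups = []
    for members in by_site.values():
        members.sort(reverse=True)           # best score first
        groups.append(members[:per_group])   # collapse the rest away
    groups.sort(key=lambda members: members[0][0], reverse=True)
    return [[doc_id for _score, doc_id in members] for members in groups]


def top_docs(hits, n, per_group):
    """'Top n' counts documents: lower-ranked docs move up into their group."""
    flat = [doc_id for group in ranked_groups(hits, per_group) for doc_id in group]
    return flat[:n]


def top_groups(hits, n, per_group):
    """'Top n' counts distinct field values, like the quota-query reading."""
    groups = ranked_groups(hits, per_group)[:n]
    return [doc_id for group in groups for doc_id in group]


print(top_docs(HITS, n=4, per_group=2))
print(top_groups(HITS, n=4, per_group=2))
```

With collapsenum=2 on the example data, the first call corresponds to the [51,58,52,54] answer and the second to the [51,58,52,54,53,55] answer. Doc 57 is dropped in both cases, matching the default FAST behaviour described for question 4.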
|
|
Re: result grouping?

On 1/4/07, Luis Neves <luis.neves@...> wrote:
> Yonik Seeley wrote:
> One of the reasons I like Solr so much is because I get impressive results
> without having to know Lucene, which is something that will have to change
> because I also need this feature.
<>
> Could you explain in more detail where should I look?
>
> Can the TopFieldDocCollector/TopFieldDocs classes be used to this end?
>
> I'm immersing my self on Lucene but it will take some time.

We use Solr in a nutch-like manner (the index is distributed over a collection of servers, results are merged, and similar documents are collapsed). We have to do the collapsing outside of Solr because of the result combining, but I think it is a viable strategy for a single instance too. Just slightly over-request the desired number of docs, collapse using arbitrary logic, and request more if necessary.

The main disadvantage is that if the user skips ahead several pages, all the intermediate results must be generated.

-Mike
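The client-side strategy described above (over-request, collapse, request more if necessary) might look roughly like the Python sketch below. The `search(start, rows)` callable is a hypothetical stand-in for whatever issues the real query. Here it is faked with an in-memory list, and the "collapse logic" is simply a per-site limit. None of these names come from Solr.

```python
from collections import Counter

# Stand-in data: (doc id, site, score), already sorted by score.
HITS = [
    (51, "A", 100), (52, "B", 90), (53, "C", 80), (54, "B", 70),
    (55, "D", 60), (56, "E", 50), (57, "B", 40), (58, "A", 30),
]


def fake_search(start, rows):
    """Hypothetical search call: returns rows hits beginning at start."""
    return HITS[start:start + rows]


def collapsed_results(search, wanted, per_group, batch):
    """Fetch hits in batches until `wanted` collapsed docs are collected.

    Returns (doc ids, number of raw hits fetched)."""
    kept = []
    counts = Counter()
    fetched = 0
    while len(kept) < wanted:
        hits = search(fetched, batch)
        if not hits:
            break  # ran out of matches
        fetched += len(hits)
        for doc_id, site, _score in hits:
            if counts[site] < per_group:
                counts[site] += 1
                kept.append(doc_id)
                if len(kept) == wanted:
                    break
    return kept, fetched


def collapsed_page(search, start, rows, per_group, batch):
    """A page of collapsed results; everything before `start` is still built."""
    kept, _fetched = collapsed_results(search, start + rows, per_group, batch)
    return kept[start:start + rows]


print(collapsed_results(fake_search, wanted=7, per_group=2, batch=4))
print(collapsed_page(fake_search, start=3, rows=3, per_group=2, batch=4))
```

`collapsed_page` makes the stated disadvantage visible: to serve a later page it has to regenerate and collapse every result that precedes it.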
|
|
Re: result grouping?

On 1/5/07, Luis Neves <luis.neves@...> wrote:
> Yonik Seeley wrote:
>
> > There are still some things underspecified though.
> >
> > Let's take an example of collapseon=site, collapsenum=2
> >
> > The list of un-collapsed matches and their relevancy scores (sort order)
> > is:
> > doc=51, site=A, score=100
> > doc=52, site=B, score=90
> > doc=53, site=C, score=80
> > doc=54, site=B, score=70
> > doc=55, site=D, score=60
> > doc=56, site=E, score=50
> > doc=57, site=B, score=40
> > doc=58, site=A, score=30
> >
> > 1) If I ask for the top 4 docs, should I get [51,52,53,54] or
> > [51,52,54,53]. Are lower ranking docs moved up in the rankings to be
> > in their higher ranking "group"?
>
> The docs move up the ranking.

After thinking on this a little further (since someone submitted a patch), this makes things significantly more expensive.

The issue is that even if you are only interested in the top 10 docs, you can't use the normal priority queue method to discard low scores, because the last document you score could be very high scoring and be in the same group as lower-scoring documents that were previously discarded.

One way is to keep a priority queue per field value (very expensive if there are many field values).

Another way is to use two phases... the first collects the top n documents, and the second grabs

Another issue is how to implement start + offset.

-Yonik
|
|
Re: result grouping?

On 6/4/07, Yonik Seeley <yonik@...> wrote:
> Another way is to use two phases... the first collects the top n
> documents, and the second grabs

... other members of each group in the list of docs to return.

-Yonik
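One possible reading of the two-phase idea, sketched in Python under some assumptions: the hits can be iterated twice (here they are just a list in memory), scores are distinct, and at most `per_group` members are kept for each group. The thread does not say whether the final list should be cut back to n documents, so this sketch returns every kept member of the groups found in phase one. All names are invented for the example.

```python
import heapq

# The example list from the thread: (doc id, site, score).
HITS = [
    (51, "A", 100), (52, "B", 90), (53, "C", 80), (54, "B", 70),
    (55, "D", 60), (56, "E", 50), (57, "B", 40), (58, "A", 30),
]


def two_phase_collapse(hits, n, per_group):
    hits = list(hits)  # the hits are walked twice

    # Phase 1: ordinary top-n collection, ignoring groups entirely.
    top = heapq.nlargest(n, hits, key=lambda hit: hit[2])
    group_order = []
    for _doc_id, site, _score in top:
        if site not in group_order:
            group_order.append(site)

    # Phase 2: grab the other members of the groups found in phase 1,
    # using one small bounded heap per group that made the cut.
    members = {site: [] for site in group_order}
    for doc_id, site, score in hits:
        if site in members:
            heap = members[site]
            heapq.heappush(heap, (score, doc_id))
            if len(heap) > per_group:
                heapq.heappop(heap)  # drop the group's lowest score

    result = []
    for site in group_order:
        best_first = sorted(members[site], reverse=True)
        result.extend(doc_id for _score, doc_id in best_first)
    return result


print(two_phase_collapse(HITS, n=3, per_group=2))
```

Compared with a priority queue per field value, only the groups that appear in the phase-one list get a queue, which bounds the memory by n groups at the price of a second pass over the matches.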