Customizing Field Score (Multivalued Field)

We've had a customized score calculator for one of our fields that we developed when we were using Lucene instead of Solr (Lucene 2.4). During our switch to Solr, we simply continued to use that code. However, as the version of Lucene used in Solr changed to 2.9, somewhere along the way (unfortunately during our last release of code), that customization broke. I'd previously tried to keep it up to date by dealing with deprecation warnings, but managed to break things. Now I'm pretty lost with regard to that code.

Our customization is conceptually pretty simple, so rather than try to fix up our code, I'd like some advice on the best way to implement this with Solr 1.4, starting fresh.

We have a multi-valued field, where each value is basically the id of a category. Along with the id, there's a score for how well the document fits into that category (between 0.0 and 1.0). I'm looking for that category score to affect the score of documents when searching on that field. Any suggestions on the best way to attack this in Solr 1.4?

Here's how we did it in Lucene: we had an extension of Query, with a custom scorer. In the index we stored the category ids as a single-valued, space-separated string. We also stored a space-separated string of scores in another field. We made both of these fields stored. We simply delegated the search to the normal searcher; then, when we calculated the score, we retrieved the values of both fields for the document. We turned the space-separated strings into arrays, searched the id array for the index of the desired id, read the score at that same index in the score array, and returned it.

--
Stephen Duncan Jr
www.stephenduncanjr.com

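The lookup described in the post above can be sketched roughly as follows. This is only an illustration, not the original code: the class name, the method name, and the choice to return 0 when the category is missing are all assumptions. It shows why the approach is costly, since both stored strings must be split and scanned for every matching document.

```java
import java.util.Arrays;

/** Hypothetical helper; names are illustrative, not taken from the original code. */
final class ParallelFieldLookup {

    /**
     * Finds categoryId in the space-separated id string and returns the score
     * at the same position in the space-separated score string.
     * Returning 0 for a missing category is an assumption of this sketch.
     */
    static float scoreFor(String storedIds, String storedScores, String categoryId) {
        String[] ids = storedIds.trim().split("\\s+");
        String[] scores = storedScores.trim().split("\\s+");
        int index = Arrays.asList(ids).indexOf(categoryId);
        if (index < 0 || index >= scores.length) {
            return 0f;
        }
        return Float.parseFloat(scores[index]);
    }

    public static void main(String[] args) {
        // Document is in categories 12, 7 and 31; look up the score for category 7.
        System.out.println(scoreFor("12 7 31", "0.4 0.9 0.1", "7"));
    }
}
```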
Re: Customizing Field Score (Multivalued Field)

: Here's how we did it in Lucene: we had an extension of Query, with a custom
: scorer. In the index we stored the category ids as a single-valued,
: space-separated string. We also stored a space-separated string of scores
: in another field. We made both of these fields stored. We simply delegated the
: search to the normal searcher; then, when we calculated the score, we retrieved the
: values of both fields for the document. We turned the space-separated
: strings into arrays, searched the id array for the index of the desired id,
: read the score at that same index in the score array, and returned it.

oh man, so you were parsing the Stored field values of every matching doc at query time? ouch.

Assuming i'm understanding your goal, the conventional way to solve this type of problem is "payloads" ... you'll find lots of discussion on it in the various Lucene mailing lists, and if you look online Michael Busch has various slides that talk about using them. they let you say things like "in this document, at this position of field 'x' the word 'microsoft' is worth 37.4, but at this other position (or in this other document) 'microsoft' is only worth 17.2"

The simplest way to use them in Solr (as i understand it) is to use something like the DelimitedPayloadTokenFilterFactory when indexing, and then write yourself a simple little custom QParser that generates a BoostingTermQuery on your field.

should be a lot simpler to implement than the Query you are describing, and much faster.

-Hoss

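To make the indexing side of that suggestion concrete, here is a minimal standalone sketch of what a delimited-payload filter does conceptually: each whitespace-separated token such as `7|0.9` is split into a term (`7`) and a float payload (`0.9`). It does not use the Lucene or Solr classes; the class and method names are hypothetical, and the 4-byte big-endian float layout is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a delimited payload filter; not the Solr class. */
final class DelimitedPayloadSketch {

    /** Encodes a float as 4 bytes, most significant byte first (assumed layout). */
    static byte[] encodeFloat(float value) {
        int bits = Float.floatToIntBits(value);
        return new byte[] {
            (byte) (bits >> 24), (byte) (bits >> 16), (byte) (bits >> 8), (byte) bits
        };
    }

    /**
     * Splits a field value on whitespace, then splits each token at the
     * delimiter into a term and an encoded payload. Tokens without a
     * delimiter get a null payload.
     */
    static Map<String, byte[]> analyze(String fieldValue, char delimiter) {
        Map<String, byte[]> payloads = new LinkedHashMap<>();
        for (String token : fieldValue.trim().split("\\s+")) {
            int split = token.indexOf(delimiter);
            if (split < 0) {
                payloads.put(token, null);
            } else {
                float score = Float.parseFloat(token.substring(split + 1));
                payloads.put(token.substring(0, split), encodeFloat(score));
            }
        }
        return payloads;
    }

    public static void main(String[] args) {
        // One indexed value holding three category ids with their scores.
        Map<String, byte[]> payloads = analyze("12|0.4 7|0.9 31|0.1", '|');
        System.out.println(payloads.keySet());
    }
}
```

With this layout the category id and its score travel together in the index, so nothing has to be parsed from stored fields at query time.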
Re: Customizing Field Score (Multivalued Field)

On Thu, Nov 12, 2009 at 2:54 PM, Chris Hostetter <hossman_lucene@...> wrote:

> oh man, so you were parsing the Stored field values of every matching doc
> at query time? ouch.
>
> Assuming i'm understanding your goal, the conventional way to solve this
> type of problem is "payloads" ... you'll find lots of discussion on it in
> the various Lucene mailing lists, and if you look online Michael Busch has
> various slides that talk about using them. they let you say things
> like "in this document, at this position of field 'x' the word 'microsoft'
> is worth 37.4, but at this other position (or in this other document)
> 'microsoft' is only worth 17.2"
>
> The simplest way to use them in Solr (as i understand it) is to use
> something like the DelimitedPayloadTokenFilterFactory when indexing, and
> then write yourself a simple little custom QParser that generates a
> BoostingTermQuery on your field.
>
> should be a lot simpler to implement than the Query you are describing,
> and much faster.
>
> -Hoss

Thanks. I finally got around to looking at this again today and was looking at a similar path, so I appreciate the confirmation.

--
Stephen Duncan Jr
www.stephenduncanjr.com

Re: Customizing Field Score (Multivalued Field)

On Thu, Nov 12, 2009 at 3:00 PM, Stephen Duncan Jr <stephen.duncan@...> wrote:

> Thanks. I finally got around to looking at this again today and was looking
> at a similar path, so I appreciate the confirmation.

For posterity, here's the rest of what I discovered trying to implement this:

You'll need to write a PayloadSimilarity as described here: http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ (here's my updated version, due to deprecation of the method mentioned in that article):

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        // can ignore length here, because we know it is encoded as 4 bytes
        return PayloadHelper.decodeFloat(payload, offset);
    }

You'll need to register that similarity in your Solr schema.xml. (This was hard to figure out: I didn't realize that the similarity has to be applied globally to the writer/searcher, even though I only care about payloads on one field, so I wasted time trying to figure out how to plug the similarity into my query parser.)

You'll want to use the "payloads" type that's in the example schema.xml, or something based on it.

The latest and greatest query type to use is PayloadTermQuery. I use it in my custom query parser class, overriding getFieldQuery, checking for my field name, and then:

    return new PayloadTermQuery(new Term(field, queryText), new AveragePayloadFunction());

Due to the global nature of the Similarity, I guess you'd have to modify it to look at the field name and base its behavior on that if you wanted different kinds of payloads on different fields in one schema.

Also, in my original implementation I controlled the score completely, so if I set a score of 0.8, the doc came back with a score of 0.8. In this technique the payload is just used as a boost/addition to the score, so my scores came out higher than before. Since they're still in the same relative order, that still satisfied my needs, but it did require updating my test cases.

--
Stephen Duncan Jr
www.stephenduncanjr.com

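The query-time side described in this post can be sketched without the Lucene classes as follows. This is a standalone illustration under stated assumptions: the class name, the field name `category`, and the 4-byte big-endian float layout are hypothetical, and the methods only mirror the roles of `scorePayload` and an averaging payload function. It also shows the field-name check suggested above for schemas where only one field carries payloads.

```java
/** Hypothetical standalone sketch; it mirrors the roles of the Lucene classes, not their API. */
final class PayloadScoringSketch {

    /** Assumed name of the only field that carries float payloads. */
    static final String PAYLOAD_FIELD = "category";

    /** Decodes 4 bytes starting at offset, most significant byte first (assumed layout). */
    static float decodeFloat(byte[] payload, int offset) {
        int bits = ((payload[offset] & 0xFF) << 24)
                | ((payload[offset + 1] & 0xFF) << 16)
                | ((payload[offset + 2] & 0xFF) << 8)
                | (payload[offset + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    /**
     * Field-aware payload score: decode the payload only for the payload
     * field, and return a neutral factor of 1 for everything else.
     */
    static float scorePayload(String fieldName, byte[] payload, int offset) {
        if (!PAYLOAD_FIELD.equals(fieldName) || payload == null) {
            return 1f;
        }
        return decodeFloat(payload, offset);
    }

    /** Averages the payload scores seen for one document; 1 if none were seen. */
    static float averagePayloadScore(float[] payloadScores) {
        if (payloadScores.length == 0) {
            return 1f;
        }
        float sum = 0f;
        for (float score : payloadScores) {
            sum += score;
        }
        return sum / payloadScores.length;
    }

    public static void main(String[] args) {
        byte[] payload = java.nio.ByteBuffer.allocate(4).putFloat(0.8f).array();
        System.out.println(scorePayload("category", payload, 0));
        System.out.println(scorePayload("title", payload, 0));
    }
}
```

Because the averaged payload acts as one factor in the final score rather than replacing it, absolute scores will differ from a scheme that returns the payload value directly, which matches the behavior reported above.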