tf*idf scoring

Hello list,

I have a question about Lucene's calculation of tf*idf value. I first noticed that Solr's tf does not compare to tf values based on calculation elsewhere such as http://odin.himinbi.org/idf_to_item:item/comparing_tf%3Aidf_to_item%3Aitem_similarity.xhtml or http://en.wikipedia.org/wiki/Tf%E2%80%93idf

The tf values returned by Solr are always integers and not normalized against the length of the corpus whilst the field in which it resides does not have omitNorms="true".

Consider the following documents where the field subject is of the standard text_ws type:

<result name="response" numFound="6" start="0">
  <doc>
    <str name="subject">a b c</str>
  </doc>
  <doc>
    <str name="subject">d e f</str>
  </doc>
  <doc>
    <str name="subject">x y z</str>
  </doc>
  <doc>
    <str name="subject">a d x</str>
  </doc>
  <doc>
    <str name="subject">a e z</str>
  </doc>
  <doc>
    <str name="subject">c f z</str>
  </doc>
</result>

Now, Solr's TermVector results for the first document:

<lst name="doc-0">
  <str name="uniqueKey">0</str>
  <lst name="subject">
    <lst name="a">
      <int name="tf">1</int>
      <lst name="positions">
        <int name="position">0</int>
      </lst>
      <int name="df">3</int>
      <double name="tf-idf">0.3333333333333333</double>
    </lst>
    <lst name="b">
      <int name="tf">1</int>
      <lst name="positions">
        <int name="position">1</int>
      </lst>
      <int name="df">1</int>
      <double name="tf-idf">1.0</double>
    </lst>
    <lst name="c">
      <int name="tf">1</int>
      <lst name="positions">
        <int name="position">2</int>
      </lst>
      <int name="df">2</int>
      <double name="tf-idf">0.5</double>
    </lst>
  </lst>
</lst>

According to different algorithms, the tf for term c would be 3 / 1 = 0.33 instead of 1 returned by Solr. Also, the tf*idf value i get is 0.5 for term c and i get 0.333 for term a. It looks like tf*idf is quotient of document frequency and term frequency.

If i calculate tf*idf, for term c in the first document, according to other algorithms it would be:

tf = 3 / 1 = 0.333
idf = ln(6 / 3) = 1.0986
tf*idf = 0.333 * 1.0986 = 0.3658

Can someone explain either the difference demonstrated or tell me what i am possibly doing wrong?

Cheers,

-
Markus Jelsma          Buyways B.V.
Technisch Architect    Friesestraatweg 215c
http://www.buyways.nl  9743 AD Groningen

Alg. 050-853 6600      KvK  01074105
Tel. 050-853 6620      Fax. 050-3118124
Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
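
The two calculations being compared in this message can be reproduced side by side. The following is a minimal sketch in Python, not Solr or Lucene code: the function names `raw_tf_df` and `textbook_tf_idf` are made up for illustration, and the documents are split on whitespace on the assumption that this is what the text_ws field type does. Note that with df = 2 for term c, the figure 1.0986 corresponds to ln(6 / 2), and the unrounded product is about 0.3662 rather than 0.3658.

```python
import math
from collections import Counter

# The six "subject" values from the example, split on whitespace
# (assumed to match what the text_ws field type does).
docs = ["a b c", "d e f", "x y z", "a d x", "a e z", "c f z"]
tokenized = [d.split() for d in docs]

# Document frequency: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def raw_tf_df(term, tokens):
    """Raw occurrence count divided by df, as in the TermVector output."""
    return tokens.count(term) / df[term]


def textbook_tf_idf(term, tokens, num_docs):
    """(occurrences / document length) * ln(num_docs / df)."""
    tf = tokens.count(term) / len(tokens)
    return tf * math.log(num_docs / df[term])


first = tokenized[0]
for term in first:
    print(term, df[term], raw_tf_df(term, first),
          round(textbook_tf_idf(term, first, len(docs)), 4))
```

For the first document the raw quotient gives 0.333, 1.0 and 0.5 for a, b and c, which matches the tf-idf values in the TermVector response, while the normalized variant gives a different ranking scale altogether.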

Re: tf*idf scoring

Inline below

On Nov 3, 2009, at 2:30 AM, Markus Jelsma - Buyways B.V. wrote:

> Hello list,
>
>
> I have a question about Lucene's calculation of tf*idf value. I first
> noticed that Solr's tf does not compare to tf values based on
> calculation elsewhere such as
> http://odin.himinbi.org/idf_to_item:item/comparing_tf%3Aidf_to_item%
> 3Aitem_similarity.xhtml or http://en.wikipedia.org/wiki/Tf%E2%80%93idf
>
> The tf values returned by Solr are always integers and not normalized
> against the length of the corpus whilst the field in which it resides
> does not have omitNorms="true".
>
> Consider the following documents where the field subject is of the
> standard text_ws type:
>
> <result name="response" numFound="6" start="0">
>   <doc>
>     <str name="subject">a b c</str>
>   </doc>
>   <doc>
>     <str name="subject">d e f</str>
>   </doc>
>   <doc>
>     <str name="subject">x y z</str>
>   </doc>
>   <doc>
>     <str name="subject">a d x</str>
>   </doc>
>   <doc>
>     <str name="subject">a e z</str>
>   </doc>
>   <doc>
>     <str name="subject">c f z</str>
>   </doc>
> </result>
>
> Now, Solr's TermVector results for the first document:
>
> <lst name="doc-0">
>   <str name="uniqueKey">0</str>
>   <lst name="subject">
>     <lst name="a">
>       <int name="tf">1</int>
>       <lst name="positions">
>         <int name="position">0</int>
>       </lst>
>       <int name="df">3</int>
>       <double name="tf-idf">0.3333333333333333</double>
>     </lst>
>     <lst name="b">
>       <int name="tf">1</int>
>       <lst name="positions">
>         <int name="position">1</int>
>       </lst>
>       <int name="df">1</int>
>       <double name="tf-idf">1.0</double>
>     </lst>
>     <lst name="c">
>       <int name="tf">1</int>
>       <lst name="positions">
>         <int name="position">2</int>
>       </lst>
>       <int name="df">2</int>
>       <double name="tf-idf">0.5</double>
>     </lst>
>   </lst>
> </lst>
>
>
> According to different algorithms, the tf for term c would be 3 / 1 =
> 0.33 instead of 1 returned by Solr.

I don't follow. The TF (term frequency) is the number of times the term c occurs in that particular document, i.e. 1 time.

> Also, the tf*idf value i get is 0.5
> for term c and i get 0.333 for term a. It looks like tf*idf is quotient
> of document frequency and term frequency.

Yes, indeed. IDF == Inverse Document Frequency, in other words, 1/DF.

>
> If i calculate tf*idf, for term c in the first document, according to
> other algorithms it would be:
>
> tf = 3 / 1 = 0.333

3/1 = 3, no? I don't see where in your docs above you could even get a 3 for the letter c.

> idf = ln(6 / 3) = 1.0986
> tf*idf = 0.333 * 1.0986 = 0.3658
>

I think the formulas you are looking at are doing operations to normalize the values, whereas the Solr/Lucene stuff above is telling you their raw values. Note, Lucene/Solr does length normalization, etc. too, it just isn't encoded into the TF or DF. For more on Lucene's scoring, see http://lucene.apache.org/java/2_9_0/scoring.html

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search

Re: tf*idf scoring

> > According to different algorithms, the tf for term c would be 3 / 1 =
> > 0.33 instead of 1 returned by Solr.
>
> I don't follow. The TF (term frequency) is the number of times the
> term c occurs in that particular document, i.e. 1 time.

I see that above, and below, I made some typos. I wrote 3 / 1 = 0.3 instead of 1 / 3 = 0.33. Term c has a #occurrences of 1, which the other algorithms normalize by dividing by the number of terms. So instead of tf = #occurrences (1), other algorithms do tf = #occurrences / #terms (0.33).

> > Also, the tf*idf value i get is 0.5
> > for term c and i get 0.333 for term a. It looks like tf*idf is quotient
> > of document frequency and term frequency.
>
> Yes, indeed. IDF == Inverse Document Frequency, in other words, 1/DF.

Indeed, but most algorithms I have seen on this topic calculate idf by ln(#docs / df). This is also true for Lucene, as I read in http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/Similarity.html

idf(t) = 1 + log(numDocs / (df + 1))

> > If i calculate tf*idf, for term c in the first document, according to
> > other algorithms it would be:
> >
> > tf = 3 / 1 = 0.333
>
> 3/1 = 3, no? I don't see where in your docs above you could even get
> a 3 for the letter c.

Here's the other typo: I again wrote 3 / 1 = 0.33, which should have been 1 / 3 = 0.33, of course. The differences I see are:

tf (solr) = #occurrences_of_term_T in document_D
tf (other) = #occurrences_of_term_T in document_D / #terms_document_D

df (solr) = #occurrences_of_term_T in all_documents
df (other) = #occurrences_of_term_T in all_documents

idf (solr) = tf / df
idf (other) = ln(#documents / df)

tf*idf (solr) = tf / df
tf*idf (other) = tf * idf

> > idf = ln(6 / 3) = 1.0986
> > tf*idf = 0.333 * 1.0986 = 0.3658
> >
>
> I think the formulas you are looking at are doing operations to
> normalize the values, whereas the Solr/Lucene stuff above is telling
> you their raw values. Note, Lucene/Solr does length normalization,
> etc. too, it just isn't encoded into the TF or DF. For more on
> Lucene's scoring, see http://lucene.apache.org/java/2_9_0/scoring.html
>

I see, but why not return the true values of Lucene? I did not reconfigure Solr's schema to use another algorithm for similarity, and the above Lucene similarity docs state that DefaultSimilarity uses calculations similar to the ones I have.

> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
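
For reference, the per-term factors that the Similarity Javadoc linked above describes for DefaultSimilarity can be written out as a small Python sketch. This is a simplified transcription from memory of the Lucene 2.9 documentation, not Lucene code: the function names are illustrative, and the sketch leaves out everything else that goes into a real score (query normalization, coord, boosts, and the lossy single-byte encoding of norms).

```python
import math


def tf(freq):
    """Term-frequency factor: square root of the raw occurrence count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Field length normalization: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


# Term c in the first example document: tf = 1, df = 2, 6 documents, 3 terms.
print(tf(1), round(idf(2, 6), 4), round(length_norm(3), 4))
```

None of these factors appear in the TermVector output quoted in this thread, which reports the raw tf and df counts only.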

Re: tf*idf scoring

On Nov 3, 2009, at 5:54 AM, Markus Jelsma - Buyways B.V. wrote:

> I see, but why not return the true values of Lucene?

I'm not sure what you mean by this. The TVC returns the term frequency and the document frequency and TF/DF as reported by Lucene. The actual raw values. What you are asking for is for the TVC to return some other normalized values above and beyond the literal interpretation TF/IDF. This can be done, it's not particularly hard, but it will require a patch or you can just do it in your application. I personally don't think the TVC should do it b/c there are other calculations/interpretations that one might do beyond/besides what you propose, so I'd rather just give back the raw data and let the user decide.
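
The "do it in your application" route could look roughly like the sketch below, which reads the raw tf and df values out of a TermVector response and derives a normalized tf*idf on the client. It is an illustration under several assumptions: the helper name `normalized_tf_idf` is hypothetical, the sample XML is the response from earlier in the thread with the positions and tf-idf entries trimmed, the document length is taken to be the sum of the tf values in the field (which only holds if the term vector lists every term), and the total document count of 6 is supplied by the caller.

```python
import math
import xml.etree.ElementTree as ET

# TermVector response for the first document, trimmed to tf and df only.
TVC_XML = """<lst name="doc-0">
  <str name="uniqueKey">0</str>
  <lst name="subject">
    <lst name="a"><int name="tf">1</int><int name="df">3</int></lst>
    <lst name="b"><int name="tf">1</int><int name="df">1</int></lst>
    <lst name="c"><int name="tf">1</int><int name="df">2</int></lst>
  </lst>
</lst>"""


def normalized_tf_idf(doc_xml, field, num_docs):
    """Return {term: (tf / doc_length) * ln(num_docs / df)} for one field."""
    doc = ET.fromstring(doc_xml)
    field_node = doc.find(f"lst[@name='{field}']")
    # Collect the raw (tf, df) pair reported for each term of the field.
    raw = {
        term.get("name"): (
            int(term.find("int[@name='tf']").text),
            int(term.find("int[@name='df']").text),
        )
        for term in field_node.findall("lst")
    }
    # Assumption: the field's length is the sum of its term frequencies.
    doc_length = sum(tf for tf, _ in raw.values())
    return {
        term: (tf / doc_length) * math.log(num_docs / df)
        for term, (tf, df) in raw.items()
    }


scores = normalized_tf_idf(TVC_XML, "subject", num_docs=6)
for term, score in scores.items():
    print(term, round(score, 4))
```

Keeping the normalization on the client side leaves the choice of formula open, which is the point made above about returning the raw data and letting the user decide.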

Re: tf*idf scoring

Thank you for your explanation

On Tue, 2009-11-03 at 07:32 -0800, Grant Ingersoll wrote:

> On Nov 3, 2009, at 5:54 AM, Markus Jelsma - Buyways B.V. wrote:
> >
> > I see, but why not return the true values of Lucene?
>
> I'm not sure what you mean by this. The TVC returns the term
> frequency and the document frequency and TF/DF as reported by
> Lucene. The actual raw values. What you are asking for is for the
> TVC to return some other normalized values above and beyond the
> literal interpretation TF/IDF. This can be done, it's not
> particularly hard, but it will require a patch or you can just do it
> in your application. I personally don't think the TVC should do it
> b/c there are other calculations/interpretations that one might do
> beyond/besides what you propose, so I'd rather just give back the
> raw data and let the user decide.