Group by in Lucene ?

Hi.

I have a situation where I'm searching amongst some 100K feeds and only want one result per site in return. I have developed a really simple method of grouping which just scrolls through the result set (hit set) until maxNum docs of feeds from a set of unique sites is populated.

Since I don't want to reinvent the wheel, I want to know if Lucene has something like this built in. I will also be using Solr soon, and then my own home-cooked recipe will not work, so I really need a standard way of doing this.

I know Nutch has something like it called depupField, whose default is set to 2.

Anyone?

Kindly

//Marcus

--
Marcus Herou Solution Architect & Core Java developer Tailsweep AB
+46702561312
marcus.herou@...
http://www.tailsweep.com
|
|
Re: Group by in Lucene ?

Solr has an issue outstanding right now that implements something that may be close to what you want. They are calling it Field Collapsing. See https://issues.apache.org/jira/browse/SOLR-236

-Grant

On Nov 5, 2007, at 12:57 AM, Marcus Herou wrote:

> I have a situation where I'm searching amongst some 100K feeds and
> only want one result per site in return.

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@...
For additional commands, e-mail: java-user-help@...
|
|
Re: Group by in Lucene ?

Thanks. They seem to have got really far in the dev cycle on this. It seems like it will hit the road in Solr 1.3.

However, I would really like this feature to be developed for core Lucene. How do I start that process? "Develop it yourself", you would say :) I'm serious, isn't it a really cool and useful feature?

Kindly

//Marcus

On 11/5/07, Grant Ingersoll <gsingers@...> wrote:

> Solr has an issue outstanding right now that implements something that
> may be close to what you want. They are calling it Field Collapsing.
> See https://issues.apache.org/jira/browse/SOLR-236
|
|
Re: Group by in Lucene ?

On Nov 5, 2007, at 7:49 AM, Marcus Herou wrote:

> However I would really like this feature to be developed for Core Lucene,
> how do I start that process?
> Develop it yourself you would say :) I'm serious isn't it a really cool and
> useful feature ?

We're always open to well-thought-out and tested patches. See the Wiki for info on contributing.

-Grant
|
|
Re: Group by in Lucene ?

Cool.

I'll do that, since this is a field I can spend time in.

Kindly

//Marcus

On 11/5/07, Grant Ingersoll <gsingers@...> wrote:

> We're always open to well-thought out and tested patches. See the
> Wiki for info on contributing.
|
|
Re: Group by in Lucene ?

Hey Marcus,

have you already implemented this feature? I'm looking for a group-by function for Lucene, too.

More precisely, I need it in Compass, which is built on top of Lucene.

I was thinking about using a HitCollector to get only one result per group.

How did you do it?

Cheers,
Nina
|
|
|
Re: Group by in Lucene ?

Hi.

I partly solved this in Solr with faceting, but that does not cover a feature that is quite commonly used in databases:

    num_en_entries = select count(distinct id) from BlogEntry where language='en'
    num_sv_entries = select count(distinct id) from BlogEntry where language='sv'

It does, however, cover this one:

    select count(id), date from BlogEntry group by date

I now need this feature elsewhere, when parsing access logs etc., so I am looking into MonetDB, LucidDB and FastBit. Sphinx search seems to have something like this: http://www.sphinxsearch.com/docs/current.html#clustering

Did you ever try a HitCollector?

//Marcus

On Wed, Dec 5, 2007 at 1:17 PM, ninaS <nina256@...> wrote:

> I was thinking about using a HitCollector to get only one result per group.
>
> How did you do it?

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@...
http://www.tailsweep.com/
http://blogg.tailsweep.com/
|
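To make the difference between the two query shapes in the message above concrete, here is a minimal in-memory sketch in plain Java (no Lucene or Solr involved). The `Entry` class and both method names are hypothetical, introduced only for illustration: `countPerDate` corresponds to the group-by count that faceting provides, while `distinctIds` has to remember every value it has seen, which is the part that faceting alone does not give you.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class GroupVsDistinct {
    /** Hypothetical stand-in for one indexed blog entry. */
    static final class Entry {
        final String id;
        final String language;
        final String date;

        Entry(String id, String language, String date) {
            this.id = id;
            this.language = language;
            this.date = date;
        }
    }

    /** select count(id), date ... group by date: one counter per date value. */
    static Map<String, Integer> countPerDate(List<Entry> entries) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Entry e : entries) {
            counts.merge(e.date, 1, Integer::sum);
        }
        return counts;
    }

    /** select count(distinct id) ... where language = ?: needs the set of ids seen so far. */
    static int distinctIds(List<Entry> entries, String language) {
        Set<String> seen = new HashSet<>();
        for (Entry e : entries) {
            if (e.language.equals(language)) {
                seen.add(e.id);
            }
        }
        return seen.size();
    }
}
```

The counter map grows with the number of distinct dates, whereas the set grows with the number of distinct ids, which is usually the far larger of the two.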
|
Re: Group by in Lucene ?

Hello,

yes, I tried HitCollector, but I am not satisfied with it because you cannot use sorting with HitCollector unless you implement a way to use TopFieldDocCollector. I did not manage to do that in a performant way.

It is easier to first do a normal search and "group by" afterwards:

Iterate through the result documents and take one of each group. Each document has a groupingKey. I remember which groupingKeys have already been used and don't take another document of that group into the result list.

Regards,
Nina
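The post-search grouping described above can be sketched roughly as follows. This is plain Java over an already sorted hit list, not Lucene or Compass code; the `Hit` class, the `firstPerGroup` method and the `maxResults` cut-off are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OnePerGroup {
    /** Hypothetical search hit: a document id plus the value of its grouping field. */
    static final class Hit {
        final int docId;
        final String groupingKey;

        Hit(int docId, String groupingKey) {
            this.docId = docId;
            this.groupingKey = groupingKey;
        }
    }

    /**
     * Walks the hits in their existing sort order and keeps the first hit of
     * each group, stopping once maxResults hits have been collected.
     */
    static List<Hit> firstPerGroup(List<Hit> sortedHits, int maxResults) {
        Set<String> usedKeys = new HashSet<>();
        List<Hit> result = new ArrayList<>();
        for (Hit hit : sortedHits) {
            if (result.size() >= maxResults) {
                break;
            }
            // Set.add returns false when the key was already used by an earlier hit.
            if (usedKeys.add(hit.groupingKey)) {
                result.add(hit);
            }
        }
        return result;
    }
}
```

Because the loop never reorders anything, whatever sort the search applied is preserved, which is the property that made this approach easier than a custom collector.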
|
|
Re: Group by in Lucene ?

By the way: if you only need to count documents (count groups), HitCollector is a good choice. If you only count, you don't need to sort anything.
|
|
|
Re: Group by in Lucene ?

Hi.

This is way too slow, I think, since what you are explaining is something I have already tested. However, I might be using the HitCollector badly.

Please prove me wrong. I'm supplying the code I tested this with. It stores a hash of the value of the term in a TIntHashSet and just calculates the size of that set. It takes approx. 3 sec on about 0.5M rows = way too slow.

main test class:

    public class GroupingTest
    {
        protected static final Log log = LogFactory.getLog(GroupingTest.class.getName());
        static DateFormat df = new SimpleDateFormat("yyyy-MM-dd");

        public static void main(String[] args)
        {
            Utils.initLogger();
            String[] fields = {"uid","ip","date","siteId","visits","countryCode"};
            try
            {
                IndexFactory fact = new IndexFactory();
                String d = "/tmp/csvtest";
                fact.initDir(d);
                IndexReader reader = fact.getReader(d);
                IndexSearcher searcher = fact.getSearcher(d, reader);
                QueryParser parser = new MultiFieldQueryParser(fields, fact.getAnalyzer());
                Query q = parser.parse("date:20090125");

                GroupingHitCollector coll = new GroupingHitCollector();
                coll.setDistinct(true);
                coll.setGroupField("uid");
                coll.setIndexReader(reader);
                long start = System.currentTimeMillis();
                searcher.search(q, coll);
                long stop = System.currentTimeMillis();
                System.out.println("Time: " + (stop-start) + ", distinct count(uid):"+coll.getDistinctCount() + ", count(uid): "+coll.getCount());
            }
            catch (Exception e)
            {
                log.error(e.toString(), e);
            }
        }
    }

    public class GroupingHitCollector extends HitCollector
    {
        protected IndexReader indexReader;
        protected String groupField;
        protected boolean distinct;
        //protected TLongHashSet set;
        protected TIntHashSet set;
        protected int distinctSize;

        int count = 0;
        int sum = 0;

        public GroupingHitCollector()
        {
            set = new TIntHashSet();
        }

        public String getGroupField()
        {
            return groupField;
        }

        public void setGroupField(String groupField)
        {
            this.groupField = groupField;
        }

        public IndexReader getIndexReader()
        {
            return indexReader;
        }

        public void setIndexReader(IndexReader indexReader)
        {
            this.indexReader = indexReader;
        }

        public boolean isDistinct()
        {
            return distinct;
        }

        public void setDistinct(boolean distinct)
        {
            this.distinct = distinct;
        }

        public void collect(int doc, float score)
        {
            if(distinct)
            {
                try
                {
                    Document document = this.indexReader.document(doc);
                    if(document != null)
                    {
                        String s = document.get(groupField);
                        if(s != null)
                        {
                            set.add(s.hashCode());
                            //set.add(Crc64.generate(s));
                        }
                    }
                }
                catch (IOException e)
                {
                    e.printStackTrace();
                }
            }
            count++;
            sum += doc; // use it to avoid any possibility of being optimized away
        }

        public int getCount() { return count; }
        public int getSum() { return sum; }

        public int getDistinctCount()
        {
            distinctSize = set.size();
            return distinctSize;
        }
    }

//Marcus

On Wed, Jan 28, 2009 at 10:51 AM, ninaS <nina256@...> wrote:

> By the way: if you only need to count documents (count groups) HitCollector
> is a good choice. If you only count you don't need to sort anything.
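A side note on the collector above, separate from its speed: it stores `s.hashCode()` rather than the string itself, so two different values that share a 32-bit hash are counted once. The sketch below is plain Java with a `HashSet<Integer>` standing in for the Trove `TIntHashSet`; the class and method names are hypothetical. It shows the effect with the well-known colliding pair "Aa" and "BB".

```java
import java.util.HashSet;
import java.util.Set;

final class HashedDistinct {
    /** Counts distinct values by their String.hashCode(), as the collector above does. */
    static int distinctByHash(String[] values) {
        Set<Integer> hashes = new HashSet<>();
        for (String v : values) {
            hashes.add(v.hashCode());
        }
        return hashes.size();
    }

    /** Counts distinct values exactly, at the cost of keeping the strings in memory. */
    static int distinctByValue(String[] values) {
        Set<String> seen = new HashSet<>();
        for (String v : values) {
            seen.add(v);
        }
        return seen.size();
    }
}
```

For the input `{"Aa", "BB", "Aa"}` the hash-based count is 1 while the exact count is 2, so the hashed variant is best read as an approximation that trades accuracy for memory.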
|
|
Re: Group by in Lucene ?

Oh, btw, faceting is easy; it's the distinct part I think is hard.

Example Lucene facet: http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html

On Wed, Jan 28, 2009 at 12:43 PM, Marcus Herou <marcus.herou@...> wrote:

> This is way too slow I think since what you are explaining is something I
> already tested. However I might be using the HitCollector badly.
|
|
Re: Group by in Lucene ?

At a quick glance, this line is really suspicious:

    Document document = this.indexReader.document(doc)

From the Javadoc for HitCollector.collect:

    Note: This is called in an inner search loop. For good search performance, implementations of this method should not call Searcher.doc(int) or IndexReader.document(int) on every document number encountered. Doing so can slow searches by an order of magnitude or more.

You're loading the document each time through the loop. I think you'd get much better performance by making sure that your groupField is indexed, then using TermDocs (TermEnum?) to get the value of the field.

Best
Erick

On Wed, Jan 28, 2009 at 6:43 AM, Marcus Herou <marcus.herou@...> wrote:

> Please prove me wrong. Supplying some code which I tested this with.
> It stores a hash of the value of the term in a TIntHashSet and just
> calculates the size of that set.
> This one takes approx 3 sec on about 0.5M rows = way too slow.
|
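A rough sketch of the idea in Erick's reply, written as plain Java so that it runs without Lucene: build a docId-to-value array once, outside the search loop, and make `collect` a simple array lookup instead of a stored-document load. The `postings` map stands in for walking the grouping field's terms and their documents; the class, its constructor and `maxDoc` parameter are hypothetical and do not mirror a Lucene API, and the sketch assumes at most one value per document for the grouping field.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ColumnLookupCollector {
    private final String[] valueByDoc;                    // index = document number
    private final Set<String> distinct = new HashSet<>();
    private int count;

    /**
     * @param postings term -> numbers of the documents containing that term
     *                 (a stand-in for enumerating the field's terms once)
     * @param maxDoc   one greater than the largest document number
     */
    ColumnLookupCollector(Map<String, int[]> postings, int maxDoc) {
        valueByDoc = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                valueByDoc[doc] = entry.getKey();
            }
        }
    }

    /** Called once per matching document; does no I/O, only an array read. */
    void collect(int doc) {
        count++;
        String value = valueByDoc[doc];
        if (value != null) {                              // document may lack the field
            distinct.add(value);
        }
    }

    int getCount() { return count; }

    int getDistinctCount() { return distinct.size(); }
}
```

The one-off cost of filling the array is paid per index reader rather than per query, which is the trade the Javadoc note is pointing towards.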
|
Re: Group by in Lucene ?

Group-by in Lucene/Solr has not been solved in a great general way yet, to my knowledge.

Ideally, we would want a solution that does not need to fit into memory. However, you need the value of the field for each document to do the grouping. As you are finding, this is not cheap to get. Currently, the efficient way to get it is to use a FieldCache. This, however, requires that every distinct value can fit into memory.

Once you have efficient access to the values, you need to be able to efficiently group the results, again not bounded by memory (which we already are with the FieldCache).

There are quite a few ways to do this. The simplest is to group until you have used all the memory you want; then, for everything left, write anything that doesn't match a group to a file, and if it does match, increment the group count. Use the overflow file as the input for the next run, and repeat until there is no overflow. You can improve on that by partitioning the overflow file.

And then there are a dozen other methods.

Solr has a patch in JIRA that uses a sorting method. First the results are sorted on the group-by field, then scanned through for grouping - all field values that are the same will be next to each other. Finally, if you really wanted to sort on a different field, another sort is applied. That's not ideal IMO, but it's a start.

- Mark
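The "simplest" multi-pass scheme described above could look roughly like this. It is a sketch under stated assumptions, not code from Lucene or Solr: keys arrive as an `Iterable<String>` of single-line values, the memory budget is expressed as a maximum number of groups (`maxGroupsInMemory`), finished groups are handed to a caller-supplied `sink`, and overflow goes to temporary files. All names are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.stream.Stream;

final class OverflowGrouping {
    /** Returns the number of distinct keys; each (key, count) pair is passed to sink once. */
    static long countGroups(Iterable<String> keys, int maxGroupsInMemory,
                            BiConsumer<String, Long> sink) {
        if (maxGroupsInMemory < 1) {
            throw new IllegalArgumentException("need room for at least one group");
        }
        try {
            long totalGroups = 0;
            Path input = null;                        // null: read the caller's keys
            while (true) {
                Path overflow = Files.createTempFile("groupby-overflow", ".txt");
                Map<String, Long> groups = new HashMap<>();
                long overflowed;
                if (input == null) {
                    overflowed = onePass(keys.iterator(), groups, maxGroupsInMemory, overflow);
                } else {
                    try (Stream<String> lines = Files.lines(input, StandardCharsets.UTF_8)) {
                        overflowed = onePass(lines.iterator(), groups, maxGroupsInMemory, overflow);
                    }
                    Files.delete(input);              // previous overflow file is consumed
                }
                groups.forEach(sink);                 // groups of this pass are complete
                totalGroups += groups.size();
                if (overflowed == 0) {
                    Files.delete(overflow);
                    return totalGroups;
                }
                input = overflow;                     // overflow becomes the next input
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts keys that fit in memory and writes every other key to the overflow file. */
    private static long onePass(Iterator<String> keys, Map<String, Long> groups,
                                int maxGroups, Path overflow) throws IOException {
        long overflowed = 0;
        try (BufferedWriter out = Files.newBufferedWriter(overflow, StandardCharsets.UTF_8)) {
            while (keys.hasNext()) {
                String key = keys.next();
                if (groups.containsKey(key)) {
                    groups.merge(key, 1L, Long::sum);
                } else if (groups.size() < maxGroups) {
                    groups.put(key, 1L);
                } else {
                    out.write(key);
                    out.newLine();
                    overflowed++;
                }
            }
        }
        return overflowed;
    }
}
```

Each pass admits at least one new key, so the overflow shrinks and the loop terminates. A key is written to the overflow file only once the in-memory map is full, and from that point no new key is admitted in the same pass, so every occurrence of a given key is counted in exactly one pass.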
|
|
Re: Group by in Lucene ?Yep, you are correct, this is a lousy implementation which I knew when I
wrote it. I'm not interested in the entire document just the grouping term and the docId which it is connected to. So how do I get hold of the TermDocs for the grouping field ? I mean I probably first need to perform the query: searcher.search(...) which would give me set of doc ids. Then I need to group them all by for instance: "ip-address", save each ip-address in another set and in the end calculate the size of that set. i.e the equiv of: select count(distinct(ipAddress)) from AccessLog where date='2009-01-25' (optionally group by ipAddress ?) //Marcus On Wed, Jan 28, 2009 at 3:02 PM, Erick Erickson <erickerickson@...>wrote: > At a quick glance, this line is really suspicious: > > Document document = this.indexReader.document(doc) > > From the Javadoc for HitCollector.collect: > > Note: This is called in an inner search loop. For good search performance, > implementations of this method should not call > > Searcher.doc(int)<file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/search/Searcher.html#doc%28int%29>or > > IndexReader.document(int)<file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/index/IndexReader.html#document%28int%29>on > every document number encountered. Doing so can slow searches by an > order > of magnitude or more. > > You're loading the document each time through the loop. I think you'd get > much better > performance by making sure that your groupField is indexed, then use > TermDocs (TermEnum?) > to get the value of the field. > > Best > Erick > > > > On Wed, Jan 28, 2009 at 6:43 AM, Marcus Herou <marcus.herou@... > >wrote: > > > Hi. > > > > This is way too slow I think since what you are explaining is something I > > already tested. However I might be using the HitCollector badly. > > > > Please prove me wrong. Supplying some code which I tested this with. > > It stores a hash of the value of the term in a TIntHashSet and just > > calculates the size of that set. > > This one takes approx 3 sec on about 0.5M rows = way too slow. 
> >
> > main test class:
> >
> > public class GroupingTest
> > {
> >     protected static final Log log = LogFactory.getLog(GroupingTest.class.getName());
> >     static DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
> >
> >     public static void main(String[] args)
> >     {
> >         Utils.initLogger();
> >         String[] fields = {"uid","ip","date","siteId","visits","countryCode"};
> >         try
> >         {
> >             IndexFactory fact = new IndexFactory();
> >             String d = "/tmp/csvtest";
> >             fact.initDir(d);
> >             IndexReader reader = fact.getReader(d);
> >             IndexSearcher searcher = fact.getSearcher(d, reader);
> >             QueryParser parser = new MultiFieldQueryParser(fields, fact.getAnalyzer());
> >             Query q = parser.parse("date:20090125");
> >
> >             GroupingHitCollector coll = new GroupingHitCollector();
> >             coll.setDistinct(true);
> >             coll.setGroupField("uid");
> >             coll.setIndexReader(reader);
> >             long start = System.currentTimeMillis();
> >             searcher.search(q, coll);
> >             long stop = System.currentTimeMillis();
> >             System.out.println("Time: " + (stop-start) + ", distinct count(uid):"+coll.getDistinctCount() + ", count(uid): "+coll.getCount());
> >         }
> >         catch (Exception e)
> >         {
> >             log.error(e.toString(), e);
> >         }
> >     }
> > }
> >
> > public class GroupingHitCollector extends HitCollector
> > {
> >     protected IndexReader indexReader;
> >     protected String groupField;
> >     protected boolean distinct;
> >     //protected TLongHashSet set;
> >     protected TIntHashSet set;
> >     protected int distinctSize;
> >
> >     int count = 0;
> >     int sum = 0;
> >
> >     public GroupingHitCollector()
> >     {
> >         set = new TIntHashSet();
> >     }
> >
> >     public String getGroupField()
> >     {
> >         return groupField;
> >     }
> >
> >     public void setGroupField(String groupField)
> >     {
> >         this.groupField = groupField;
> >     }
> >
> >     public IndexReader getIndexReader()
> >     {
> >         return indexReader;
> >     }
> >
> >     public void setIndexReader(IndexReader indexReader)
> >     {
> >         this.indexReader = indexReader;
> >     }
> >
> >     public boolean isDistinct()
> >     {
> >         return distinct;
> >     }
> >
> >     public void setDistinct(boolean distinct)
> >     {
> >         this.distinct = distinct;
> >     }
> >
> >     public void collect(int doc, float score)
> >     {
> >         if(distinct)
> >         {
> >             try
> >             {
> >                 Document document = this.indexReader.document(doc);
> >                 if(document != null)
> >                 {
> >                     String s = document.get(groupField);
> >                     if(s != null)
> >                     {
> >                         set.add(s.hashCode());
> >                         //set.add(Crc64.generate(s));
> >                     }
> >                 }
> >             }
> >             catch (IOException e)
> >             {
> >                 e.printStackTrace();
> >             }
> >         }
> >         count++;
> >         sum += doc; // use it to avoid any possibility of being optimized away
> >     }
> >
> >     public int getCount() { return count; }
> >     public int getSum() { return sum; }
> >
> >     public int getDistinctCount()
> >     {
> >         distinctSize = set.size();
> >         return distinctSize;
> >     }
> > }
> >
> > On Wed, Jan 28, 2009 at 10:51 AM, ninaS <nina256@...> wrote:
> >
> > > By the way: if you only need to count documents (count groups),
> > > HitCollector is a good choice. If you only count, you don't need to
> > > sort anything.
> > >
> > > ninaS wrote:
> > > >
> > > > Hello,
> > > >
> > > > yes I tried HitCollector but I am not satisfied with it because you
> > > > cannot use sorting with HitCollector unless you implement a way to
> > > > use TopFieldDocCollector. I did not manage to do that in a
> > > > performant way.
> > > >
> > > > It is easier to first do a normal search and "group by" afterwards:
> > > >
> > > > Iterate through the result documents and take one of each group.
> > > > Each document has a groupingKey. I remember which groupingKey is
> > > > already used and don't take another document of this group into the
> > > > result list.
> > > >
> > > > Regards,
> > > > Nina
> > >
> > > --
> > > View this message in context:
> > > http://www.nabble.com/Group-by-in-Lucene---tp13581760p21702742.html
> > > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@...
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@...
http://www.tailsweep.com/
http://blogg.tailsweep.com/ |
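For readers following along, the "search first, group afterwards" approach Nina describes in the quoted text can be sketched roughly as below. This is only an illustration in plain Java, not Lucene API: the `Hit` type, the `firstPerGroup` method and the `maxNum` parameter are made-up names, and the ranked hit list is assumed to have been produced already by a normal (sorted) search. It keeps the real key strings in a `HashSet` rather than their `hashCode()` values, since two different strings can share a hash code and would then be counted as one group.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: keep the first (best-ranked) hit of every group.
final class FirstHitPerGroup {

    /** Hypothetical holder for a hit: a document id plus its grouping key. */
    static final class Hit {
        final int docId;
        final String groupKey;

        Hit(int docId, String groupKey) {
            this.docId = docId;
            this.groupKey = groupKey;
        }
    }

    private FirstHitPerGroup() {}

    /**
     * Walks the hits in ranked order and returns at most maxNum document ids,
     * one per distinct groupKey. Hits without a key are skipped here, which
     * mirrors the null check in the collector quoted above; that is a choice,
     * not a requirement.
     */
    static List<Integer> firstPerGroup(List<Hit> rankedHits, int maxNum) {
        Set<String> seenKeys = new HashSet<>();
        List<Integer> result = new ArrayList<>();
        for (Hit hit : rankedHits) {
            if (result.size() >= maxNum) {
                break; // enough groups collected, stop scrolling
            }
            if (hit.groupKey == null) {
                continue;
            }
            // add() returns false when the key was already seen
            if (seenKeys.add(hit.groupKey)) {
                result.add(hit.docId);
            }
        }
        return result;
    }
}
```

Because the hits are visited in result order, whatever sort the search applied is preserved; the cost is that the loop may have to scroll far into the result set when a few groups dominate the top hits.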
|
|
Re: Group by in Lucene ?
Yep. Probably an external sort should be used when flushing to disk. I have written such code, so that part is probably a no-brainer; the problem is to get it speedy :)
<http://dev.tailsweep.com/projects/utils/apidocs/org/tailsweep/utils/sort/TupleSorter.html>
http://dev.tailsweep.com/projects/utils/apidocs/com/tailsweep/utils/sort/TupleSorter.html

Another way could be to use HDFS and MapFiles/SequenceFiles. Not speedy at all, but scalable.

Thinking of writing my own inverted index, specialized for this kind of operation. Any pointers on where to start looking for material on that?

/Marcus

On Wed, Jan 28, 2009 at 5:02 PM, Mark Miller <markrmiller@...> wrote:

> Group-by in Lucene/Solr has not been solved in a great general way yet to
> my knowledge.
>
> Ideally, we would want a solution that does not need to fit into memory.
> However, you need the value of the field for each document to do the
> grouping. As you are finding, this is not cheap to get. Currently, the
> efficient way to get it is to use a FieldCache. This, however, requires
> that every distinct value can fit into memory.
>
> Once you have efficient access to the values, you need to be able to
> efficiently group the results, again not bounded by memory (which we
> already are with the FieldCache).
>
> There are quite a few ways to do this. The simplest is to group until you
> have used all the memory you want; then, for everything left, anything
> that doesn't match a group is written to a file, and anything that does
> increments the group count. Use the overflow file as the input in the
> next run, and repeat until there is no overflow. You can improve on that
> by partitioning the overflow file.
>
> And then there are a dozen other methods.
>
> Solr has a patch in JIRA that uses a sorting method. First the results
> are sorted on the group-by field, then scanned through for grouping - all
> field values that are the same will be next to each other. Finally, if
> you really wanted to sort on a different field, another sort is applied.
> That's not ideal IMO, but it's a start.
>
> - Mark

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@...
http://www.tailsweep.com/
http://blogg.tailsweep.com/ |
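The "simplest" method in Mark's quoted text (group in memory up to a budget, spill everything else to an overflow file, then rerun on the overflow) might look something like the sketch below. It is plain Java with no Lucene calls, and several things in it are assumptions rather than anything from the thread: the names `OverflowGroupCount`, `countGroups` and `maxGroupsInMemory`, expressing the memory budget as a number of groups, and storing one key per line in a temp file (so keys must not contain line breaks). The partitioning improvement Mark mentions is not shown.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.stream.Stream;

// Sketch only: multi-pass group counting with an overflow file.
final class OverflowGroupCount {

    private OverflowGroupCount() {}

    /** Reports (key, count) for every group to sink, holding at most maxGroupsInMemory groups at once. */
    static void countGroups(Iterator<String> keys, int maxGroupsInMemory,
                            BiConsumer<String, Long> sink) {
        if (maxGroupsInMemory < 1) {
            throw new IllegalArgumentException("maxGroupsInMemory must be >= 1");
        }
        try {
            Path overflow = onePass(keys, maxGroupsInMemory, sink);
            while (overflow != null) {
                Path next;
                // The overflow file of the previous pass is the input of this one.
                try (Stream<String> lines = Files.lines(overflow, StandardCharsets.UTF_8)) {
                    next = onePass(lines.iterator(), maxGroupsInMemory, sink);
                } finally {
                    Files.delete(overflow);
                }
                overflow = next;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs one pass; returns the overflow file, or null if nothing was spilled. */
    private static Path onePass(Iterator<String> keys, int maxGroups,
                                BiConsumer<String, Long> sink) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        Path overflow = Files.createTempFile("groupby-overflow", ".txt");
        boolean spilled = false;
        try (BufferedWriter out = Files.newBufferedWriter(overflow, StandardCharsets.UTF_8)) {
            while (keys.hasNext()) {
                String key = keys.next();
                if (counts.containsKey(key)) {
                    counts.merge(key, 1L, Long::sum);   // known group: increment
                } else if (counts.size() < maxGroups) {
                    counts.put(key, 1L);                // room left: open a new group
                } else {
                    out.write(key);                     // budget used up: spill
                    out.newLine();
                    spilled = true;
                }
            }
        }
        // The map never shrinks during a pass, so a key is either counted
        // completely here or spilled completely; these counts are final.
        counts.forEach(sink);
        if (!spilled) {
            Files.delete(overflow);
            return null;
        }
        return overflow;
    }
}
```

Each pass finishes at least one group, so the loop terminates, but with many distinct values and a small budget the number of passes grows quickly. That is presumably why partitioning the overflow file, or an external sort as suggested in the reply, is the more realistic route for large indexes.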
|
|
Re: Group by in Lucene ? |
|
|
Re: Group by in Lucene ?
Don't overlook Solr: http://lucene.apache.org/solr

Erik

On Aug 1, 2009, at 5:43 AM, mschipperheyn wrote:

> http://code.google.com/p/bobo-browse
>
> looks like it may be the ticket.
>
> Marc
>
> --
> View this message in context: http://www.nabble.com/Group-by-in-Lucene---tp13581760p24767693.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com. |