|
|
|
Re: Offer to submit some custom enhancements
If you wish to write out only the known parts (or the parts that you need), say responseheader, results or the output of any known handler, that would be fine. But it cannot be a standard responsewriter unless it supports the NamedList format. That is OK, though.

One quick question: what is the client platform that the library is going to run on, such that you need Protocol Buffers?

On Thu, Oct 16, 2008 at 9:27 PM, Feak, Todd <Todd.Feak@...> wrote:
> Answering Grant Ingersoll's question for use case as well, which may clarify.
>
> Without revealing TOO much about our internal structure, we are in the process of replacing SOAP communications in house with Protocol Buffers. We did evaluate Thrift as well, but decided on Protocol Buffers. A large effort for that conversion is well under way. I've been asked if Solr can support this, and to create a prototype to see if there are similar gains. I don't imagine it will be the gains that we've seen over SOAP, but I do foresee some amount of throughput increase.
>
> So, in response to suggestion for other binary formatting technologies, my hands are tied. This is the prototype I have to work on for now. If it works out, I will gladly share it. If not, I will share why, and hopefully save others some time.
>
> As for Protocol Buffers not supporting the NamedList structure. Google's documentation strongly suggests that intermediate (bean) classes be created, instead of trying to marshall and de-marshall your object model directly. This intermediate model doesn't have to precisely mirror the NamedList, it can be *any* compromise that gets the data from A to B, as long as the NamedList can be reconstituted on the other side. I'm sure something can be done.
>
> Thanks,
> Todd Feak
>
> -----Original Message-----
> From: Shalin Shekhar Mangar [mailto:shalinmangar@...]
> Sent: Thursday, October 16, 2008 8:17 AM
> To: solr-dev@...
> Subject: Re: Offer to submit some custom enhancements
>
> Hi Todd,
>
> AFAIK, protocol buffers cannot be used for Solr because it is unable to support the NamedList structure that all Solr components use.
>
> The binary protocol (NamedListCodec) that SolrJ uses to communicate with Solr server is extremely optimized for our response format. However it is Java only.
>
> There are other projects such as Apache Thrift (http://incubator.apache.org/thrift/) and Etch (both in incubation) which can be looked at. There are a few issues in Thrift which may help us in the future:
>
> https://issues.apache.org/jira/browse/THRIFT-110
> https://issues.apache.org/jira/browse/THRIFT-122
>
> On Thu, Oct 16, 2008 at 12:18 AM, Feak, Todd <Todd.Feak@...> wrote:
>
>> Reposting, as I inadvertently thread hijacked on the first one. My bad.
>>
>> Hi all,
>>
>> I have a handful of custom classes that we've created for our purposes here. I'd like to share them if you think they have value for the rest of the community, but I wanted to check here before creating JIRA tickets and patches.
>>
>> Here's what I have:
>>
>> 1. DoubleMetaphoneFilter and Factory. This replaces usage of the PhoneticFilter and Factory allowing access to set maxCodeLength() on the DoubleMetaphone encoder and access to the "alternate" encodings that the encoder provides for some words.
>>
>> 2. JapaneseHalfWidthFilter and Factory. Some Japanese characters (and Latin alphabet) exist in both a FullWidth and HalfWidth form. This filter normalizes by switching to the FullWidth form for all the characters. I have seen at least one JIRA ticket about this issue. This implementation doesn't rely on Java 1.6.
>>
>> 3. JapaneseHiraganaFilter and Factory. Japanese Hiragana can be translated to Katakana. This filter normalizes to Katakana so that data and queries can come in either way and get hits.
>>
>> Also, I have been requested to create a prototype that you may be interested in. I'm to construct a QueryResponseWriter that returns documents using Google's Protocol Buffers. This would rely on an existing patch that exposes the OutputStream, but I would like to start the work soon. Are there license concerns that would block sharing this with you? Is there any interest in this?
>>
>> Thanks for your consideration,
>> Todd Feak
>
> --
> Regards,
> Shalin Shekhar Mangar.

--
--Noble Paul |
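The Hiragana and full-width normalizations described in the quoted proposal can be illustrated with plain character arithmetic. What follows is only a minimal sketch, not Todd's filters: the class and method names are made up for illustration, it works on strings rather than on Lucene tokens, and it leaves out half-width Katakana, which needs a lookup table rather than a fixed offset. It relies only on the Unicode block layout (Hiragana U+3041 to U+3096 sits 0x60 below the matching Katakana, and printable ASCII sits 0xFEE0 below its full-width forms).

```java
public class KanaNormalizerSketch {
    // Hiragana U+3041..U+3096 maps onto Katakana U+30A1..U+30F6 at a fixed offset.
    private static final int HIRAGANA_START = 0x3041;
    private static final int HIRAGANA_END = 0x3096;
    private static final int KANA_OFFSET = 0x60;

    // Printable ASCII U+0021..U+007E maps onto full-width forms U+FF01..U+FF5E.
    private static final int ASCII_START = 0x21;
    private static final int ASCII_END = 0x7E;
    private static final int FULL_WIDTH_OFFSET = 0xFEE0;

    /** Replaces each Hiragana character with its Katakana counterpart. */
    public static String hiraganaToKatakana(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= HIRAGANA_START && c <= HIRAGANA_END) {
                out.append((char) (c + KANA_OFFSET));
            } else {
                out.append(c); // everything else passes through unchanged
            }
        }
        return out.toString();
    }

    /** Widens printable ASCII characters to their full-width forms. */
    public static String asciiToFullWidth(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= ASCII_START && c <= ASCII_END) {
                out.append((char) (c + FULL_WIDTH_OFFSET));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Applying the same normalization at index time and at query time is what lets data and queries "come in either way and get hits".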
|
|
RE: Offer to submit some custom enhancements
Both Java and C++ clients would potentially be using the Protocol Buffers. Performance and ease of adoption will determine what actually gets used.

I'm implementing QueryResponseWriter right now and am able to handle the NamedList, but due to the lack of inheritance in the Protocol Buffer object model, each type that is placed into the NamedList needs special handling. However, that doesn't appear to be any different than the JSON or XML response writers, so I am hopeful this could work. The biggest stumbling block is the lack of access to OutputStream instead of Writer, but I saw a patch to address that.

-Todd

-----Original Message-----
From: Noble Paul നോബിള് नोब्ळ् [mailto:noble.paul@...]
Sent: Friday, October 17, 2008 4:07 AM
To: solr-dev@...
Subject: Re: Offer to submit some custom enhancements |
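The "intermediate (bean) classes" idea and the per-type special handling mentioned in this exchange can be sketched roughly as follows. This is an illustration under assumptions, not the prototype itself: `Pair`, `Kind` and `WireValue` are hypothetical stand-ins (for a NamedList entry and for a generated message class), and only strings, integers and nested lists are covered.

```java
import java.util.ArrayList;
import java.util.List;

public class IntermediateModelSketch {
    /** Hypothetical stand-in for one name/value entry of an ordered NamedList. */
    public static final class Pair {
        public final String name;
        public final Object value;
        public Pair(String name, Object value) { this.name = name; this.value = value; }
    }

    /** Without inheritance on the wire, each supported value type gets an explicit tag. */
    public enum Kind { STRING, INT, LIST }

    /** Hypothetical intermediate bean; only the payload field matching 'kind' is used. */
    public static final class WireValue {
        public String name;
        public Kind kind;
        public String stringValue;
        public int intValue;
        public List<WireValue> children = new ArrayList<WireValue>();
    }

    /** Converts one named value into the intermediate form, one branch per type. */
    public static WireValue toWire(String name, Object value) {
        WireValue w = new WireValue();
        w.name = name;
        if (value instanceof String) {
            w.kind = Kind.STRING;
            w.stringValue = (String) value;
        } else if (value instanceof Integer) {
            w.kind = Kind.INT;
            w.intValue = ((Integer) value).intValue();
        } else if (value instanceof List) {
            w.kind = Kind.LIST;
            for (Object o : (List<?>) value) {
                Pair p = (Pair) o; // nested lists are assumed to hold Pair entries
                w.children.add(toWire(p.name, p.value));
            }
        } else {
            throw new IllegalArgumentException("unsupported type: " + String.valueOf(value));
        }
        return w;
    }

    /** Rebuilds the original value (String, Integer or list of Pair) on the other side. */
    public static Object fromWire(WireValue w) {
        switch (w.kind) {
            case STRING:
                return w.stringValue;
            case INT:
                return Integer.valueOf(w.intValue);
            case LIST:
                List<Pair> pairs = new ArrayList<Pair>();
                for (WireValue child : w.children) {
                    pairs.add(new Pair(child.name, fromWire(child)));
                }
                return pairs;
            default:
                throw new IllegalStateException("unknown kind: " + w.kind);
        }
    }
}
```

The intermediate form does not mirror the list exactly; it only has to carry enough (names, order, a type tag) for the receiver to rebuild it.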
|
|
Re: Offer to submit some custom enhancements
On Oct 16, 2008, at 11:35 AM, Feak, Todd wrote:
> Regarding the location of the Filters and Factories ... I agree that the Filters would be best located in Lucene, as users of both packages would then have access.
>
> What I'm struggling with is the timing of putting Filters into Lucene, and then Factories into Solr. The Factories in Solr would be useless until the Filters had been accepted and released in Lucene, then the Lucene version upgraded in Solr. What I'm inclined to do is release the Filters to both, and have the Factories point to the Solr version, until they become available in the Lucene version, then switch them over and drop the Solr version.
>
> How is this handled with other new Filter/Factory sets?
>
> Just let me know, and I'll get the ball rolling on those.

We have not yet seen Filters come out of Solr and then get 'promoted' to Lucene, so it's all new...

For simplicity, I'm sure you want to keep the Filter and FilterFactory in the same patch; otherwise it would be too difficult to keep them in sync.

I would suggest (others may feel differently) making the patches and issues in Solr. Once they are in JIRA, the committers can figure out exactly where they should live within Lucene and/or Solr.

So I suggest opening three issues in Solr and attaching the files there.

ryan |
|
|
RE: Offer to submit some custom enhancements
>>> But it cannot be a standard responsewriter unless it supports NamedList format

It has to be able to handle the NamedLists contained in SolrQueryResponse, but it can output them in whatever format it wants for going over the wire ... whether the client on the other side of the Protocol Buffer knows how to make sense of the data you send it is another matter.

: biggest stumbling block is the lack of access to OutputStream instead of
: Writer, but I saw a patch to address that.

No patch needed: implement BinaryQueryResponseWriter and you'll be given a raw OutputStream.

-Hoss |
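Why a raw OutputStream matters for a binary format can be shown with a small stand-alone sketch. It deliberately does not use Solr's classes or the Protocol Buffers wire format: the class name, the made-up layout (a count followed by name/value strings) and the helper methods are all assumptions for illustration. The point is only that binary encoders write bytes, which a character-oriented Writer cannot carry without re-encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class BinaryPairWriterSketch {
    /** Writes a pair count, then each name and value, in a made-up binary layout. */
    public static void write(OutputStream out, Map<String, String> pairs) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(pairs.size());
        for (Map.Entry<String, String> entry : pairs.entrySet()) {
            data.writeUTF(entry.getKey());
            data.writeUTF(entry.getValue());
        }
        data.flush();
    }

    /** Convenience wrapper that encodes into an in-memory byte array. */
    public static byte[] toBytes(Map<String, String> pairs) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            write(buffer, pairs);
        } catch (IOException e) {
            throw new IllegalStateException("in-memory write failed", e);
        }
        return buffer.toByteArray();
    }

    /** Decodes the layout produced by write(), preserving the original order. */
    public static Map<String, String> fromBytes(byte[] bytes) {
        Map<String, String> pairs = new LinkedHashMap<String, String>();
        try {
            DataInputStream data = new DataInputStream(new ByteArrayInputStream(bytes));
            int count = data.readInt();
            for (int i = 0; i < count; i++) {
                String name = data.readUTF();
                String value = data.readUTF();
                pairs.put(name, value);
            }
        } catch (IOException e) {
            throw new IllegalStateException("malformed input", e);
        }
        return pairs;
    }
}
```

In a real writer, the `OutputStream` argument would be the stream handed to the response writer rather than an in-memory buffer.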
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------
Attachment: SOLR-769.patch

Removed the alternate algorithm implementations, but left in some of the framework for adding them. The Carrot2 maintainers are likely to remove Fuzzy Ants and some of the other implementations in 3.0, which is due out sometime soon. Thus, I'd rather not support something that isn't recommended. I'm likely to commit this fairly soon.

-Grant

> Support Document and Search Result clustering
> ---------------------------------------------
>
> Key: SOLR-769
> URL: https://issues.apache.org/jira/browse/SOLR-769
> Project: Solr
> Issue Type: New Feature
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: clustering-libs.tar, clustering-libs.tar, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch
>
> Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering.
>
> The patch lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster. While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have, in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results.
>
> While not fully fleshed out yet, the collection based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??????) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online. |
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------
Attachment: SOLR-769.patch

OK, here's a first scratch at the component side of document clustering. There are no implementations of the DocumentClusteringEngine yet, so I am a bit hesitant to even throw out a proposed API for that yet, but the current one is pretty generic, which is both good and bad. I don't particularly like passing around something as open as SolrParams, but I don't think I can pin down a generic set of explicit parameters either. |
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------
Attachment: SOLR-769.patch

How about a patch where the tests pass? :-) Here ya go... |
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641812#action_12641812 ]

Vaijanath N. Rao commented on SOLR-769:
---------------------------------------

Hi Grant,

With just some minor copying of a .txt file, I got this working without any problems. So what would be the procedure to add some clustering code beyond Carrot or other available libraries?

--Thanks and Regards
Vaijanath |
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641823#action_12641823 ]

Grant Ingersoll commented on SOLR-769:
--------------------------------------

{quote}So what would be the procedure to add some clustering code beyond carrot or other available libraries. {quote}

Essentially, you need to implement either a SearchClusteringEngine or a DocumentClusteringEngine and then declare it in the SearchComponent configuration, as is done with the Carrot2 example here:

{code}
<lst name="engine">
  <!-- The name, only one can be named "default" -->
  <str name="name">default</str>
  <!-- Carrot2 specific parameters. See the Carrot2 site for details on setting. -->
  <!-- carrot.algorithm: Optional. Currently only lingo is supported pending the release of Carrot2 3.0. -->
  <str name="carrot.algorithm">lingo</str>
  <!-- Lingo specific -->
  <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
  <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
</lst>
{code}

or, in the mock setup:

{code}
<lst name="engine">
  <!-- The name, only one can be named "default" -->
  <str name="name">docEngine</str>
  <str name="classname">org.apache.solr.handler.clustering.MockDocumentClusteringEngine</str>
</lst>
{code}

If you don't declare the classname value, then it assumes the Carrot implementation. Naturally, you need to take care of all the libraries being available to Solr, etc., just as you would for any plugin.

Since you are interested in clustering, Vaijanath, it would be good to get your feedback on the APIs. Are you doing full document clustering or just search snippet clustering? Also, if you are using an open source clustering library that has acceptable licensing terms (i.e. not GPL or similar), perhaps consider contributing an implementation of the engine and then we can make it available to everyone. |
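The fallback rule described in this comment (no classname means the Carrot implementation) and the "only one can be named default" constraint can be sketched as below. This is not the code from the patch: the class, the method names and in particular `ASSUMED_DEFAULT_CLASS` are placeholders, since the thread does not give the real class name of the Carrot2 engine. Each engine block is modelled as a simple map of its `<str>` entries.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EngineConfigSketch {
    // Placeholder only: the actual Carrot2 engine class name is not stated in the thread.
    public static final String ASSUMED_DEFAULT_CLASS = "example.CarrotClusteringEngine";

    /** Picks the class for one engine block, falling back when "classname" is absent. */
    public static String resolveClassName(Map<String, String> engineConfig) {
        String className = engineConfig.get("classname");
        return className != null ? className : ASSUMED_DEFAULT_CLASS;
    }

    /** Maps engine name to class name and rejects two engines sharing a name. */
    public static Map<String, String> register(List<Map<String, String>> engines) {
        Map<String, String> byName = new LinkedHashMap<String, String>();
        for (Map<String, String> engine : engines) {
            String name = engine.get("name");
            if (name == null) {
                throw new IllegalArgumentException("engine without a name");
            }
            if (byName.containsKey(name)) {
                throw new IllegalArgumentException("duplicate engine name: " + name);
            }
            byName.put(name, resolveClassName(engine));
        }
        return byName;
    }
}
```

Keying the registry by name is one simple way to guarantee that at most one engine ends up registered as "default".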
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641832#action_12641832 ]

Vaijanath N. Rao commented on SOLR-769:
---------------------------------------

Hi Grant,

Until now I have worked mostly with full document clustering. I had never thought of search snippet clustering. I will definitely pitch in with a clustering library. There are many libraries with favourable/acceptable licensing terms which can be added to Solr.

--Thanks and Regards
Vaijanath |
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642179#action_12642179 ]

Bruce Ritchie commented on SOLR-769:
------------------------------------

Grant,

This patch looks very promising. I can't wait to give it a try and find a way to incorporate it into a project I'm working on (when it's ready of course ... likely not till after Carrot2 3 is released though).

Can you give a quick estimate as to the performance impact of enabling clustering in search results mode? In the example @ http://wiki.apache.org/solr/ClusteringFullResultsExample the query time seems pretty high and I was wondering if that was a result of this patch or something else?

Thanks,
Bruce Ritchie |
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642182#action_12642182 ]

Stanislaw Osinski commented on SOLR-769:
----------------------------------------

Bruce,

For the performance of the clustering algorithm alone, please take a look at: http://project.carrot2.org/algorithms.html

Obviously, you'd need to add the overhead of fetching the snippets / documents from the index. I'm not sure how many are fetched and whether they come from Solr's cache or not, so I'm not sure whether clustering time or fetching time dominates.

Cheers,
Staszek |
|
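The search-results mode described in the issue text above (a clustering component chained after the normal query component, adding its output to the shared response) can be pictured with a toy model. This is only a sketch in Python, not Solr's Java API: the class names echo Solr's `SearchComponent` and `ResponseBuilder`, but every class, method, and field here is hypothetical, and `group_by_first_word` is a placeholder for a real algorithm such as those in Carrot2.

```python
class ResponseBuilder:
    """Toy stand-in for the per-request state shared by all components."""

    def __init__(self, query):
        self.query = query
        self.doc_list = []   # matching documents, filled by the query component
        self.response = {}   # named sections of the final response


class QueryComponent:
    """Finds matching documents; runs first in the chain."""

    def __init__(self, index):
        self.index = index

    def process(self, rb):
        needle = rb.query.lower()
        rb.doc_list = [d for d in self.index if needle in d["text"].lower()]
        rb.response["response"] = [d["id"] for d in rb.doc_list]


class ClusteringComponent:
    """Clusters whatever the earlier component found; issues no query itself."""

    def __init__(self, cluster_fn):
        self.cluster_fn = cluster_fn

    def process(self, rb):
        rb.response["clusters"] = self.cluster_fn(rb.doc_list)


def group_by_first_word(docs):
    """Placeholder 'clustering': group document ids by their first word."""
    groups = {}
    for d in docs:
        key = d["text"].split()[0].lower()
        groups.setdefault(key, []).append(d["id"])
    return groups


def handle(query, components):
    """Run each component in order against one shared ResponseBuilder."""
    rb = ResponseBuilder(query)
    for component in components:
        component.process(rb)
    return rb.response


INDEX = [
    {"id": "a", "text": "Hard drive 500GB"},
    {"id": "b", "text": "Hard drive 250GB"},
    {"id": "c", "text": "Video card"},
]

if __name__ == "__main__":
    chain = [QueryComponent(INDEX), ClusteringComponent(group_by_first_word)]
    print(handle("drive", chain))
```

The point of the chained design is that the clustering step reuses the document list already produced for the request, rather than submitting a second query as the Carrot2 input component does.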
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642187#action_12642187 ]

Grant Ingersoll commented on SOLR-769:
--------------------------------------

Hi Bruce,

I haven't done any perf. testing, as I've been focused on functionality first. However, I'm not sure whether that query was the first one run, or not, so I don't know the status of the searcher, etc. I'm pretty sure I don't have any warming queries, etc.
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------

Attachment: SOLR-769.patch

Updated to trunk. See http://wiki.apache.org/solr/ClusteringComponent
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------

Attachment: SOLR-769.patch

Here's a patch for Carrot2 3.0 that COMPILES ONLY. You will need to download clustering-libs.tar.gz from http://people.apache.org/~gsingers/clustering-libs.tar.gz, as it is too big to upload to JIRA.

TODO:
1. Get the tests passing and add more tests
2. Update NOTICE.txt and LICENSE.txt
3. Get a trimmed-down Carrot2 library that doesn't have all the Document Source dependencies, and preferably not the web services deps either. Solr doesn't need the Google, etc. API deps. Preferably remove the LGPL deps too, but for now they are downloaded via Ant from the Maven repositories.
4. Update the Maven template
5. Hook in the builds
6. Make sure the example works
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672281#action_12672281 ]

Stanislaw Osinski commented on SOLR-769:
----------------------------------------

Hi Grant,

I've added a Carrot2 issue referring to point 3 on your TODO list: http://issues.carrot2.org/browse/CARROT-457. I'll be looking into this over the weekend.

Staszek
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-769:
-----------------------------------

Attachment: SOLR-769-lib.zip
            SOLR-769.patch

Yet another patch, this time with passing unit tests and working example. Will make some more comments in a sec. Please use SOLR-769-lib.zip libs with this patch.
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680942#action_12680942 ]

Stanislaw Osinski commented on SOLR-769:
----------------------------------------

Hi All,

I've just uploaded a patch that passes unit tests and has a working example, but this is by no means a final version. A few outstanding questions / issues:

# h4. Response structure.

I was wondering -- do we need to repeat the document contents in the 'clusters' response section? Assuming that each document in the index has a unique ID, we could reduce the size of the response by just referencing documents by IDs like this:

{code}
<lst name="clusters">
  <int name="numClusters">3</int>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">GPU VPU Clocked</str>
    </lst>
    <lst name="docs">
      <str name="doc">EN7800GTX/2DHTV/256M</str>
      <str name="doc">100-435805</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Hard Drive</str>
    </lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Other Topics</str>
    </lst>
    <lst name="docs">
      <str name="doc">9885A004</str>
    </lst>
  </lst>
</lst>
{code}

Actually, this is what I've implemented in the patch. Also, in case of hierarchical clusters I've introduced a grouping entity called "clusters" so that the top- and sub-levels of the response are consistent (see unit tests). Please let me know if this makes sense.

# h4 Build: compile warnings about missing SimpleXML

SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not needed at runtime, but it generates warnings about missing dependencies at compile time. So the option is either to live with the warnings or to add SimpleXML (version 1.7.2) to get rid of them.

# h4 Build: copying of protowords.txt etc

The patch includes lexical files both in the contrib/clustering/src/java/test/resources/.... and in the examples dir. I'm not sure how this is handled though -- do you keep copies in the repository or copy those somehow in the build?

# h4 Highlighting

This is the bit I've not yet fully analyzed. In general, Carrot2 should handle full documents fairly well (up to, say, a few hundred kB each); it's just the number of documents that must be on the order of hundreds. Therefore, highlighting is not mandatory, but it may sometimes improve the quality of clusters. I was wondering: if highlighting is performed earlier in the Solr pipeline, could this be reused during clustering? One possible approach could be that clustering uses whatever is fed from the pipeline: if highlighting is enabled, clustering will be performed on the highlighted content; if there was no highlighting, we'd cluster full documents. Not sure if that's reasonable / possible to implement though.

# h4 Documentation (wiki) updates

Once we stabilise the ideas, I'm happy to update the wiki with regard to the algorithms used (Lingo/STC) and passing additional parameters.
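To make the ID-only response structure from point 1 of the comment above concrete, here is a minimal client-side sketch that reads such a `clusters` section and recovers the labels and document IDs of each cluster. It assumes the section arrives as a well-formed fragment (the sample closes the outer `<lst name="clusters">` element) and uses only Python's standard `xml.etree.ElementTree`; the function name `parse_clusters` is made up for illustration and is not part of Solr or any client library.

```python
import xml.etree.ElementTree as ET

# Sample fragment in the ID-only shape shown in the comment above.
SAMPLE = """
<lst name="clusters">
  <int name="numClusters">3</int>
  <lst name="cluster">
    <lst name="labels"><str name="label">GPU VPU Clocked</str></lst>
    <lst name="docs">
      <str name="doc">EN7800GTX/2DHTV/256M</str>
      <str name="doc">100-435805</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels"><str name="label">Hard Drive</str></lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels"><str name="label">Other Topics</str></lst>
    <lst name="docs">
      <str name="doc">9885A004</str>
    </lst>
  </lst>
</lst>
"""


def parse_clusters(xml_text):
    """Return one (labels, doc_ids) tuple per top-level cluster."""
    root = ET.fromstring(xml_text.strip())
    clusters = []
    # Only direct children named "cluster" are read; nested sub-clusters
    # (the hierarchical case) are not handled in this sketch.
    for cluster in root.findall("lst[@name='cluster']"):
        labels = [e.text for e in
                  cluster.findall("lst[@name='labels']/str[@name='label']")]
        doc_ids = [e.text for e in
                   cluster.findall("lst[@name='docs']/str[@name='doc']")]
        clusters.append((labels, doc_ids))
    return clusters


if __name__ == "__main__":
    for labels, doc_ids in parse_clusters(SAMPLE):
        print(", ".join(labels), "->", doc_ids)
```

Because the clusters carry only IDs, a client that wants titles or snippets would look each ID up in the main result list of the same response, which is the trade-off the proposal accepts in exchange for a smaller payload.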
|
|
[jira] Issue Comment Edited: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680942#action_12680942 ]

Stanislaw Osinski edited comment on SOLR-769 at 3/11/09 10:46 AM:
------------------------------------------------------------------

Hi All,

I've just uploaded a patch that passes unit tests and has a working example, but this is by no means a final version. A few outstanding questions / issues:

1. Response structure.

I was wondering -- do we need to repeat the document contents in the 'clusters' response section? Assuming that each document in the index has a unique ID, we could reduce the size of the response by just referencing documents by IDs like this:

{code}
<lst name="clusters">
  <int name="numClusters">3</int>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">GPU VPU Clocked</str>
    </lst>
    <lst name="docs">
      <str name="doc">EN7800GTX/2DHTV/256M</str>
      <str name="doc">100-435805</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Hard Drive</str>
    </lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Other Topics</str>
    </lst>
    <lst name="docs">
      <str name="doc">9885A004</str>
    </lst>
  </lst>
</lst>
{code}

Actually, this is what I've implemented in the patch. Also, in case of hierarchical clusters I've introduced a grouping entity called "clusters" so that the top- and sub-levels of the response are consistent (see unit tests). Please let me know if this makes sense.

2. Build: compile warnings about missing SimpleXML

SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not needed at runtime, but it generates warnings about missing dependencies at compile time. So the option is either to live with the warnings or to add SimpleXML (version 1.7.2) to get rid of them.

3. Build: copying of protowords.txt etc

The patch includes lexical files both in the contrib/clustering/src/java/test/resources/.... and in the examples dir. I'm not sure how this is handled though -- do you keep copies in the repository or copy those somehow in the build?

4. Highlighting

This is the bit I've not yet fully analyzed. In general, Carrot2 should handle full documents fairly well (up to, say, a few hundred kB each); it's just the number of documents that must be on the order of hundreds. Therefore, highlighting is not mandatory, but it may sometimes improve the quality of clusters. I was wondering: if highlighting is performed earlier in the Solr pipeline, could this be reused during clustering? One possible approach could be that clustering uses whatever is fed from the pipeline: if highlighting is enabled, clustering will be performed on the highlighted content; if there was no highlighting, we'd cluster full documents. Not sure if that's reasonable / possible to implement though.

5. Documentation (wiki) updates

Once we stabilise the ideas, I'm happy to update the wiki with regard to the algorithms used (Lingo/STC) and passing additional parameters.
|
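The "cluster whatever the pipeline provides" idea from the highlighting point in the comment above can be sketched as a small selection function. This is an illustration only, not code from the patch: the function name, the field names `title` and `content`, and the shape of the `highlights` mapping are all assumptions, and the per-field fallback is one possible reading of the proposal.

```python
def clustering_text(doc, highlights=None, fields=("title", "content")):
    """Choose the text handed to the clustering algorithm for one document.

    doc:        dict of stored field values; must contain "id".
    highlights: dict mapping doc id -> {field: [snippet, ...]}, or None
                when no highlighting ran earlier in the pipeline.
    """
    snippets = (highlights or {}).get(doc["id"], {})
    parts = []
    for field in fields:
        if snippets.get(field):
            # Highlighting already produced fragments for this field: reuse them.
            parts.extend(snippets[field])
        elif doc.get(field):
            # No fragments available: fall back to the full stored value.
            parts.append(doc[field])
    return " ".join(parts)


if __name__ == "__main__":
    doc = {"id": "6H500F0", "title": "Hard Drive", "content": "A 500 GB hard drive"}
    print(clustering_text(doc))
    print(clustering_text(doc, {"6H500F0": {"content": ["500 GB <em>hard</em> drive"]}}))
```

A real implementation would probably also need to decide whether highlight markup such as `<em>` should be stripped before clustering; the sketch leaves the fragments untouched.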
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-769:
-----------------------------------

Attachment: (was: SOLR-769-lib.zip)