[jira] Created: (SOLR-769) Support Document and Search Result clustering

Support Document and Search Result clustering
---------------------------------------------
Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor

Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering.

The patch lays out a contrib module that starts off with an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster. While Carrot2 comes with a Solr input component, it is not the same as the SearchComponent that I have: the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results.

While not fully fleshed out yet, the collection-based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??????) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
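The difference between a chained component and one that issues its own query can be sketched roughly as follows. This is a minimal illustration only: `RequestState`, `Component`, `QueryStage` and `ClusteringStage` are hypothetical stand-ins invented for this sketch, not Solr's actual SearchComponent or ResponseBuilder classes, and the "clustering" is a toy grouping by first word rather than Carrot2.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ComponentChainSketch {
    /** Hypothetical per-request state shared by every component in the chain. */
    static class RequestState {
        final List<String> docList = new ArrayList<>();              // filled by the query stage
        final Map<String, Object> response = new LinkedHashMap<>();  // named response sections
    }

    /** Hypothetical chained component: reads and extends the shared state. */
    interface Component {
        void process(RequestState state);
    }

    /** Query stage: hard-coded documents stand in for a real search. */
    static class QueryStage implements Component {
        public void process(RequestState state) {
            state.docList.add("apache solr search");
            state.docList.add("apache lucene search");
            state.docList.add("carrot2 clustering");
            state.response.put("docs", new ArrayList<>(state.docList));
        }
    }

    /** Clustering stage: reuses the doc list already in the state; it never re-runs the query. */
    static class ClusteringStage implements Component {
        public void process(RequestState state) {
            Map<String, List<String>> clusters = new LinkedHashMap<>();
            for (String doc : state.docList) {
                String label = doc.split(" ")[0];  // toy placeholder for a clustering algorithm
                clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(doc);
            }
            state.response.put("clusters", clusters);
        }
    }

    /** Runs each component in order against one shared state and returns the response. */
    static Map<String, Object> handle(List<Component> chain) {
        RequestState state = new RequestState();
        for (Component c : chain) {
            c.process(state);
        }
        return state.response;
    }

    public static void main(String[] args) {
        List<Component> chain = new ArrayList<>();
        chain.add(new QueryStage());
        chain.add(new ClusteringStage());
        System.out.println(handle(chain));
    }
}
```

The point of the design is that the clustering stage adds a section to a response that an earlier stage has already populated, so only one query is executed per request.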
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638637#action_12638637 ]

Grant Ingersoll commented on SOLR-769:
--------------------------------------
Starting docs at http://wiki.apache.org/solr/ClusteringComponent
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638791#action_12638791 ]

Grant Ingersoll commented on SOLR-769:
--------------------------------------
Patch soon, as a start. I'm going to check in the basic directory structure and libs, and then provide a patch with the source that we can iterate on.
|
|
[jira] Work started: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on SOLR-769 started by Grant Ingersoll.
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------
Attachment: clustering-libs.tar

Clustering libs
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------
Attachment: SOLR-769.patch

First draft of a patch. Notes:

1. Carrot2 uses the Snowball stemmers, but it shouldn't clash, because it slightly changes their names to be like englishStemmer (as opposed to EnglishStemmer). I'm debating whether or not to just re-implement this so that it can use the same Snowball stemmers we use in Solr. Probably not a big deal.

2. I haven't implemented document clustering yet. To do this, I need to set up a background thread that will be spawned to do the clustering, since it is presumably going through some large set of documents and clustering them. It will probably require term vectors. This will introduce a dependency on Mahout, so I'll need a version of that library too.

3. It would be really cool for the Carrot2 implementation to support using other clustering algorithms besides Lingo. Basically, this just needs to be factored into the configuration and the jars included in the distribution. This is not a high priority for me at the moment.

TODO:
More tests.
Decide on output format.
Implement doc. clustering framework part (i.e. spawning of threads, commands).
????
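Note 2 above describes spawning a background thread for whole-collection clustering. A minimal sketch of just the threading part, using only java.util.concurrent, might look like the following. The class and method names are hypothetical, and `clusterAll` is a toy placeholder (it groups document ids by the first word of their text); it does not use term vectors or Mahout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundClusteringSketch {
    // One worker thread, so at most one long-running clustering job executes at a time.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Toy placeholder for a real clustering algorithm: label = lower-cased first word. */
    static Map<String, List<String>> clusterAll(Map<String, String> docsById) {
        Map<String, List<String>> clusters = new TreeMap<>();
        for (Map.Entry<String, String> e : docsById.entrySet()) {
            String label = e.getValue().trim().split("\\s+")[0].toLowerCase();
            clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(e.getKey());
        }
        return clusters;
    }

    /** Submits the job and returns immediately; the caller polls or blocks on the Future. */
    Future<Map<String, List<String>>> start(Map<String, String> docsById) {
        // Copy the input so later changes by the caller cannot affect the running job.
        final Map<String, String> snapshot = new TreeMap<>(docsById);
        Callable<Map<String, List<String>>> task = () -> clusterAll(snapshot);
        return worker.submit(task);
    }

    /** Lets already-submitted jobs finish, then stops the worker thread. */
    void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        BackgroundClusteringSketch sketch = new BackgroundClusteringSketch();
        Map<String, String> docs = new TreeMap<>();
        docs.put("1", "Solr search");
        docs.put("2", "solr facets");
        docs.put("3", "Carrot2 lingo");
        Future<Map<String, List<String>>> pending = sketch.start(docs);
        System.out.println(pending.get());  // blocks until the background job completes
        sketch.shutdown();
    }
}
```

Returning a Future keeps the request thread free while the job runs; where the finished clusters would be stored (and replicated) is the open design question raised in the issue description and is not addressed here.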
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------
Attachment: SOLR-769.patch

More updates, added example
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638814#action_12638814 ]

Grant Ingersoll commented on SOLR-769:
--------------------------------------
Still to do: more testing, get feedback, implement basics of doc. clustering. This last piece will take some more design work. I also need to validate further that the results make sense for search results clustering, but my first look suggests they do.
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638875#action_12638875 ]

Andrzej Bialecki commented on SOLR-769:
----------------------------------------
FYI, Carrot2 does support a handful of different clustering algorithms (the ones I know of are Fuzzy Ants, KMeans and Suffix Tree, in addition to Lingo).
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638924#action_12638924 ]

Grant Ingersoll commented on SOLR-769:
--------------------------------------
Yeah, I probably will include the other jars and make it easy to include them. For now, I wanted to get something basic working for a talk I'm giving on Wednesday night ;-)
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------
Attachment: SOLR-769.patch

Here's a patch that actually passes the tests. Note that there's still a little oddity with the Snowball program that needs to be worked out, so I don't recommend running this patch in production yet. The issue is that both Carrot and Solr depend on Snowball, but on different versions; furthermore, Carrot2 goes one step further and slightly modifies the names of the Snowball stemmers. I will upload new libs in a minute.
|
|
[jira] Updated: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-769:
---------------------------------
Attachment: clustering-libs.tar

Untar in contrib/clustering/lib.
|
|
[jira] Commented: (SOLR-769) Support Document and Search Result clustering

[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639835#action_12639835 ]

Grant Ingersoll commented on SOLR-769:
--------------------------------------
Note, also, that even though I put in support for some of the other C2 (Carrot2) algorithms, I don't think they quite work yet. I think they require passing in more parameters to set some algorithm properties (for instance, for Fuzzy Ants, I think you need to set a depth) and I haven't figured those out yet. If you have C2 experience, insight would be appreciated. For now, stick to Lingo.
|
|
Offer to submit some custom enhancements

Hi all,

I have a handful of custom classes that we've created for our purposes here. I'd like to share them if you think they have value for the rest of the community, but I wanted to check here before creating JIRA tickets and patches.

Here's what I have:

1. DoubleMetaphoneFilter and Factory. This replaces usage of the PhoneticFilter and Factory, allowing access to set maxCodeLength() on the DoubleMetaphone encoder and access to the "alternate" encodings that the encoder provides for some words.

2. JapaneseHalfWidthFilter and Factory. Some Japanese characters (and the Latin alphabet) exist in both a FullWidth and HalfWidth form. This filter normalizes by switching to the FullWidth form for all the characters. I have seen at least one JIRA ticket about this issue. This implementation doesn't rely on Java 1.6.

3. JapaneseHiraganaFilter and Factory. Japanese Hiragana can be translated to Katakana. This filter normalizes to Katakana so that data and queries can come in either way and get hits.

Also, I have been requested to create a prototype that you may be interested in. I'm to construct a QueryResponseWriter that returns documents using Google's Protocol Buffers. This would rely on an existing patch that exposes the OutputStream, but I would like to start the work soon.

Are there license concerns that would block sharing this with you? Is there any interest in this?

Thanks for your consideration,
Todd Feak
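The character mappings behind items 2 and 3 can be sketched with plain code-point arithmetic. This is not Todd's implementation, which is not shown in this thread, and it is not a Lucene TokenFilter; `KanaWidthSketch` and its methods are hypothetical names. It assumes the standard Unicode layout, in which Hiragana U+3041..U+3096 sit exactly 0x60 below their Katakana counterparts and printable ASCII U+0021..U+007E sits 0xFEE0 below the fullwidth forms. Halfwidth Katakana are deliberately left out, since they do not map by a fixed offset and would need a lookup table.

```java
public class KanaWidthSketch {
    /** Converts Hiragana (U+3041..U+3096) to Katakana; all other characters pass through. */
    static String hiraganaToKatakana(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= 0x3041 && c <= 0x3096) {
                c = (char) (c + 0x60);  // fixed offset between the two kana blocks
            }
            out.append(c);
        }
        return out.toString();
    }

    /**
     * Converts printable ASCII (U+0021..U+007E) to its fullwidth form.
     * The space character is left unchanged here, which is a choice made for this sketch.
     */
    static String asciiToFullWidth(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= 0x21 && c <= 0x7E) {
                c = (char) (c + 0xFEE0);  // fixed offset into the fullwidth forms block
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Escapes are used so the example does not depend on the source file encoding.
        String hiragana = "\u3072\u3089\u304C\u306A";
        System.out.println(hiraganaToKatakana(hiragana).equals("\u30D2\u30E9\u30AC\u30CA"));
        System.out.println(asciiToFullWidth("A1").equals("\uFF21\uFF11"));
    }
}
```

Applying the same mapping at both index and query time is what lets data and queries arrive in either form and still match.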

Re: Offer to submit some custom enhancements

Hi Todd,

All of these sound good. Personally, I think analyzers like these belong in Lucene's contrib/analyzers package, with Solr factory implementations built on those, but that's your call.

As for the Protocol Buffers, I am assuming you mean: http://code.google.com/p/protobuf/

That is an Apache license, so it is fine to incorporate. Sounds like it might be a contrib to start, but that's just my take. Sounds like they might be worth using in SolrJ and for distributed search, but I am interested in how they compare to other similar technologies. Can you share your use case for them?

-Grant

On Oct 15, 2008, at 2:48 PM, Feak, Todd wrote:
> Reposting, as I inadvertently thread hijacked on the first one. My bad.
> [...]

--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Offer to submit some custom enhancements

Hi Todd,

AFAIK, Protocol Buffers cannot be used for Solr because they are unable to support the NamedList structure that all Solr components use.

The binary protocol (NamedListCodec) that SolrJ uses to communicate with the Solr server is extremely optimized for our response format. However, it is Java-only.

There are other projects, such as Apache Thrift (http://incubator.apache.org/thrift/) and Etch (both in incubation), which can be looked at. There are a few issues in Thrift which may help us in the future:

https://issues.apache.org/jira/browse/THRIFT-110
https://issues.apache.org/jira/browse/THRIFT-122

On Thu, Oct 16, 2008 at 12:18 AM, Feak, Todd <Todd.Feak@...> wrote:
> Reposting, as I inadvertently thread hijacked on the first one. My bad.
> [...]

--
Regards,
Shalin Shekhar Mangar.

Re: Offer to submit some custom enhancements

Python marshal format supports everything we need and is easy to implement in Java. It is roughly equivalent to JSON, but binary.

http://docs.python.org/library/marshal.html

wunder

On 10/16/08 8:16 AM, "Shalin Shekhar Mangar" <shalinmangar@...> wrote:
> Hi Todd,
>
> AFAIK, protocol buffers cannot be used for Solr because it is unable to
> support the NamedList structure that all Solr components use.
> [...]

RE: Offer to submit some custom enhancements

Regarding the location of the Filters and Factories ... I agree that the Filters would be best located in Lucene, as users of both packages would then have access.

What I'm struggling with is the timing of putting Filters into Lucene, and then Factories into Solr. The Factories in Solr would be useless until the Filters had been accepted and released in Lucene, then the Lucene version upgraded in Solr.

What I'm inclined to do is release the Filters to both, and have the Factories point to the Solr version, until they become available in the Lucene version, then switch them over and drop the Solr version. How is this handled with other new Filter/Factory sets? Just let me know, and I'll get the ball rolling on those.

I'm going to follow up on Protocol Buffers in response to some other messages I see coming in.

Thanks,
Todd Feak

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@...]
Sent: Thursday, October 16, 2008 7:12 AM
To: solr-dev@...
Subject: Re: Offer to submit some custom enhancements

[...]

RE: Offer to submit some custom enhancements

Answering Grant Ingersoll's question about the use case as well, which may clarify things.

Without revealing TOO much about our internal structure, we are in the process of replacing SOAP communications in house with Protocol Buffers. We did evaluate Thrift as well, but decided on Protocol Buffers. A large effort for that conversion is well under way. I've been asked if Solr can support this, and to create a prototype to see if there are similar gains. I don't imagine it will be the gains that we've seen over SOAP, but I do foresee some amount of throughput increase.

So, in response to the suggestions for other binary formatting technologies, my hands are tied. This is the prototype I have to work on for now. If it works out, I will gladly share it. If not, I will share why, and hopefully save others some time.

As for Protocol Buffers not supporting the NamedList structure: Google's documentation strongly suggests that intermediate (bean) classes be created, instead of trying to marshal and unmarshal your object model directly. This intermediate model doesn't have to precisely mirror the NamedList; it can be *any* compromise that gets the data from A to B, as long as the NamedList can be reconstituted on the other side. I'm sure something can be done.

Thanks,
Todd Feak

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:shalinmangar@...]
Sent: Thursday, October 16, 2008 8:17 AM
To: solr-dev@...
Subject: Re: Offer to submit some custom enhancements

[...]
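
To make the "intermediate model" idea concrete, here is a rough sketch of what such a bridge could look like. Everything in it is an assumption made for illustration: Pairs is a simplified stand-in for Solr's NamedList (ordered name/value pairs, duplicate names allowed), WireEntry is a hypothetical bean whose fields would each map onto a simple message field, and no Protocol Buffers API is used at all. It only shows that an ordered, repeated list of entries can carry the structure across and be turned back into the original shape.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; none of these types exist in Solr or Protocol Buffers. */
final class NamedListBridge {

    private NamedListBridge() {}

    /** Simplified stand-in for NamedList: ordered pairs, duplicate names allowed. */
    static final class Pairs {
        final List<String> names = new ArrayList<String>();
        final List<Object> values = new ArrayList<Object>();

        Pairs add(String name, Object value) {
            names.add(name);
            values.add(value);
            return this;
        }
    }

    /** Hypothetical intermediate bean: one entry per name/value pair. */
    static final class WireEntry {
        String name;
        String kind; // "string", "long", or "nested"
        String text;
        long number;
        List<WireEntry> nested = new ArrayList<WireEntry>();
    }

    /** Flattens the pairs into an ordered list of entries, recursing into nested lists. */
    static List<WireEntry> toWire(Pairs pairs) {
        List<WireEntry> out = new ArrayList<WireEntry>();
        for (int i = 0; i < pairs.names.size(); i++) {
            Object value = pairs.values.get(i);
            WireEntry entry = new WireEntry();
            entry.name = pairs.names.get(i);
            if (value instanceof Pairs) {
                entry.kind = "nested";
                entry.nested = toWire((Pairs) value);
            } else if (value instanceof Long || value instanceof Integer) {
                entry.kind = "long"; // integers are widened to long in this sketch
                entry.number = ((Number) value).longValue();
            } else {
                entry.kind = "string"; // every other type is carried as text here
                entry.text = String.valueOf(value);
            }
            out.add(entry);
        }
        return out;
    }

    /** Rebuilds the pairs on the receiving side, preserving order and duplicate names. */
    static Pairs fromWire(List<WireEntry> wire) {
        Pairs pairs = new Pairs();
        for (WireEntry entry : wire) {
            if ("nested".equals(entry.kind)) {
                pairs.add(entry.name, fromWire(entry.nested));
            } else if ("long".equals(entry.kind)) {
                pairs.add(entry.name, Long.valueOf(entry.number));
            } else {
                pairs.add(entry.name, entry.text);
            }
        }
        return pairs;
    }
}
```

The sketch is lossy on purpose: integers come back as longs and every other type comes back as a string. A real bridge would need a kind for each value type Solr puts in a response, which is exactly the sort of compromise described above.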