adding and updating a lot of document to Solr, metadata extraction etc

Hi there!

We are trying to evaluate Apache Solr for our custom search implementation, which includes the following requirements:

- ability to add/update/delete a lot of documents at once
- ability to iterate over all documents returned by a search, as Lucene allows within a HitCollector instance. We would need to extract various fields stored in the index in order to group the results and aggregate them in some way.

After reading the tutorial I've realized that documents are added and removed by passing an XML file to the controller in a POST request. However, our XML files may be very, very large, so I hope there is some other option that avoids going through the HTTP protocol.

Also, I did not find any way in the tutorial to access the search results with all fields so that they can be processed by our application.

I think I simply did not read the documentation well or missed some point, so can somebody please point me to articles that explain the basics of how to achieve my goals?

Thank you very much in advance!

--
Eugene N Dzhurinsky
Re: adding and updating a lot of document to Solr, metadata extraction etc

On Fri, Oct 30, 2009 at 11:23 AM, Eugene Dzhurinsky <bofh@...> wrote:
> Hi there!
>
> We are trying to evaluate Apache Solr for our custom search implementation,
> which includes the following requirements:
>
> - ability to add/update/delete a lot of documents at once
>
> - ability to iterate over all documents, returned in search, as Lucene does
> provide within a HitCollector instance. We would need to extract and
> aggregate various fields, stored in index, to group results and aggregate
> them in some way.
>
> After reading the tutorial I've realized that adding and removal of documents
> is performed through passing an XML file to controller in POST request.
> However our XML files may be very, very large - so I hope there is some
> another option to avoid interaction through HTTP protocol.
>
> Also I did not find any way in the tutorial to access the search results with
> all fields to be processed by our application.
>
> I think I simply did not read the documentation well or missed some point, so
> can somebody please point me to the articles, which may explain basics of how
> to achieve my goals?
>
> Thank you very much in advance!
>
> --
> Eugene N Dzhurinsky

Hi Eugene,

Solr has an embedded version, but you are encouraged to use the standard web service interfaces.

Also, the recently released Solr 1.4 white paper talks about the StreamingUpdateSolrServer which, according to the white paper, can index documents at lightning speed, up to 25K documents per second.

The white paper can be downloaded here:
http://www.lucidimagination.com/whitepaper/whats-new-in-solr-1-4

Info about StreamingUpdateSolrServer is available here:
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

If you are still interested in the embedded version to avoid HTTP, you can check out the following links:
http://wiki.apache.org/solr/EmbeddedSolr
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.html

I hope this helps.

--
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
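For readers who stay with the standard HTTP interface, one common way to keep very large imports manageable is to send many moderately sized XML update messages instead of one huge file. Below is a minimal sketch of that idea in Python (standard library only). It is not the StreamingUpdateSolrServer mentioned above, which is a Java/SolrJ class. The function names, the batch size, and the update URL in `post_update` are assumptions made for illustration; only the `<add><doc><field name="...">` and `<delete><id>` message shapes come from Solr's XML update format.

```python
import urllib.request
import xml.etree.ElementTree as ET
from itertools import islice


def build_add_message(docs):
    """Serialize an iterable of {field: value} dicts into one Solr <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list or tuple becomes repeated <field> elements (multi-valued field).
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


def build_delete_message(ids):
    """Serialize document ids into one Solr <delete> message."""
    delete = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def batched(docs, size):
    """Yield lists of at most `size` documents so no single POST body grows unbounded."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def post_update(xml_message, url="http://localhost:8983/solr/update"):
    """POST one update message. The URL is an assumed default; adjust to your setup."""
    request = urllib.request.Request(
        url,
        data=xml_message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A caller would loop over `batched(all_docs, 1000)`, pass each chunk through `build_add_message` and `post_update`, and send a final `<commit/>` message once at the end rather than after every batch.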
Re: adding and updating a lot of document to Solr, metadata extraction etc

Hi Eugene,

> - ability to iterate over all documents, returned in search, as Lucene does
> provide within a HitCollector instance. We would need to extract and
> aggregate various fields, stored in index, to group results and aggregate them
> in some way.
> ....
> Also I did not find any way in the tutorial to access the search results with
> all fields to be processed by our application.

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr

Check out Faceted Search; you can probably achieve your goal by using the Facet Component.

There's also the Field Collapsing patch:
http://wiki.apache.org/solr/FieldCollapsing

Alex
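To make the faceting suggestion concrete, here is a minimal sketch of requesting per-value counts for one field and reading them back from a JSON response. It assumes the JSON response writer (`wt=json`) with its default flat `[value, count, value, count, ...]` layout for facet fields; the field name `category` and the sample response are made up for illustration, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode


def facet_query(field, query="*:*"):
    """Build a query string that asks only for facet counts (rows=0 skips the documents)."""
    return urlencode(
        {"q": query, "rows": 0, "facet": "true", "facet.field": field, "wt": "json"}
    )


def facet_counts(response_text, field):
    """Turn the flat [value, count, value, count, ...] list into a {value: count} dict."""
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample in the shape described above, not real server output.
SAMPLE_RESPONSE = json.dumps(
    {
        "response": {"numFound": 17, "start": 0, "docs": []},
        "facet_counts": {"facet_fields": {"category": ["books", 12, "music", 5]}},
    }
)
```

With this approach the grouping and counting happen inside Solr, so the application does not have to walk every hit the way a Lucene HitCollector would.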
Re: adding and updating a lot of document to Solr, metadata extraction etc

About large XML files and HTTP overhead: you can tell Solr to load the file directly from a file system. This will stream thousands of documents in one XML file without loading everything in memory at once.

This is a new book on Solr. It will help you through this early learning phase.

http://www.packtpub.com/solr-1-4-enterprise-search-server

On Mon, Nov 2, 2009 at 6:24 AM, Alexey Serba <aserba@...> wrote:
> Hi Eugene,
>
>> - ability to iterate over all documents, returned in search, as Lucene does
>> provide within a HitCollector instance. We would need to extract and
>> aggregate various fields, stored in index, to group results and aggregate them
>> in some way.
>> ....
>> Also I did not find any way in the tutorial to access the search results with
>> all fields to be processed by our application.
>>
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
> Check out Faceted Search, probably you can achieve your goal by using
> Facet Component
>
> There's also Field Collapsing patch
> http://wiki.apache.org/solr/FieldCollapsing
>
>
> Alex

--
Lance Norskog
goksron@...
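The message above does not say which mechanism loads the file; one likely candidate is Solr's remote streaming feature, where the update request carries a `stream.file` parameter naming a file that Solr reads itself. The sketch below only builds such a request URL. It assumes remote streaming is enabled in solrconfig.xml (`enableRemoteStreaming="true"` on `<requestParsers>`), that the path is readable on the Solr server (not on the client), and that the base URL and file path shown are placeholders.

```python
from urllib.parse import urlencode


def stream_file_url(base_url, server_side_path, commit=True):
    """Build an update URL that asks Solr to read an XML file from its own file system."""
    params = {"stream.file": server_side_path}
    if commit:
        # Ask Solr to commit once the whole file has been processed.
        params["commit"] = "true"
    return base_url.rstrip("/") + "/update?" + urlencode(params)


# Placeholder values for illustration only.
EXAMPLE_URL = stream_file_url("http://localhost:8983/solr", "/data/docs.xml")
```

Requesting that URL sends only a short request over HTTP; the document data itself never travels through the client's connection.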
Re: adding and updating a lot of document to Solr, metadata extraction etc

On Mon, Nov 02, 2009 at 05:45:37PM -0800, Lance Norskog wrote:
> About large XML files and http overhead: you can tell solr to load the
> file directly from a file system. This will stream thousands of
> documents in one XML file without loading everything in memory at
> once.
>
> This is a new book on Solr. It will help you through this early learning phase.
>
> http://www.packtpub.com/solr-1-4-enterprise-search-server

Thank you, but we have to prepare a proof of concept with the stable version. I haven't seen any 1.4.0 artifacts released to repo1.maven.org so far.

Additionally, I've learned about http://wiki.apache.org/solr/DataImportHandler and it looks like this is the preferred way in my case.

I have a lot of HTML pages on disk storage, and some metadata stored in SQL tables. What I seem to need is to provide some sort of EntityProcessor and DataSource to DataImportHandler. Additionally, I will need to provide some properties that tell the data source how to retrieve the data (table names etc).

So maybe there is some tutorial or how-to describing how to create custom classes for importing data into Solr 1.3.0?

Thank you in advance!

--
Eugene N Dzhurinsky
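As an illustration of the join described above (metadata rows in SQL, page bodies on disk), here is a minimal client-side sketch in Python. It is not a DataImportHandler EntityProcessor or DataSource, which would be Java classes; it only shows the same data flow done outside Solr. The table `pages(id, title, filename)`, the output field names, and the helper names are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML page and ignore the markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Note: this also keeps the contents of <script> and <style> elements.
        if data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible-ish text of a page as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def iter_documents(conn, html_dir):
    """Yield one {field: value} dict per metadata row, joined with its HTML file."""
    # Hypothetical schema: pages(id, title, filename).
    rows = conn.execute("SELECT id, title, filename FROM pages ORDER BY id")
    for doc_id, title, filename in rows:
        html = (Path(html_dir) / filename).read_text(encoding="utf-8")
        yield {"id": doc_id, "title": title, "text": html_to_text(html)}
```

Because `iter_documents` is a generator, only one page is held in memory at a time, and the resulting dicts can be serialized into update messages in whatever batch size suits the index.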
Re: adding and updating a lot of document to Solr, metadata extraction etc

The DIH has improved a great deal from Solr 1.3 to 1.4. You will be much better off using the DIH from 1.4.

This is the current Solr release candidate binary:
http://people.apache.org/~gsingers/solr/1.4.0/

On Tue, Nov 3, 2009 at 8:08 AM, Eugene Dzhurinsky <bofh@...> wrote:
> On Mon, Nov 02, 2009 at 05:45:37PM -0800, Lance Norskog wrote:
>> About large XML files and http overhead: you can tell solr to load the
>> file directly from a file system. This will stream thousands of
>> documents in one XML file without loading everything in memory at
>> once.
>>
>> This is a new book on Solr. It will help you through this early learning phase.
>>
>> http://www.packtpub.com/solr-1-4-enterprise-search-server
>
> Thank you, but we have to prepare some proof of concept with the stable
> version. I didn't see any 1.4.0 artifacts released to repo1.maven.org for now.
>
> Additionally, I've learned about http://wiki.apache.org/solr/DataImportHandler
> and looks like this way is preferred in my case.
>
> I do have a lot of HTML pages on disk storage, and some metadata being stored
> in SQL tables. What I seem to need is to provide some sort of EntityProcessor
> and DataSource to DataImportHandler. Additionally I will need to provide some
> sort of properties to instruct data source for data retrieval (table names
> etc).
>
> So may be there is some tutorial or how-to, describing the process of creation
> of custom classes for importing the data into Solr 1.3.0?
>
> Thank you in advance!
>
> --
> Eugene N Dzhurinsky

--
Lance Norskog
goksron@...
Re: adding and updating a lot of document to Solr, metadata extraction etc

On Tue, Nov 03, 2009 at 05:49:23PM -0800, Lance Norskog wrote:
> The DIH has improved a great deal from Solr 1.3 to 1.4. You will be
> much better off using the DIH from this.
>
> This is the current Solr release candidate binary:
> http://people.apache.org/~gsingers/solr/1.4.0/

In fact, we are prohibited from using release candidates or nightly builds; we are forced to use only official releases of Solr :(

--
Eugene N Dzhurinsky
Re: adding and updating a lot of document to Solr, metadata extraction etc

On Tue, Nov 10, 2009 at 8:26 AM, Eugene Dzhurinsky <bofh@...> wrote:
> On Tue, Nov 03, 2009 at 05:49:23PM -0800, Lance Norskog wrote:
> > The DIH has improved a great deal from Solr 1.3 to 1.4. You will be
> > much better off using the DIH from this.
> >
> > This is the current Solr release candidate binary:
> > http://people.apache.org/~gsingers/solr/1.4.0/
>
> In fact we are prohibited to use release candidates/nightly builds, we are
> forced to use only releases of Solr :(
>
> --
> Eugene N Dzhurinsky

Well, the official release is out and you can pick it up from your closest mirror here:
http://www.apache.org/dyn/closer.cgi/lucene/solr/

--
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.