adding and updating a lot of document to Solr, metadata extraction etc

by bofh-6

Hi there!

We are trying to evaluate Apache Solr for our custom search implementation, which
includes the following requirements:

- ability to add/update/delete a lot of documents at once

- ability to iterate over all documents returned by a search, as Lucene allows
  through a HitCollector instance. We need to extract various fields stored in
  the index so that we can group and aggregate the results.

After reading the tutorial I realized that documents are added and removed by
sending an XML file to Solr in a POST request. However, our XML files may be
very, very large, so I hope there is another option that avoids going through
HTTP.
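
For illustration, the XML update messages mentioned above could be generated along these lines. This is only a minimal sketch: the field names (`id`, `title`, `tag`) are hypothetical and would have to match the fields declared in your schema.xml, and the helper names are made up for this example. It only builds the message bodies; sending them to the update handler is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field: value} dicts as one <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes one <field> element per entry.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                ET.SubElement(doc_el, "field", name=name).text = str(item)
    return ET.tostring(add, encoding="unicode")


def build_delete_by_id_xml(doc_id):
    """Build a <delete> message that removes one document by its unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def build_delete_by_query_xml(query):
    """Build a <delete> message that removes every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical documents; field names are assumptions, not from the thread.
    docs = [
        {"id": "page-1", "title": "First page", "tag": ["solr", "search"]},
        {"id": "page-2", "title": "Second page"},
    ]
    print(build_add_xml(docs))
    print(build_delete_by_id_xml("page-1"))
    print(build_delete_by_query_xml("title:obsolete"))
```

Adding a document whose unique key already exists replaces the old version, so the same `<add>` message covers updates as well. Changes become visible to searches only after a `<commit/>` message has been sent.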

Also, I did not find any way in the tutorial to access the search results with
all their fields so that our application can process them.
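
One possible way to get every stored field of every hit into an application is to page through the standard select handler with `start` and `rows` and read the JSON response. The sketch below assumes the default example URL `http://localhost:8983/solr/select` and a hypothetical `category` field; it builds the request URLs and parses a response body, but does not perform the HTTP call itself.

```python
import json
from collections import Counter
from urllib.parse import urlencode

# Assumed location of the select handler (the Solr example default).
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def select_url(query, start=0, rows=100, fields="*"):
    """Build a select URL that returns the requested stored fields as JSON."""
    params = {"q": query, "start": start, "rows": rows, "fl": fields, "wt": "json"}
    return SOLR_SELECT_URL + "?" + urlencode(params)


def page_starts(num_found, rows):
    """Offsets of the pages needed to walk num_found hits, rows at a time."""
    return list(range(0, num_found, rows))


def docs_from_response(body):
    """Return (numFound, docs) from a JSON response body."""
    response = json.loads(body)["response"]
    return response["numFound"], response["docs"]


def count_by_field(docs, field):
    """Aggregate hits on the client side by the value of one stored field."""
    return Counter(doc.get(field) for doc in docs)


if __name__ == "__main__":
    # A hand-written sample body in the shape of a Solr JSON response.
    sample = json.dumps({
        "response": {
            "numFound": 3,
            "start": 0,
            "docs": [
                {"id": "1", "category": "news"},
                {"id": "2", "category": "news"},
                {"id": "3", "category": "blog"},
            ],
        }
    })
    num_found, docs = docs_from_response(sample)
    for start in page_starts(num_found, rows=2):
        print(select_url("title:solr", start=start, rows=2))
    print(count_by_field(docs, "category"))
```

Only fields marked as stored in the schema come back in the response, and walking a very large result set page by page is slow, so server-side aggregation may be a better fit when it covers the use case.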

I think I simply did not read the documentation well or missed some point, so
can somebody please point me to articles that explain the basics of how to
achieve my goals?

Thank you very much in advance!

--
Eugene N Dzhurinsky



Re: adding and updating a lot of document to Solr, metadata extraction etc

by Israel Ekpo

On Fri, Oct 30, 2009 at 11:23 AM, Eugene Dzhurinsky <bofh@...> wrote:

> Hi there!
>
> We are trying to evaluate Apache Solr for our custom search implementation,
> which
> includes the following requirements:
>
> - ability to add/update/delete a lot of documents at once
>
> - ability to iterate over all documents, returned in search, as Lucene does
>  provide within a HitCollector instance. We would need to extract and
>  aggregate various fields, stored in index, to group results and aggregate
> them
>  in some way.
>
> After reading the tutorial I've realized that adding and removal of
> documents
> is performed through passing an XML file to controller in POST request.
> However our XML files may be very, very large - so I hope there is some
> another option to avoid interaction through HTTP protocol.
>
> Also I did not find any way in the tutorial to access the search results
> with
> all fields to be processed by our application.
>
> I think I simply did not read the documentation well or missed some point,
> so
> can somebody please point me to the articles, which may explain basics of
> how
> to achieve my goals?
>
> Thank you very much in advance!
>
> --
> Eugene N Dzhurinsky
>

Hi Eugene

Solr has an embedded version but you are encouraged to use the standard web
service interfaces.

Also, the recently released Solr 1.4 white paper talks about the
StreamingUpdateSolrServer, which according to the white paper can index up to
25K documents per second.

The white paper can be downloaded here:

http://www.lucidimagination.com/whitepaper/whats-new-in-solr-1-4

Info about StreamingUpdateSolrServer is available here:

http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

If you are still interested in the embedded version as a way to avoid HTTP,
you can check out the following links:

http://wiki.apache.org/solr/EmbeddedSolr

http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.html

I hope this helps.

--
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.

Re: adding and updating a lot of document to Solr, metadata extraction etc

by Alexey-34

Hi Eugene,

> - ability to iterate over all documents, returned in search, as Lucene does
>  provide within a HitCollector instance. We would need to extract and
>  aggregate various fields, stored in index, to group results and aggregate them
>  in some way.
> ....
> Also I did not find any way in the tutorial to access the search results with
> all fields to be processed by our application.
>
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
Check out faceted search; you can probably achieve your goal by using the
Facet Component.

There's also the Field Collapsing patch:
http://wiki.apache.org/solr/FieldCollapsing
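
As a rough illustration of the faceting suggestion, the sketch below builds a facet request and turns the facet counts from a JSON response into a dictionary. The URL is the Solr example default and the `category` field is hypothetical; the parsing assumes the default flat JSON layout, in which each facet field is a list alternating between value and count.

```python
import json
from urllib.parse import urlencode

# Assumed location of the select handler (the Solr example default).
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def facet_url(query, facet_field, min_count=1):
    """Build a URL that asks only for facet counts (rows=0 skips the hits)."""
    params = [
        ("q", query),
        ("rows", 0),
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", min_count),
        ("wt", "json"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def facet_counts(body, facet_field):
    """Convert the flat [value, count, value, count, ...] list into a dict."""
    flat = json.loads(body)["facet_counts"]["facet_fields"][facet_field]
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    print(facet_url("*:*", "category"))
    # A hand-written sample body in the shape of a Solr facet response.
    sample = json.dumps({
        "facet_counts": {"facet_fields": {"category": ["news", 12, "blog", 5]}}
    })
    print(facet_counts(sample, "category"))
```

With faceting the grouping and counting happen inside Solr, so the application does not have to iterate over every matching document just to count them per field value.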


Alex

Re: adding and updating a lot of document to Solr, metadata extraction etc

by Lance Norskog-2

About large XML files and HTTP overhead: you can tell Solr to load the
file directly from the file system. This streams thousands of documents
from one XML file without loading everything into memory at once.
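
One way to do this is the `stream.file` request parameter, which names a file on the Solr server's own file system. The sketch below only builds such a URL; the update URL and the file path are assumptions for illustration, and remote streaming has to be enabled in solrconfig.xml (`enableRemoteStreaming="true"` on `requestParsers`) for the parameter to be accepted.

```python
from urllib.parse import urlencode

# Assumed location of the update handler (the Solr example default).
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def stream_file_url(server_side_path, content_type="text/xml; charset=utf-8"):
    """Build an update URL that makes Solr read the XML from its local disk.

    The path is resolved on the machine running Solr, not on the client.
    """
    params = [
        ("stream.file", server_side_path),
        ("stream.contentType", content_type),
    ]
    return SOLR_UPDATE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical path to a large <add>...</add> file on the Solr host.
    print(stream_file_url("/data/solr/docs.xml"))
```

The request itself stays tiny because only the file name travels over HTTP. Unless autocommit is configured, a `<commit/>` message is still needed afterwards to make the documents searchable.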

This is a new book on Solr. It will help you through this early learning phase.

http://www.packtpub.com/solr-1-4-enterprise-search-server

On Mon, Nov 2, 2009 at 6:24 AM, Alexey Serba <aserba@...> wrote:

> Hi Eugene,
>
>> - ability to iterate over all documents, returned in search, as Lucene does
>>  provide within a HitCollector instance. We would need to extract and
>>  aggregate various fields, stored in index, to group results and aggregate them
>>  in some way.
>> ....
>> Also I did not find any way in the tutorial to access the search results with
>> all fields to be processed by our application.
>>
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
> Check out Faceted Search, probably you can achieve your goal by using
> Facet Component
>
> There's also Field Collapsing patch
> http://wiki.apache.org/solr/FieldCollapsing
>
>
> Alex
>



--
Lance Norskog
goksron@...

Re: adding and updating a lot of document to Solr, metadata extraction etc

by bofh-6

On Mon, Nov 02, 2009 at 05:45:37PM -0800, Lance Norskog wrote:
> About large XML files and http overhead: you can tell solr to load the
> file directly from a file system. This will stream thousands of
> documents in one XML file without loading everything in memory at
> once.
>
> This is a new book on Solr. It will help you through this early learning phase.
>
> http://www.packtpub.com/solr-1-4-enterprise-search-server

Thank you, but we have to prepare a proof of concept with a stable
version. I haven't seen any 1.4.0 artifacts released to repo1.maven.org yet.

Additionally, I've learned about http://wiki.apache.org/solr/DataImportHandler
and it looks like this is the preferred approach in my case.

I have a lot of HTML pages on disk and some metadata stored in SQL tables.
What I seem to need is to provide some sort of EntityProcessor and DataSource
to the DataImportHandler. I will also need to provide some properties that
tell the data source how to retrieve the data (table names etc).
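
For the SQL half of this setup, a custom class may not be needed: the DataImportHandler reads a data-config.xml that names a JDBC data source and maps query columns to index fields. The sketch below generates a minimal config of that shape. The driver, JDBC URL, credentials, table and column names are all hypothetical, and reading the HTML pages from disk is not covered here.

```python
import xml.etree.ElementTree as ET


def build_data_config(driver, url, user, password, entity_name, query, columns):
    """Build a minimal DataImportHandler config for one SQL-backed entity.

    columns maps SQL column names to Solr field names.
    """
    root = ET.Element("dataConfig")
    ET.SubElement(
        root, "dataSource",
        type="JdbcDataSource", driver=driver, url=url, user=user, password=password,
    )
    document = ET.SubElement(root, "document")
    entity = ET.SubElement(document, "entity", name=entity_name, query=query)
    for column, field in columns.items():
        ET.SubElement(entity, "field", column=column, name=field)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Every value below is a placeholder chosen for this example.
    print(build_data_config(
        driver="com.mysql.jdbc.Driver",
        url="jdbc:mysql://localhost/pages_db",
        user="solr",
        password="secret",
        entity_name="page",
        query="SELECT id, title, author FROM pages",
        columns={"id": "id", "title": "title", "author": "author"},
    ))
```

Once such a file is registered with the handler in solrconfig.xml, an import is typically started by requesting the handler with `command=full-import`.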

So maybe there is a tutorial or how-to describing how to create custom
classes for importing data into Solr 1.3.0?

Thank you in advance!

--
Eugene N Dzhurinsky



Re: adding and updating a lot of document to Solr, metadata extraction etc

by Lance Norskog-2

The DIH has improved a great deal from Solr 1.3 to 1.4. You will be
much better off using the DIH from 1.4.

This is the current Solr release candidate binary:
http://people.apache.org/~gsingers/solr/1.4.0/

On Tue, Nov 3, 2009 at 8:08 AM, Eugene Dzhurinsky <bofh@...> wrote:

> On Mon, Nov 02, 2009 at 05:45:37PM -0800, Lance Norskog wrote:
>> About large XML files and http overhead: you can tell solr to load the
>> file directly from a file system. This will stream thousands of
>> documents in one XML file without loading everything in memory at
>> once.
>>
>> This is a new book on Solr. It will help you through this early learning phase.
>>
>> http://www.packtpub.com/solr-1-4-enterprise-search-server
>
> Thank you, but we have to prepare some proof of concept with the stable
> version. I didn't see any 1.4.0 artifacts released to repo1.maven.org for now.
>
> Additionally, I've learned about http://wiki.apache.org/solr/DataImportHandler
> and looks like this way is preferred in my case.
>
> I do have a lot of HTML pages on disk storage, and some metadata being stored
> in SQL tables. What I seem to need is to provide some sort of EntityProcessor
> and DataSource to DataImportHandler. Additionally I will need to provide some
> sort of properties to instruct data source for data retrieval (table names
> etc).
>
> So may be there is some tutorial or how-to, describing the process of creation
> of custom classes for importing the data into Solr 1.3.0?
>
> Thank you in advance!
>
> --
> Eugene N Dzhurinsky
>



--
Lance Norskog
goksron@...

Re: adding and updating a lot of document to Solr, metadata extraction etc

by bofh-6

On Tue, Nov 03, 2009 at 05:49:23PM -0800, Lance Norskog wrote:
> The DIH has improved a great deal from Solr 1.3 to 1.4. You will be
> much better off using the DIH from this.
>
> This is the current Solr release candidate binary:
> http://people.apache.org/~gsingers/solr/1.4.0/

In fact, we are prohibited from using release candidates or nightly builds;
we are allowed to use only official releases of Solr :(

--
Eugene N Dzhurinsky



Re: adding and updating a lot of document to Solr, metadata extraction etc

by Israel Ekpo

On Tue, Nov 10, 2009 at 8:26 AM, Eugene Dzhurinsky <bofh@...> wrote:

> On Tue, Nov 03, 2009 at 05:49:23PM -0800, Lance Norskog wrote:
> > The DIH has improved a great deal from Solr 1.3 to 1.4. You will be
> > much better off using the DIH from this.
> >
> > This is the current Solr release candidate binary:
> > http://people.apache.org/~gsingers/solr/1.4.0/
>
> In fact we are prohibited to use release candidates/nightly builds, we are
> forced to use only releases of Solr :(
>
> --
> Eugene N Dzhurinsky
>


Well, the official release is out, and you can pick it up from your closest
mirror here:

http://www.apache.org/dyn/closer.cgi/lucene/solr/


--
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.