Using lucene as a database... good idea or bad idea?

by John Evans-10

Hi All,

I have successfully used Lucene in the "traditional" way to provide
full-text search for various websites.  Now I am tasked with developing a
data-store to back a web crawler.  The crawler can be configured to retrieve
arbitrary fields from arbitrary pages, so the result is that each document
may have a random assortment of fields.  It seems like Lucene may be a
natural fit for this scenario since you can add arbitrary fields
to each document and you can store the actual data in the index.  I've
done some research to make sure that it would meet all of our individual
requirements (that we can iterate over documents, update (delete/replace)
documents, etc.) and everything looks good.  I've also seen a couple of
references around the net to other people trying similar things.  However,
I know it's not meant to be used this way, so I thought I would post here
and ask for guidance.  Has anyone done something similar?  Is there any
specific reason to think this is a bad idea?
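
The requirements listed above (arbitrary fields per document, iteration, and update by delete/replace) can be modeled in a few lines. This is only a sketch in plain Java, not Lucene code: the class name SchemalessStore and its methods are hypothetical, and everything is kept in memory. It is meant to show the semantics a Lucene-backed store would have to provide, in particular that an update is a delete of the old document followed by an add of the new one, keyed on an ID.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory model of a schemaless document store.
// Each document is a bag of field name -> value pairs; no two documents
// need to share the same set of fields.
final class SchemalessStore {
    // Documents keyed by their unique ID.
    private final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    // Add a document, replacing any existing document with the same ID.
    // This mirrors the delete-then-add update style described in the post.
    void update(String id, Map<String, String> fields) {
        docs.remove(id);                            // delete the old version, if any
        docs.put(id, new LinkedHashMap<>(fields));  // add the new version
    }

    // Retrieve one document by ID, or null if it does not exist.
    Map<String, String> get(String id) {
        return docs.get(id);
    }

    // Iterate over every stored document.
    Collection<Map<String, String>> all() {
        return docs.values();
    }

    // Number of documents currently stored.
    int size() {
        return docs.size();
    }
}
```

Two documents stored this way can have entirely different field sets, and calling update twice with the same ID leaves a single document behind.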

The one thing that I am least certain about is how well it will scale.  We
may reach the point where we have tens of millions of documents and a high
percentage of those documents may be relatively large (10k-50k each).  We
actually would NOT be expecting/needing Lucene's normal extremely fast text
search times for this, but we would need reasonable times for adding new
documents to the index, retrieving documents by ID (for iterating over all
documents), optimizing the index after a series of changes, etc.
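
To put the scaling question in numbers, here is a rough back-of-the-envelope calculation. The inputs are assumptions chosen only to illustrate the arithmetic: "tens of millions" is taken as 10 to 50 million documents, and every document is assumed to fall in the 10k-50k range (read here as KiB), which overstates the total since only a high percentage are that large. Index overhead and compression are ignored.

```java
// Rough estimate of raw stored-data volume; all inputs are assumptions.
final class StoreSizeEstimate {
    // Total bytes for docCount documents of bytesPerDoc bytes each.
    static long totalBytes(long docCount, long bytesPerDoc) {
        return docCount * bytesPerDoc;
    }

    // Convert bytes to gibibytes (2^30 bytes).
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long low = totalBytes(10_000_000L, 10L * 1024);   // 10M docs x 10 KiB
        long high = totalBytes(50_000_000L, 50L * 1024);  // 50M docs x 50 KiB
        System.out.printf("low:  %.1f GiB%n", toGiB(low));
        System.out.printf("high: %.1f GiB%n", toGiB(high));
    }
}
```

Under these assumptions the raw data lands somewhere between roughly 95 GiB and 2.3 TiB, so the answer to "will it scale" depends heavily on which end of the range the crawl actually reaches.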

Any advice/input/theories anyone can contribute would be greatly
appreciated.

Thanks,
-
John

Re: Using lucene as a database... good idea or bad idea?

by Hasan Diwan

Check the Nutch or Solr projects, both of which are subprojects of Lucene. Feel free to drop me a line if you run into difficulties.

-----Original Message-----
From: "John Evans" <john@...>

Date: Mon, 28 Jul 2008 18:53:08
To: <java-user@...>
Subject: Using lucene as a database... good idea or bad idea?


Re: Using lucene as a database... good idea or bad idea?

by Ian Lea

John


I think it's a great idea, and I do exactly this to store 5 million+
documents holding information that takes far too long (think days) to
get out of our Oracle database.  That is not as many docs as you are
talking about, and less data for each doc, but I wouldn't have any concerns
about scaling.  There are certainly Lucene indexes out there bigger
than what you propose.  You can compress the stored data to save some
space.  Run times for optimization might get interesting, but see
recent threads for suggestions on that.  And since you are not too
concerned about performance, you may not need to optimize much, or even
at all.
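
As an illustration of the compression point, here is a minimal sketch of compressing a field value before it is stored and decompressing it on read. It uses only java.util.zip from the JDK and does not rely on any Lucene feature; the class name StoredValueCodec is hypothetical, and whether compression pays off depends on how repetitive the crawled text is.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: compress text before storing it, decompress on read.
final class StoredValueCodec {
    // Deflate the UTF-8 bytes of the given text.
    static byte[] compress(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate bytes produced by compress() back into the original text.
    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The trade-off is extra CPU on every write and read in exchange for a smaller store, which may be acceptable here since fast search times are not the priority.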

Of course you need to remember that this is not a DBMS solution in the
sense of transactions, recovery, etc. but I'm sure you are already
aware of that.


--
Ian.


On Tue, Jul 29, 2008 at 2:53 AM, John Evans <john@...> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@...
For additional commands, e-mail: java-user-help@...


Re: Using lucene as a database... good idea or bad idea?

by Ganesh - yahoo

Hello all,

I am also interested in this. I want to archive document content
using Lucene.

Is it a good idea to use Lucene as a storage engine?

Regards
Ganesh

----- Original Message -----
From: "Ian Lea" <ian.lea@...>
To: <java-user@...>
Sent: Tuesday, July 29, 2008 2:18 PM
Subject: Re: Using lucene as a database... good idea or bad idea?


Re: Using lucene as a database... good idea or bad idea?

by Grant Ingersoll-6

I think the answer is that it can be done, and probably quite well.  I also
think it's informative that Nutch does not use Lucene for this
function, as I understand it, but that shouldn't stop you either.  You
might also have a look at Apache Jackrabbit, which uses Lucene
underneath as a content repository.

-Grant

On Jul 29, 2008, at 5:34 AM, Ganesh - yahoo wrote:

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Using lucene as a database... good idea or bad idea?

by ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)

The way I see it, search solutions (on whatever scale) have three components:
data aggregation, indexing/searching, and presentation of results. I
thought Lucene did only the second part.

So I do not quite follow: why should Lucene be used as a datastore?

Nagesh

On Tue, Jul 29, 2008 at 6:01 PM, Grant Ingersoll <gsingers@...> wrote:

Re: Using lucene as a database... good idea or bad idea?

by Ian Lea

I don't think that anyone in this thread has said "should", just
"could": it is a valid option (IMHO).  Personally, I use it as a
store for Lucene-related data because I know, like, and trust it; it
is already there for this project, so there is no need to introduce another
software dependency; and it is blindingly fast.


--
Ian.


On Tue, Jul 29, 2008 at 1:43 PM, ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)
<nageshblore@...> wrote:

Re: Using lucene as a database... good idea or bad idea?

by Grant Ingersoll-6

Agreed, no one is saying "should".  Additionally, Lucene can be faster
for a number of things, such as storage, when a database is overkill (i.e.
you don't need transactions, complex joins, etc.).  After all, even the
lookup of a file can be viewed as a "search", even if it is just for
a single unique key and doesn't require any fuzziness.
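
The idea that a key lookup is just a degenerate search can be sketched with a toy inverted index. This is a plain-Java illustration under stated assumptions, not how Lucene is implemented: the class TinyInvertedIndex and its "field:value" term encoding are hypothetical. The same exact-match lookup serves both a unique ID (one hit) and an ordinary field value (many hits).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy inverted index: maps "field:value" terms to document numbers.
final class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index one untokenized field value for a document.
    void add(int docNum, String field, String value) {
        postings.computeIfAbsent(field + ":" + value, k -> new TreeSet<>()).add(docNum);
    }

    // An exact-match "search": return every document containing the term.
    Set<Integer> search(String field, String value) {
        return postings.getOrDefault(field + ":" + value, Collections.emptySet());
    }
}
```

Searching on a field whose values are unique per document returns at most one document number, which is all a retrieve-by-ID operation needs.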


On Jul 29, 2008, at 9:21 AM, Ian Lea wrote:

Re: Using lucene as a database... good idea or bad idea?

by ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)

Hi Ian,
Yes, I see that we are discussing an "option" here.

But as I said before (the three parts of a search-based solution), I do not
know, but would like to know, how Lucene (Java only, not Nutch, Solr,
etc.) can be used as a datastore.

Basically, I am not able to connect "database" and Lucene Java. :)

Nagesh


On Tue, Jul 29, 2008 at 6:51 PM, Ian Lea <ian.lea@...> wrote:

> I don't think that anyone in this thread has said "should", just
> "could" - it is a valid option (IMHO).  Personally, I use it as a
> store for lucene related data because I know and like and trust it, it
> is already there for this project so no need to introduce another
> software dependency, and because it is blindingly fast.
>
>
> --
> Ian.

Re: Using lucene as a database... good idea or bad idea?

by yarram

Have a look at Compass, a wrapper for Lucene.

Regards,
Aravind R Yarram
Enabling Technologies
Equifax Information Services LLC

Re: Using lucene as a database... good idea or bad idea?

by Grant Ingersoll-6

Don't connect "database" (i.e. SQL, transactions, etc.) and Lucene.
Connect data storage with simple, fast lookup and Lucene.

One field is the key (e.g. the filename); the other field is a binary,
stored Field containing the contents of the file.  Of course, there
are other ways of slicing and dicing, such that one can also search (in
the fuzzy sense) the content and the key by adding tokenization, etc.
That is the more traditional model for Lucene.
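
The key-plus-stored-field model described above can be sketched in a few lines. This is a toy, in-memory stand-in written in Python, not Lucene's API: the class name KeyedBlobStore and its methods are made up for illustration, and the only assumptions are that the key is matched exactly (untokenized) and that an update is a delete followed by an add.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional


@dataclass
class KeyedBlobStore:
    """Toy stand-in for an index with one key field and one stored binary field."""

    _docs: Dict[str, bytes] = field(default_factory=dict)

    def put(self, key: str, content: bytes) -> None:
        # Update is modelled as delete-then-add, so a key never has two documents.
        self._docs.pop(key, None)
        self._docs[key] = content

    def get(self, key: str) -> Optional[bytes]:
        # Exact-match lookup on the untokenized key; no scoring, no fuzzy matching.
        return self._docs.get(key)

    def keys(self) -> Iterator[str]:
        # Iterate over every stored document by key, in sorted order.
        return iter(sorted(self._docs))


store = KeyedBlobStore()
store.put("pages/index.html", b"<html>first crawl</html>")
store.put("pages/index.html", b"<html>second crawl</html>")  # replaces the first
store.put("pages/about.html", b"<html>about</html>")
```

Looking up "pages/index.html" afterwards returns only the second crawl, which is the delete/replace behaviour John asked about. Anything beyond exact lookup (searching inside the content, for instance) would need the tokenized fields mentioned above.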

Also, have a look at Apache Jackrabbit.  It is a content repository
that uses Lucene underneath for indexing and search.

-Grant


Re: Using lucene as a database... good idea or bad idea?

by ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)

> Don't connect "database" (i.e. SQL, transactions, etc.) and Lucene.
> Connect data storage with simple, fast lookup and Lucene.

Thanks, Grant, for the clarification. I see now.

Nagesh


Re: Using lucene as a database... good idea or bad idea?

by jalopyuser

I do this with uplib (http://uplib.parc.com/) with fair success.
Originally I thought I'd need Lucene plus a relational database to
store metadata about the documents for metadata searches.  So far,
though, I've been able to store the metadata in Lucene and use the
same Lucene DB for both metadata and content.
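
A rough sketch of what "one index for both metadata and content" means, as a toy Python model rather than Lucene or UpLib code. The FieldedIndex class and the sample field names (title, author, content) are hypothetical; the only point is that each document may carry a different set of fields and every field is searched through the same structure.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Document = Dict[str, str]


class FieldedIndex:
    """Toy index where metadata fields and body text share one structure."""

    def __init__(self) -> None:
        self._docs: List[Document] = []
        # (field name, term) -> ids of documents with that term in that field
        self._postings: Dict[Tuple[str, str], Set[int]] = defaultdict(set)

    def add(self, doc: Document) -> int:
        doc_id = len(self._docs)
        self._docs.append(doc)
        for name, value in doc.items():
            for term in value.lower().split():
                self._postings[(name, term)].add(doc_id)
        return doc_id

    def search(self, name: str, term: str) -> List[Document]:
        # Look up one term in one field; results come back in insertion order.
        ids = sorted(self._postings.get((name, term.lower()), set()))
        return [self._docs[i] for i in ids]


index = FieldedIndex()
index.add({"title": "Quarterly report", "author": "bill", "content": "searching the archive"})
index.add({"title": "Crawler notes", "content": "searching crawled pages"})  # no author field
```

A metadata query such as search("author", "bill") and a content query such as search("content", "searching") go through the same postings table, so no second store is needed for the metadata in this model.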

Bill


Re: Using lucene as a database... good idea or bad idea?

by Matthew Hall-7

Yeah, we do the same thing here for indexes of up to 57M documents
(rows), and that's just one part of our implementation.

It takes quite a bit of wrangling to use Lucene in this manner, but
we've found it to be utterly worthwhile.

Matt

Ian Lea wrote:

> John
>
>
> I think it's a great idea, and do exactly this to store 5 million+
> documents with info that it takes way too long to get out of our
> Oracle database (think days).  Not as many docs as you are talking
> about, and less data for each doc, but I wouldn't have any concerns
> about scaling.  There are certainly lucene indexes out there bigger
> than what you propose.  You can compress the stored data to save some
> space.  Run times for optimization might get interesting but see
> recent threads for suggestions on that.  And since you are not too
> concerned about performance you may not need to optimize much, or even
> at all.
>
> Of course you need to remember that this is not a DBMS solution in the
> sense of transactions, recovery, etc. but I'm sure you are already
> aware of that.
>
>
> --
> Ian.

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@...
(207) 288-6012



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@...
For additional commands, e-mail: java-user-help@...


Re: Using lucene as a database... good idea or bad idea?

by chrislusf

It surely is possible. AFAIK, LinkedIn uses Lucene to store some data.

But a Lucene index is, in a sense, similar to a database index. Both are
data structures for a specialized and limited query execution path.

So this depends on your application's queries and on how you create the
Lucene index. The normal usage you listed sounds reasonable.
But you may also need to think about maintenance. In case the index is
corrupted somehow, you may also consider storing the data in a database,
which is easier to manipulate manually.
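A minimal sketch of that maintenance point, assuming the documents live in a separate system of record and the search index is treated as derived data that can be rebuilt at any time. Everything here is hypothetical: `RebuildableIndexSketch` uses in-memory maps as stand-ins for both the database and the Lucene index, and the tokenizer is deliberately naive.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in: "store" plays the database, "index" plays the search index.
final class RebuildableIndexSketch {
    // System of record: document ID -> (field name -> value).
    private final Map<String, Map<String, String>> store = new TreeMap<>();
    // Derived inverted index: lower-cased token -> IDs of documents containing it.
    private Map<String, Set<String>> index = new HashMap<>();

    // Adds or replaces a document, keeping the index in step with the store.
    void put(String id, Map<String, String> fields) {
        Map<String, String> previous = store.put(id, new HashMap<>(fields));
        if (previous != null) {
            for (Set<String> ids : index.values()) {
                ids.remove(id); // drop postings of the replaced version
            }
        }
        addPostings(index, id, fields);
    }

    // Retrieval by ID never touches the index.
    Map<String, String> get(String id) {
        return store.get(id);
    }

    Set<String> search(String token) {
        Set<String> ids = index.get(token.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(ids);
    }

    // Simulates a lost or corrupted index.
    void dropIndex() {
        index = new HashMap<>();
    }

    // Recreates the index purely from the system of record.
    void rebuildIndex() {
        Map<String, Set<String>> fresh = new HashMap<>();
        for (Map.Entry<String, Map<String, String>> doc : store.entrySet()) {
            addPostings(fresh, doc.getKey(), doc.getValue());
        }
        index = fresh;
    }

    private static void addPostings(Map<String, Set<String>> target, String id,
                                    Map<String, String> fields) {
        for (String value : fields.values()) {
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    target.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
                }
            }
        }
    }
}
```

With this split, losing the index costs a rebuild rather than the data, and the stored documents can be inspected or repaired with ordinary database tools.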

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!


Re: Using lucene as a database... good idea or bad idea?

by John Evans-10

Hi All,

Thanks for all of the feedback.  Largely as a result of the responses I've
received from the mailing list, Lucene has made its way onto our short
list of possible solutions.  I'm not sure what the timeframe is for
implementing a prototype and testing it, but I will try to report back with
the results when/if it happens.

Thanks again,
-
John




Re: Using lucene as a database... good idea or bad idea?

by Marcelo Ochoa

Hi John:
  Have you tried, or do you know of, the Lucene Domain Index for Oracle
Database?
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
  If you are using Oracle 10g/11g, it is completely integrated into the
Oracle memory space, like Oracle Text but based on Lucene.
  No network round trip is involved at indexing or query time, and the
Lucene store is replaced by BLOB database storage.
  You can also query your Oracle text store directly from SQL with two new
operators, lcontains and lscore, which are based on Lucene and directly
integrated with the Oracle execution plan.
  Best regards, Marcelo.




--
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/



Re: Using lucene as a database... good idea or bad idea?

by Jason Rutherglen-2

A possible open source solution using a page-based database would be to
store the documents in http://jdbm.sourceforge.net/ which offers BTree,
hash, and raw page-based access.  One would use a primary-key type of
persistent ID to look up the document data in JDBM.

This would be a good Lucene project to implement, and I think a good
solution for Ocean (LUCENE-1313).  Storing documents in Lucene is fine, but
in a realtime search index where many documents are being deleted, a lot of
garbage builds up.  Frequent merging of document files becomes IO intensive.

Of course, one open question with JDBM (and I am not sure what other SQL
page-based systems do) is whether it can load individual fields directly
from disk rather than loading the entire page into RAM first and then
reading the fields from it.  Maybe it does not matter.
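To make the garbage build-up concrete, here is a minimal sketch, assuming a simplified append-only store in which updates and deletes only mark old entries as dead and a separate merge step rewrites the live ones. `AppendOnlyStoreSketch` is hypothetical and is not Lucene's actual segment format or merge policy; the number of entries copied by `merge()` is only a rough proxy for the IO involved.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an append-only document file with mark-only deletes.
final class AppendOnlyStoreSketch {
    private List<String> ids = new ArrayList<>();
    private List<String> bodies = new ArrayList<>();
    private BitSet deleted = new BitSet();
    // Persistent ID -> slot holding the current version of that document.
    private final Map<String, Integer> liveSlot = new HashMap<>();

    // An update appends the new version and marks the old slot dead.
    void update(String id, String body) {
        Integer old = liveSlot.get(id);
        if (old != null) {
            deleted.set(old);
        }
        ids.add(id);
        bodies.add(body);
        liveSlot.put(id, ids.size() - 1);
    }

    // A delete only marks the slot; the bytes stay until the next merge.
    void delete(String id) {
        Integer old = liveSlot.remove(id);
        if (old != null) {
            deleted.set(old);
        }
    }

    // Lookup by persistent ID; returns null if the document is absent.
    String get(String id) {
        Integer slot = liveSlot.get(id);
        return slot == null ? null : bodies.get(slot);
    }

    int totalSlots() {
        return ids.size();
    }

    int garbageSlots() {
        return deleted.cardinality();
    }

    // Rewrites every live entry into fresh lists; returns how many were copied.
    int merge() {
        List<String> newIds = new ArrayList<>();
        List<String> newBodies = new ArrayList<>();
        liveSlot.clear();
        for (int slot = 0; slot < ids.size(); slot++) {
            if (!deleted.get(slot)) {
                liveSlot.put(ids.get(slot), newIds.size());
                newIds.add(ids.get(slot));
                newBodies.add(bodies.get(slot));
            }
        }
        ids = newIds;
        bodies = newBodies;
        deleted = new BitSet();
        return ids.size();
    }
}
```

The trade-off in this model is that every merge copies all live documents, however few were deleted, which is why a keyed page-based store that updates in place can look attractive for large, frequently changing documents.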


Re: Using lucene as a database... good idea or bad idea?

by Karsten F.

Hi Grant,

you mentioned Jackrabbit as an example of storing data in Lucene.
I did not find anything like that in the source code. I found
"LocalFileSystem" and "DatabaseFileSystem".
(I did find Lucene used for indexing and searching.)

Have I overlooked something?

Best regards
   Karsten

 
Grant Ingersoll-6 wrote:
I think the answer is it can be done and probably quite well.  I also  
think it's informative that Nutch does not use Lucene for this  
function, as I understand it, but that shouldn't stop you either.  You  
might also have a look at Apache Jackrabbit, which uses Lucene  
underneath as a content repository.

-Grant

Re: Using lucene as a database... good idea or bad idea?

by Grant Ingersoll-6

Hmmm, I thought it did.  Can't say I've studied the code though, so  
I'll take your word for it.

Never mind on the Jackrabbit suggestion :-)

Cheers,
Grant



