how to Index only newly added documents?


how to Index only newly added documents?

by tarunsapra

Hi People,

I am stuck on a problem. I have a resources directory containing a lot of documents, and my Java program picks up documents from this directory. Is there a way, using the Lucene APIs, to recognize documents that have already been indexed, filter them out, and index only the newly added documents?

Thanks
Tarun

Re: how to Index only newly added documents?

by rodrigofurtado

Look at the class:

org.pdfbox.searchengine.lucene.IndexFiles

This is an example class for creating an index and for re-indexing when you add documents to, or delete documents from, a directory.

Basically, you choose the mode through the arguments you pass when running the class.

To create the index directory, try this:

java -Xms256m -Xmx512m org.pdfbox.searchengine.lucene.IndexFiles -create -index <your_index_directory> <your_documents_directory>


To only update the index for the directory (new or deleted files), try this (note that the '-create' argument is not present):


java -Xms256m -Xmx512m org.pdfbox.searchengine.lucene.IndexFiles -index <your_index_directory> <your_documents_directory>


Bye




Re: how to Index only newly added documents?

by diego.cassinera

The API allows you to add documents to an index. However, it does not have any functionality to detect which ones are new or changed. Regardless, this is a fairly trivial thing to do: just write an indexing app that reads the file names from standard input. In a Linux shell, use find or ls and pipe the result to your app.

Diego
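
A minimal sketch of the stdin-driven approach Diego describes, using only the JDK. The class name StdinFileList and the method readPaths are hypothetical names chosen for illustration, and the actual Lucene indexing call is left as a placeholder because the thread does not show the poster's indexing code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene: collects the file names piped in on stdin.
class StdinFileList {

    // Reads one path per line, trimming whitespace and skipping blank lines.
    static List<String> readPaths(Reader in) {
        return new BufferedReader(in).lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (String path : readPaths(new InputStreamReader(System.in))) {
            // Placeholder: this is where each file would be parsed and added to the index.
            System.out.println("would index: " + path);
        }
    }
}
```

One possible way to feed it only new files is to keep a marker file that is touched after each indexing run (the marker file is an assumption, not something from the thread):

find <your_documents_directory> -type f -newer <marker_file> | java StdinFileList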
In reply to tarunsapra's message "how to Index only newly added documents?", sent Nov 3, 2009 9:06 AM.

Re: how to Index only newly added documents?

by tarunsapra

Thanks for the reply!

But I need to filter out the already indexed documents. That is, if the resources directory contains 2 documents that are already indexed, and 2 more documents are then added, the indexer should index only the newly added documents into the already existing index location.

Thanks


Re: how to Index only newly added documents?

by Simon Willnauer

The common approach is to store a UUID field in the index and call
updateDocument with a delete term holding the document's UUID.
That way, only the most recently added document for a given UUID ends
up in the index.

simon
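
The reply above does not say where the UUID comes from. One possible choice, sketched below with only the JDK, is to derive a repeatable ID from the file's normalized absolute path, so that re-indexing the same file yields the same ID. The class name DocumentIds and the method idFor are hypothetical, and a path-based ID is an assumption: a renamed or moved file would get a new ID. The resulting string would then be stored in the ID field and used as the term passed to updateDocument.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

// Hypothetical helper: derives a repeatable ID for a file from its path.
class DocumentIds {

    // Same normalized absolute path in, same name-based UUID string out.
    static String idFor(Path file) {
        String key = file.toAbsolutePath().normalize().toString();
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        // Example path only; any file under the resources directory would do.
        Path file = Paths.get("resources", "report.pdf");
        System.out.println(idFor(file));
    }
}
```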

In reply to tarunsapra's message of Wed, Nov 4, 2009 at 6:41 AM.


Re: how to Index only newly added documents?

by Shashi Kant-2

As Simon mentioned, you might want to create a document identifier or
UUID, if you don't have one already,
and use this code snippet to check whether the document exists:

 String doc_id = "1234567";
 // Term that identifies the document by its ID field
 Term idTerm = new Term(Fields.DOCID_FIELD, doc_id);
 if (mSearcher.docFreq(idTerm) > 0) {
     // This document already exists, so skip it.
     // To replace it instead: mIndexWriter.updateDocument(idTerm, doc);
 } else {
     // Not in the index yet, so add it.
     mIndexWriter.addDocument(doc);
 }