how to Index only newly added documents?
Hi people,
I am stuck on a problem. I have a resources directory containing a lot of documents, and my Java program picks up documents from this directory. Is there a way, using the Lucene APIs, to recognize documents that have already been indexed, filter them out, and index only the newly added documents?
Thanks,
Tarun
|
|
Re: how to Index only newly added documents?
Look at the class:
org.pdfbox.searchengine.lucene.IndexFiles
This is an example class for creating and updating an index when you add documents to, or delete documents from, a directory. Basically, you choose the mode when you run the class.
To create the index directory, try this:
    java -Xms256m -Xmx512m org.pdfbox.searchengine.lucene.IndexFiles -create -index <your_index_directory> <your_documents_directory>
To only update the index (new or deleted files), try this (note that the '-create' argument is not present):
    java -Xms256m -Xmx512m org.pdfbox.searchengine.lucene.IndexFiles -index <your_index_directory> <your_documents_directory>
Bye
|
|
|
|
|
Re: how to Index only newly added documents?
Thanks for the reply!
But I need to filter out the already indexed documents. That is, if the resources directory contains 2 documents that are already indexed and 2 more documents are then added, the indexer should index only the newly added documents into the existing index location.
Thanks
|
|
|
Re: how to Index only newly added documents?
The common approach is to use a UUID field in the index and call updateDocument with a delete term holding the document's UUID. That way, only the most recently added document for a given UUID ends up in the index.
simon
On Wed, Nov 4, 2009 at 6:41 AM, tarunsapra <t.sapra97@...> wrote:
> Thanks for the reply!
> But I need to filter out the already indexed documents. That is, if the
> resources directory contains 2 documents that are already indexed and 2 more
> documents are then added, the indexer should index only the newly added
> documents into the existing index location.
> Thanks
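The UUID approach above needs an identifier that comes out the same every time the same file is seen, otherwise the delete term will never match the earlier copy. One possible way to get that, sketched here with the JDK only, is a name-based UUID derived from the file's path. The class name StableDocId, the method forPath, and the example path are made up for illustration; they are not part of Lucene or of anything posted in this thread, and using the path as the key is an assumption (a renamed file would get a new ID).

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

// Hypothetical helper, not part of Lucene: derives a repeatable ID for a file.
final class StableDocId {

    // The same path always yields the same UUID string (name-based UUID).
    // Assumption: the normalized absolute path identifies the document.
    static String forPath(Path file) {
        String key = file.toAbsolutePath().normalize().toString();
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        Path doc = Paths.get("resources", "report.pdf"); // example path, assumed
        System.out.println(doc + " -> " + forPath(doc));
    }
}
```

The resulting string would be stored in the UUID field and reused as the value of the delete term passed to updateDocument.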
|
|
Re: how to Index only newly added documents?
As Simon mentioned, you might want to create a document identifier or UUID if you don't have one already, and use this code snippet to check whether the document exists:
    String doc_id = "1234567";
    Term idTerm = new Term(Fields.DOCID_FIELD, doc_id);
    if (mSearcher.docFreq(idTerm) > 0) {
        // This document exists, hence skip it.
        // mIndexWriter.updateDocument(idTerm, doc);
    } else {
        mIndexWriter.addDocument(doc);
    }
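To show how the exists check fits the original question (2 documents already indexed, 2 more added), here is a rough JDK-only sketch of the filtering step. It does not use Lucene: the class NewDocumentFilter, the method onlyNew, and the file names are hypothetical, and the in-memory set merely stands in for the index. With a real index, the predicate would be the docFreq(idTerm) > 0 test from the snippet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper, not part of Lucene: keeps only IDs that are not indexed yet.
final class NewDocumentFilter {

    // Returns the IDs for which the "already indexed?" check says no, in input order.
    static List<String> onlyNew(List<String> candidateIds, Predicate<String> alreadyIndexed) {
        List<String> fresh = new ArrayList<>();
        for (String id : candidateIds) {
            if (!alreadyIndexed.test(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the index (assumption, for illustration only).
        Set<String> indexed = new HashSet<>(Arrays.asList("doc1.pdf", "doc2.pdf"));
        List<String> inDirectory = Arrays.asList("doc1.pdf", "doc2.pdf", "doc3.pdf", "doc4.pdf");
        System.out.println(onlyNew(inDirectory, indexed::contains)); // prints [doc3.pdf, doc4.pdf]
    }
}
```

Only the IDs returned by onlyNew would then go through addDocument, so the two documents already in the index are left untouched.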