|
|
[jira] Created: (LUCENE-1313) Ocean Realtime Search

Ocean Realtime Search
---------------------

Key: LUCENE-1313
URL: https://issues.apache.org/jira/browse/LUCENE-1313
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Jason Rutherglen

Provides realtime search using Lucene. Conceptually, updates are divided into discrete transactions. Each transaction is recorded to a transaction log, which is similar to the MySQL binary log. Deletes from the transaction are applied to the existing indexes. Document additions are made to an in-memory InstantiatedIndex. The transaction is then complete. After each transaction, TransactionSystem.getSearcher() may be called, which allows searching over the index including the latest transaction.

TransactionSystem is the main class. Methods similar to those of IndexWriter are provided for updating. getSearcher returns a Searcher class.

- getSearcher()
- addDocument(Document document)
- addDocument(Document document, Analyzer analyzer)
- updateDocument(Term term, Document document)
- updateDocument(Term term, Document document, Analyzer analyzer)
- deleteDocument(Term term)
- deleteDocument(Query query)
- commitTransaction(List<Document> documents, Analyzer analyzer, List<Term> deleteByTerms, List<Query> deleteByQueries)

Sample code:

{code}
// setup
FSDirectoryMap directoryMap = new FSDirectoryMap(new File("/testocean"), "log");
LogDirectory logDirectory = directoryMap.getLogDirectory();
TransactionLog transactionLog = new TransactionLog(logDirectory);
TransactionSystem system = new TransactionSystem(transactionLog, new SimpleAnalyzer(), directoryMap);

// transaction
Document d = new Document();
d.add(new Field("contents", "hello world", Field.Store.YES, Field.Index.TOKENIZED));
system.addDocument(d);

// search (the query below is only an example; any Query can be used)
Query query = new TermQuery(new Term("contents", "hello"));
OceanSearcher searcher = system.getSearcher();
ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
System.out.println(hits.length + " total results");
for (int i = 0; i < hits.length && i < 10; i++) {
  // a separate name avoids redeclaring d from the transaction step
  Document doc = searcher.doc(hits[i].doc);
  System.out.println(i + " " + hits[i].score + " " + doc.get("contents"));
}
{code}

There is a test class, org.apache.lucene.ocean.TestSearch, that was used for basic testing.

A sample disk directory structure is as follows:

|/snapshot_105_00.xml | XML file recording which indexes, and which of their generation numbers, correspond to a snapshot. Each transaction creates a new snapshot file. In this file name, 105 is the snapshot id, also known as the transaction id. The 00 is the minor version of the snapshot, corresponding to a merge. A merge is a minor snapshot version because the data does not change, only the underlying structure of the index|
|/3 | Directory containing an on-disk Lucene index|
|/log | Directory containing log files|
|/log/log00000001.bin | Log file. As new log files are created, the suffix number is incremented|

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...
|
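The snapshot file naming described in the issue (snapshot id, then a minor version that a merge increments) can be illustrated with a small parser. This is only a sketch: the class SnapshotFileName is hypothetical and not part of the patch, and the two-digit zero padding of the minor version is inferred from the single example name snapshot_105_00.xml.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the Ocean patch.
final class SnapshotFileName {
    // Assumed pattern: snapshot_<snapshotId>_<minorVersion>.xml
    private static final Pattern NAME = Pattern.compile("snapshot_(\\d+)_(\\d+)\\.xml");

    final long snapshotId;   // also the transaction id
    final int minorVersion;  // changes when a merge restructures the index

    SnapshotFileName(long snapshotId, int minorVersion) {
        this.snapshotId = snapshotId;
        this.minorVersion = minorVersion;
    }

    // Parses a bare file name (no directory part).
    static SnapshotFileName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a snapshot file name: " + fileName);
        }
        return new SnapshotFileName(Long.parseLong(m.group(1)), Integer.parseInt(m.group(2)));
    }

    // Rebuilds the file name; the padding width is an assumption.
    String format() {
        return String.format("snapshot_%d_%02d.xml", snapshotId, minorVersion);
    }
}
```

Under these assumptions, a merge of snapshot 105 would produce snapshot_105_01.xml, while the next transaction would produce snapshot_106_00.xml.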
|
|
[jira] Updated: (LUCENE-1313) Ocean Realtime Search

[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-------------------------------------
Attachment: lucene-1313.patch

lucene-1313.patch

Patch includes libraries:
* commons-io-1.3.2.jar
* commons-lang-2.3.jar
* jdom.jar
* slf4j-api-1.5.2.jar
* slf4j-simple-1.5.2.jar
* source from http://reader.imagero.com/uio/
* source from net.sourceforge.jsorter
|
|
|
[jira] Updated: (LUCENE-1313) Ocean Realtime Search

[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-------------------------------------
Attachment: lucene-1313.patch

lucene-1313.patch

Removed the http://reader.imagero.com/uio/ code as it routinely corrupted the log. Its replacement is RandomAccessFile. Added MultiThreadSearcherPolicy, which is used to create a multithreaded Searcher. Transaction multithreading has been removed because it makes debugging hard. It will be optional in the future. Many bugs have been fixed. TestSearch tests for deletes. Index directories are now of the form "2_index": the index id followed by the suffix "_index".
|
|
|
[jira] Updated: (LUCENE-1313) Ocean Realtime Search

[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-------------------------------------
Attachment: lucene-1313.patch

lucene-1313.patch

Depends on LUCENE-1312 and LUCENE-1314. More bugs fixed. Deletes are committed to indexes only intermittently, which improves the update speed dramatically. MaybeMergeIndexes now runs via a background timer.

Will remove writing a snapshot.xml file per transaction in favor of a human-readable log. Creating and deleting these small files is a bottleneck for update speed. This way a transaction writes to 2 files only. The merges happen in the background and so never affect the transaction update speed. I am not sure how useful it would be, but it is possible to have a priority-based IO system that favors transactions over merges. If a transaction comes in while a merge is writing to disk, the merge is stopped, the transaction IO runs, and then the merge IO continues.

I am not sure how to handle Documents with Fields that have a TokenStream as the value, as I believe these cannot be serialized. For now I assume it will be unsupported. I am also not sure how to handle analyzers; are these generally serializable? It would be useful to serialize them for a more automated log recovery process.
|
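The priority-based IO idea from this update (transaction IO runs ahead of pending merge IO) could be sketched roughly as below. This is a single-threaded illustration under stated assumptions, not code from the patch: PriorityIoQueue and its Kind enum are hypothetical names, and a merge is assumed to be split into small chunks so that a transaction can run between two chunks.

```java
import java.util.PriorityQueue;

// Hypothetical sketch: transaction IO is always served before merge IO.
final class PriorityIoQueue {
    // Declaration order doubles as priority order (TRANSACTION first).
    enum Kind { TRANSACTION, MERGE }

    private static final class Task implements Comparable<Task> {
        final Kind kind;
        final long seq;       // preserves submission order within one kind
        final Runnable work;

        Task(Kind kind, long seq, Runnable work) {
            this.kind = kind;
            this.seq = seq;
            this.work = work;
        }

        @Override
        public int compareTo(Task other) {
            int byKind = kind.compareTo(other.kind);
            return byKind != 0 ? byKind : Long.compare(seq, other.seq);
        }
    }

    private final PriorityQueue<Task> queue = new PriorityQueue<>();
    private long nextSeq = 0;

    void submit(Kind kind, Runnable work) {
        queue.add(new Task(kind, nextSeq++, work));
    }

    // Runs queued work until the queue is empty; tasks may submit more work.
    void runAll() {
        Task task;
        while ((task = queue.poll()) != null) {
            task.work.run();
        }
    }
}
```

If a merge chunk submits a transaction while it runs, that transaction is executed before the next merge chunk, which approximates "the merge is stopped and the transaction IO runs, then the merge IO continues". A real implementation would also need thread safety and a way to bound how long a single merge chunk can hold the disk.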
|
|
[jira] Updated: (LUCENE-1313) Ocean Realtime Search

[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-------------------------------------
Attachment: lucene-1313.patch

lucene-1313.patch

- Depends on LUCENE-1314
- OceanSegmentReader implements reuse of deletedDocs bytes in conjunction with LUCENE-1314
- Snapshot logging happens to a rolling log file
- CRC32 checking added to the transaction log
- Added TestSystem test case that performs adds, updates and deletes. TestSystem uses arbitrarily small settings to force the various background merges to happen within a minimal number of transactions
- Transactions with over N documents are encoded into a segment (via RAMDirectory) and written to the transaction log rather than serialized as Documents
- Started wiki page http://wiki.apache.org/lucene-java/OceanRealtimeSearch linked from http://wiki.apache.org/lucene-java/LuceneResources. Will place documentation there.
- Document fields with Reader or TokenStream values supported

Began work on LargeBatch functionality; it needs a test case. Large batches allow adding documents in bulk (also performing deletes) in a transaction that goes straight to an index, bypassing the transaction log. This provides the same speed as using IndexWriter to perform bulk Document processing in Ocean.

Started OceanDatabase, which will offer a Java API inspired by GData. It will offer optimistic concurrency (something required in a realtime search system) and dynamic object mapping (meaning types such as long, date and double will be mapped to a string term using some Solr code).

A file sync is performed after each transaction. Will add an option to allow syncing after N transactions, as in MySQL. This will improve realtime update speeds.

Future:
- Support for multiple servers by implementing a distributed API and replication using LUCENE-1336
- Test case akin to TestStressIndexing2, mainly to test threading
- Add LargeBatch test to TestSystem
- Facets
- Looking at adding a GData-compatible XML over HTTP API. Possibly can reuse the old Lucene GData code.
- Integrate tag index when it's completed
- Add LRU record cache to the transaction log, which will be useful for faster replication
|
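The CRC32 checking mentioned for the transaction log might look roughly like the following. This is a minimal sketch using java.util.zip.CRC32 from the JDK; the class name ChecksummedRecord and the record layout (length, checksum, payload) are assumptions for illustration, since the thread does not describe the actual log format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical record format: [int length][long crc32][payload bytes].
final class ChecksummedRecord {

    private static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    // Wraps a serialized transaction in a length + CRC32 header.
    static byte[] encode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + payload.length);
        buf.putInt(payload.length).putLong(checksum(payload)).put(payload);
        return buf.array();
    }

    // Returns the payload, or throws if the record was corrupted on disk.
    // Assumes the record is at least 12 bytes long (a full header).
    static byte[] decode(byte[] record) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(record);
        int length = buf.getInt();
        long expected = buf.getLong();
        if (length < 0 || length > buf.remaining()) {
            throw new IOException("Corrupt record: bad length " + length);
        }
        byte[] payload = new byte[length];
        buf.get(payload);
        if (checksum(payload) != expected) {
            throw new IOException("Corrupt record: CRC32 mismatch");
        }
        return payload;
    }
}
```

During log recovery, a failed check on the last record would typically indicate a partially written transaction that can be discarded, while a failure in the middle of the log points to real corruption.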
|
|
[jira] Commented: (LUCENE-1313) Ocean Realtime Search

[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627612#action_12627612 ]

Karl Wettin commented on LUCENE-1313:
-------------------------------------

Hi Jason,

I took an initial look at your code last night. Didn't actually execute anything, just followed method calls around to see what it was up to.

My first comment is sort of boring, but there are virtually no javadocs for the core classes such as TransactionSystem, Batch and Index. It would be great if there was a bit at the class level explaining what classes they interact with and how. It would also be very helpful if there were method-level javadocs for at least the top-level commit-related logic.

One thing that caught my attention early is this method in TransactionSystem:

{code:java}
public OceanSearcher getSearcher() throws IOException {
  Snapshot snapshot = snapshots.getLatestSnapshot();
  if (searcherPolicy instanceof SingleThreadSearcherPolicy) {
    return new OceanSearcher(snapshot);
  } else {
    return new OceanMultiThreadSearcher(snapshot, searchThreadPool);
  }
}
{code}

Am I supposed to call this method for each query (as suggested by the method name), or is this a factory method used to update my own Searcher instance after committing documents to the index (as suggested by the code)?

It's not such a big deal, but I personally think you should refactor the instanceof to a Policy.searcherFactory method, or perhaps even a SearcherPolicyVisitor. Actually, this goes for a few other places in the module too: you have used instanceof and unchecked casting a bit more extensively to solve problems than I would have. But as it does not seem to be used in places where it would be a costly thing to do, these comments are merely about code readability and gut feelings about future problems.

I'm a bit concerned about the potential loss of data while documents only reside in InstantiatedIndex or RAMDirectory. I think I'd like an option for some sort of transaction log that could be replayed in case of a crash. I think the easiest way would be to convert all documents to be pre-analyzed (field.tokenStream) before passing them on to the instantiated writer. I don't know how many resources that might consume, but it would make me feel safer.

karl
|
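The refactoring suggested in this comment, replacing the instanceof check with a factory method on the policy, could be sketched as follows. Everything here is a stand-in: the class names mirror those in the thread, but the bodies are placeholders, the method name createSearcher is hypothetical, and letting the multithreaded policy own the thread pool is an assumption (in the quoted code the pool belongs to TransactionSystem).

```java
import java.util.concurrent.ExecutorService;

// Placeholder for the real Ocean snapshot type.
final class Snapshot {
    final long id;
    Snapshot(long id) { this.id = id; }
}

// Placeholder searcher; the real class would extend a Lucene Searcher.
class OceanSearcher {
    final Snapshot snapshot;
    OceanSearcher(Snapshot snapshot) { this.snapshot = snapshot; }
}

final class OceanMultiThreadSearcher extends OceanSearcher {
    final ExecutorService pool;
    OceanMultiThreadSearcher(Snapshot snapshot, ExecutorService pool) {
        super(snapshot);
        this.pool = pool;
    }
}

// The policy decides which searcher to build, so callers need no instanceof.
interface SearcherPolicy {
    OceanSearcher createSearcher(Snapshot snapshot);
}

final class SingleThreadSearcherPolicy implements SearcherPolicy {
    @Override
    public OceanSearcher createSearcher(Snapshot snapshot) {
        return new OceanSearcher(snapshot);
    }
}

final class MultiThreadSearcherPolicy implements SearcherPolicy {
    private final ExecutorService pool;

    MultiThreadSearcherPolicy(ExecutorService pool) { this.pool = pool; }

    @Override
    public OceanSearcher createSearcher(Snapshot snapshot) {
        return new OceanMultiThreadSearcher(snapshot, pool);
    }
}
```

With this shape, getSearcher() would reduce to a single call such as searcherPolicy.createSearcher(snapshots.getLatestSnapshot()), and adding a new policy would not require touching TransactionSystem.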
|
|
[jira] Commented: (LUCENE-1313) Ocean Realtime Search[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627642#action_12627642 ] Jason Rutherglen commented on LUCENE-1313: ------------------------------------------ Hi Karl, Thanks for taking a look at the code! Yes the methods need javadoc, I was waiting to see if I had settled on them, and because I started building new code on top I guess the methods have settled so I need to add javadoc to them. If you are using TransactionSystem then the getSearcher method would be called for each query. I have developed OceanDatabase which makes the searching transparent and implements optimistic concurrency (version number stored in the document). I believe most systems will want to use OceanDatabase, however the raw TransactionSystem which is more like IndexWriter will be left as well. I have been working on OceanDatabase and have neglected the javadocs of TransactionSystem. I modeled the searcherPolicy instanceof code on the MergeScheduler type of system where there is a marker interface that the subclasses implement. I don't mind changing it, or if you want to you can as well. I considered it a minor detail though and admittedly did not spend much time on it. You are welcome to change it. The transaction log is replayed on a restart of the system. It repopulates a RamIndex (uses RAMDirectory) on startup based on the max snapshot id of the existing indexes, and replays the transaction log from there. I looked at converting documents to a token stream, the problem is, if the field is stored, it creates redundant storage of the data in the transaction log. Ultimately I could not find anything to be gained from storing a token stream. Also if it was converted, what would happen with stored fields? The issue with replaying the document later though is not having the Analyzer. 
In the distributed object code patch LUCENE-1336 I made Analyzer Serializable. I think it's best to serialize the Analyzer, or create a small database of serialized analyzers that can be called upon during the transaction log recovery process. Because I am not entirely sure about the ramifications of serializing the Analyzer (for example, how much data a serialized Analyzer may hold), perhaps others have some ideas or feedback about serializing analyzers.

In conclusion, I'll add more javadocs. Please feel free to ask more questions!

Jason |
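The "small database of serialized analyzers" idea could be sketched roughly as follows. This is only an illustration using plain Java serialization: `AnalyzerRegistry`, `AnalyzerConfig`, and the integer id scheme are hypothetical names, not part of the Ocean patch, and `AnalyzerConfig` merely stands in for an Analyzer that has been made Serializable. The sketch also reports the serialized size, which is one way to measure how much data a serialized analyzer carries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an Analyzer that has been made Serializable.
final class AnalyzerConfig implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final String[] stopWords;

    AnalyzerConfig(String name, String[] stopWords) {
        this.name = name;
        this.stopWords = stopWords;
    }
}

// Hypothetical registry: a log record would store only the analyzer id,
// and log recovery would look the serialized analyzer up by that id.
final class AnalyzerRegistry {
    private final Map<Integer, byte[]> serialized = new HashMap<Integer, byte[]>();
    private int nextId = 1;

    // Serializes the analyzer once and returns the id to record in the log.
    synchronized int register(Serializable analyzer) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(analyzer);
            out.close();
            int id = nextId++;
            serialized.put(id, bytes.toByteArray());
            return id;
        } catch (IOException e) {
            throw new IllegalStateException("could not serialize analyzer", e);
        }
    }

    // Size in bytes of the stored form of the analyzer.
    synchronized int serializedSize(int id) {
        return lookup(id).length;
    }

    // Deserializes a fresh copy of the analyzer for use during log replay.
    synchronized Object load(int id) {
        try {
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(lookup(id)));
            try {
                return in.readObject();
            } finally {
                in.close();
            }
        } catch (IOException e) {
            throw new IllegalStateException("could not deserialize analyzer " + id, e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("analyzer class missing for id " + id, e);
        }
    }

    private byte[] lookup(int id) {
        byte[] data = serialized.get(id);
        if (data == null) {
            throw new IllegalArgumentException("unknown analyzer id " + id);
        }
        return data;
    }
}
```

Storing an id per transaction rather than the analyzer itself keeps log records small, at the cost of having to persist the registry alongside the log.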
|
|
[jira] Commented: (LUCENE-1313) Ocean Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628092#action_12628092 ]
Jason Rutherglen commented on LUCENE-1313:
------------------------------------------
Is there a good place to place the javadocs on the Apache website once they are more complete? |
|
|
Re: [jira] Commented: (LUCENE-1313) Ocean Realtime Search
: Is there a good place to place the javadocs on the Apache website once they are more complete?

Generated javadocs aren't really necessary (at least not at this stage). Just having javadoc comments in the code makes it a lot easier to review new contributions and patches. Most people reviewing contributions will either read the javadocs inline while reading the source, or bring the source code up in an IDE and let it show them the javadocs as they browse the source.

-Hoss |
|
|
[jira] Updated: (LUCENE-1313) Ocean Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Rutherglen updated LUCENE-1313:
-------------------------------------
Attachment: LUCENE-1313.patch

LUCENE-1313.patch

Added javadocs. Still needs LUCENE-1314 completed, which will be divided into multiple patches. |
|
|
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Rutherglen updated LUCENE-1313:
-------------------------------------
Component/s: Index (was: contrib/*)
Fix Version/s: 2.9
Priority: Minor (was: Major)
Description:
Realtime search with transactional semantics.

Possible future directions:
* Optimistic concurrency
* Replication

Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.

I think this issue can hold realtime benchmarks which include indexing and searching concurrently.

was:
Provides realtime search using Lucene. Conceptually, updates are divided into discrete transactions. The transaction is recorded to a transaction log which is similar to the mysql bin log. Deletes from the transaction are made to the existing indexes. Document additions are made to an in memory InstantiatedIndex. The transaction is then complete. After each transaction TransactionSystem.getSearcher() may be called which allows searching over the index including the latest transaction.

TransactionSystem is the main class. Methods similar to IndexWriter are provided for updating. getSearcher returns a Searcher class.
- getSearcher()
- addDocument(Document document)
- addDocument(Document document, Analyzer analyzer)
- updateDocument(Term term, Document document)
- updateDocument(Term term, Document document, Analyzer analyzer)
- deleteDocument(Term term)
- deleteDocument(Query query)
- commitTransaction(List<Document> documents, Analyzer analyzer, List<Term> deleteByTerms, List<Query> deleteByQueries)

Sample code:
{code}
// setup
FSDirectoryMap directoryMap = new FSDirectoryMap(new File("/testocean"), "log");
LogDirectory logDirectory = directoryMap.getLogDirectory();
TransactionLog transactionLog = new TransactionLog(logDirectory);
TransactionSystem system = new TransactionSystem(transactionLog, new SimpleAnalyzer(), directoryMap);
// transaction
Document d = new Document();
d.add(new Field("contents", "hello world", Field.Store.YES, Field.Index.TOKENIZED));
system.addDocument(d);
// search (query is assumed to be a Query built elsewhere)
OceanSearcher searcher = system.getSearcher();
ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
System.out.println(hits.length + " total results");
for (int i = 0; i < hits.length && i < 10; i++) {
  // renamed from d, which would clash with the Document declared above
  Document hit = searcher.doc(hits[i].doc);
  System.out.println(i + " " + hits[i].score + " " + hit.get("contents"));
}
{code}

There is a test class org.apache.lucene.ocean.TestSearch that was used for basic testing.

A sample disk directory structure is as follows:
|/snapshot_105_00.xml | XML file containing which indexes and their generation numbers correspond to a snapshot. Each transaction creates a new snapshot file. In this file the 105 is the snapshotid, also known as the transactionid. The 00 is the minor version of the snapshot corresponding to a merge. A merge is a minor snapshot version because the data does not change, only the underlying structure of the index|
|/3 | Directory containing an on disk Lucene index|
|/log | Directory containing log files|
|/log/log00000001.bin | Log file. As new log files are created the suffix number is incremented|

Affects Version/s: 2.4.1
Summary: Realtime Search (was: Ocean Realtime Search) |
|
|
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Rutherglen updated LUCENE-1313:
-------------------------------------
Attachment: LUCENE-1313.patch

The patch includes RealtimeIndex, a basic class for performing atomic transactional realtime indexing and search. A single thread periodically flushes the RAM index to disk. It relies on LUCENE-1516.

We need to benchmark this, specifically:
1) realtime w/ramdir transaction
2) realtime w/queued documents transaction
3) normal indexing

Realtime w/ramdir encodes the transaction to a RAMDirectory, which is added to the RAM writer using IW.addIndexesNoOptimize. Option 1 may be slower than option 2; however, if the system is replicating it may be the only option?

Long term I believe we need to implement searching over the IndexWriter RAM buffer (if possible). However, I am not sure how option 2 would work with it? |
|
|
[jira] Issue Comment Edited: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694810#action_12694810 ]
Jason Rutherglen edited comment on LUCENE-1313 at 4/1/09 3:50 PM:
------------------------------------------------------------------
The patch includes RealtimeIndex, a basic class for performing atomic transactional realtime indexing and search. A single thread periodically flushes the RAM index to disk. It relies on LUCENE-1516.

We need to benchmark this, specifically:
1) realtime w/ramdir transaction
2) realtime w/queued documents transaction
3) normal indexing

Realtime w/ramdir encodes the transaction to a RAMDirectory, which is added to the RAM writer using IW.addIndexesNoOptimize. Option 1 may be slower than option 2; however, if the system is replicating it may be the only option?

Long term I believe we need to implement searching over the IndexWriter RAM buffer (if possible). However, I am not sure how option 1 and replication would work with it? |
|
|
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694917#action_12694917 ]
Michael McCandless commented on LUCENE-1313:
--------------------------------------------
Jason, your last patch looks like it's taking the "flush first to RAM Dir" approach I just described as the next step (on the java-dev thread), right? Which is great!

So this has no external dependencies, right? And it simply layers on top of LUCENE-1516.

I'd be very interested to compare (benchmark) this approach vs solely LUCENE-1516.

Could we change this class so that instead of taking a Transaction object, holding adds & deletes, it simply mirrors IndexWriter's API? Ie, I'd like to decouple the performance optimization of "let's flush small segments through a RAMDir first" from the transactional semantics of "I process a transaction atomically, and lock out other threads' transactions". Ie, the transactional restriction could/should layer on top of this performance optimization for near-realtime search? |
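The layering proposed in this comment could look something like the sketch below. It is not the API in the patch: `RealtimeWriter`, `InMemoryRealtimeWriter`, and `TransactionalWriter` are hypothetical names, and plain strings stand in for Lucene documents and terms. The lower layer mirrors the add/delete shape of IndexWriter and knows nothing about transactions; the upper layer only adds mutual exclusion between transactions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical lower layer: IndexWriter-like calls, no transaction concept.
interface RealtimeWriter {
    void addDocument(String doc);
    void deleteDocuments(String term);
}

// Toy implementation that keeps documents in a list (stands in for the RAM index).
final class InMemoryRealtimeWriter implements RealtimeWriter {
    private final List<String> docs = new ArrayList<String>();

    public synchronized void addDocument(String doc) {
        docs.add(doc);
    }

    // Deletes every document whose text contains the given term.
    public synchronized void deleteDocuments(String term) {
        for (Iterator<String> it = docs.iterator(); it.hasNext();) {
            if (it.next().contains(term)) {
                it.remove();
            }
        }
    }

    public synchronized List<String> snapshot() {
        return new ArrayList<String>(docs);
    }
}

// Hypothetical upper layer: applies a whole transaction while holding a lock,
// so that other transactions are locked out until it finishes. It does not by
// itself hide partially applied state from readers of the lower layer.
final class TransactionalWriter {
    private final RealtimeWriter writer;
    private final ReentrantLock lock = new ReentrantLock();

    TransactionalWriter(RealtimeWriter writer) {
        this.writer = writer;
    }

    void commitTransaction(List<String> adds, List<String> deleteByTerms) {
        lock.lock();
        try {
            for (String term : deleteByTerms) {
                writer.deleteDocuments(term);
            }
            for (String doc : adds) {
                writer.addDocument(doc);
            }
        } finally {
            lock.unlock();
        }
    }
}
```

With this split, callers that only want the RAM-first optimization use the lower layer directly, and callers that need atomic transactions wrap it.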
|
|
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696186#action_12696186 ]
Jason Rutherglen commented on LUCENE-1313:
------------------------------------------
bq. So this has no external dependencies, right?

Yes.

{quote}I'd be very interested to compare (benchmark) this approach vs solely LUCENE-1516.{quote}

Is the .alg using the NearRealtimeReader from LUCENE-1516 our best measure of realtime performance?

{quote} the transactional restriction could/should layer on top of this performance optimization for near-realtime search? {quote}

The transactional system should be able to support both methods. Perhaps a non-locking setting would allow the same RealtimeIndex class to support both modes of operation? |
|
|
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696277#action_12696277 ]
Jason Rutherglen commented on LUCENE-1313:
------------------------------------------
We'll need to integrate the RAM based indexer into IndexWriter to carry over the deletes to the RAM index while it's copied to disk. This is similar to IndexWriter.commitMergedDeletes carrying deletes over at the segment reader level based on a comparison of the current reader and the cloned reader. Otherwise there are redundant deletions to the disk index using IW.deleteDocuments, which can be unnecessarily expensive. To make this external we would need to do the delete by doc id genealogy. |
|
|
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696438#action_12696438 ]
Michael McCandless commented on LUCENE-1313:
--------------------------------------------
{quote}
> I'd be very interested to compare (benchmark) this approach
> vs solely LUCENE-1516.

Is the .alg using the NearRealtimeReader from LUCENE-1516 our best measure of realtime performance?
{quote}

So far, I think so? You get to set an update rate (delete + add docs), eg 50 docs/sec, and a pause time between NRT reopens.

Still, it's synthetic. If you guys (LinkedIn) have a way to fold in some realism into the test, that'd be great, if only "our app ingests at X docs(MB)/sec and reopens the NRT reader X times per second" to set our ballpark.

{quote}
> the transactional restriction could/should layer on
> top of this performance optimization for near-realtime search?

The transactional system should be able to support both methods. Perhaps a non-locking setting would allow the same RealtimeIndex class support both modes of operation?
{quote}

Sorry, what are both "modes" of operation?

I think there are two different "layers" here. The first layer optimizes NRT by flushing small segments to RAMDir first. This seems generally useful and in theory has no impact on the API IndexWriter exposes (it's "merely" an internal optimization). The second layer adds this new Transaction object, such that N adds/deletes/commit/re-open NRT reader can be done atomically wrt other pending Transaction objects.

{quote}
We'll need to integrate the RAM based indexer into IndexWriter to carry over the deletes to the ram index while it's copied to disk. This is similar to IndexWriter.commitMergedDeletes carrying deletes over at the segment reader level based on a comparison of the current reader and the cloned reader. Otherwise there's redundant deletions to the disk index using IW.deleteDocuments which can be unnecessarily expensive. To make external we would need to do the delete by doc id genealogy.
{quote}

Right, I think the RAMDir optimization would go directly into IW, if we can separate it out from Transaction. It could naturally derive from the existing RAMBufferSizeMB, ie if NRT forces a flush, so long as it's tiny, put it into the local RAMDir instead of the actual Dir, then "deduct" that size from the allowed budget of DW's RAM usage. When RAMDir + DW exceeds RAMBufferSizeMB, we then merge all of RAMDir's segments into a "real" segment in the directory. |
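The RAM accounting described in the last paragraph can be modeled with a few counters. The sketch below is a toy model under stated assumptions: `RamBudget` and its method names are hypothetical, it only tracks byte counts rather than touching real Lucene directories, and it simplifies by treating a flushed segment as the same size as the buffer it came from.

```java
// Toy model of sharing one RAM budget between the writer's document buffer (DW)
// and small segments that NRT flushes place in a local RAMDir.
final class RamBudget {
    private final long budgetBytes;   // plays the role of RAMBufferSizeMB, in bytes
    private long writerBufferBytes;   // bytes currently buffered by DW
    private long ramDirBytes;         // bytes held by small segments in RAMDir
    private int mergesToDisk;         // times RAMDir was merged into a "real" segment
    private int flushesToDisk;        // times DW alone had to be flushed to disk

    RamBudget(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    // Called as documents are added and buffered by the writer.
    void onDocumentBuffered(long bytes) {
        writerBufferBytes += bytes;
        enforceBudget();
    }

    // An NRT reopen forces a flush: the buffered docs become a small segment in
    // RAMDir, so their size moves from the DW counter to the RAMDir counter.
    void onNrtFlush() {
        ramDirBytes += writerBufferBytes;
        writerBufferBytes = 0;
        enforceBudget();
    }

    // When RAMDir + DW exceeds the budget, merge all RAMDir segments to disk.
    private void enforceBudget() {
        if (ramDirBytes + writerBufferBytes <= budgetBytes) {
            return;
        }
        if (ramDirBytes > 0) {
            ramDirBytes = 0;
            mergesToDisk++;
        }
        // If the buffer alone is still over budget, flush it straight to disk.
        if (writerBufferBytes > budgetBytes) {
            writerBufferBytes = 0;
            flushesToDisk++;
        }
    }

    long writerBufferBytes() { return writerBufferBytes; }
    long ramDirBytes() { return ramDirBytes; }
    int mergesToDisk() { return mergesToDisk; }
    int flushesToDisk() { return flushesToDisk; }
}
```

For example, with a budget of 100 bytes, buffering 30, reopening, buffering 40, reopening, and buffering another 40 pushes the total to 110, which triggers one merge of the RAMDir segments and leaves 40 bytes in the writer's buffer.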
|
|
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Rutherglen updated LUCENE-1313:
-------------------------------------
Attachment: LUCENE-1313.jar

Latest realtime code, transactions are removed.
* Needs to be benchmarked
* There could be concurrency issues around deletes that occur while directories are being flushed to disk.
* It's Java JARed to include the files and directory structure. The patch relies on LUCENE-1516, which if included would make the changes incomprehensible. |
|
|
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697226#action_12697226 ]
Jason Rutherglen commented on LUCENE-1313:
------------------------------------------
{quote} Still, it's synthetic. If you guys (LinkedIn) have a way to fold in some realism into the test, that'd be great, if only "our app ingests at X docs(MB)/sec and reopens the NRT reader X times per second" to set our ballpark. {quote}

The test we need to progress to is running the indexing side endlessly while also reopening every X seconds, then concurrently running searches. This way we can play with a bunch of settings (mergescheduler threads, merge factors, max merge docs, etc), use the python code to generate a dozen cases, execute them and find out what seems optimal for our corpus. It's a bit of work but probably the only way each Lucene user can conclusively say they have the optimal settings needed for their system.

Usually there is a baseline QPS that is desired, where the reopen delay may be increased to accommodate a lack of QPS.

The ram dir portion of the NRT indexing increases in speed when more threads are allocated, but those compete with search threads, another issue to keep in mind.

It might be good to add some default charting to contrib/benchmark? |
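Generating "a dozen cases" from a handful of settings is essentially a cross product. The following is a minimal sketch of that step in Java; it is not the Python code mentioned in the comment and not part of contrib/benchmark, and the `BenchmarkCases` class as well as the setting names and values in the usage note are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BenchmarkCases {
    // Builds every combination of the given settings. Each returned map is one
    // benchmark case: setting name -> chosen value. Insertion order is kept so
    // the cases come out in a predictable order.
    static List<Map<String, Integer>> crossProduct(Map<String, int[]> settings) {
        List<Map<String, Integer>> cases = new ArrayList<Map<String, Integer>>();
        cases.add(new LinkedHashMap<String, Integer>());
        for (Map.Entry<String, int[]> setting : settings.entrySet()) {
            List<Map<String, Integer>> next = new ArrayList<Map<String, Integer>>();
            for (Map<String, Integer> partial : cases) {
                for (int value : setting.getValue()) {
                    Map<String, Integer> extended = new LinkedHashMap<String, Integer>(partial);
                    extended.put(setting.getKey(), value);
                    next.add(extended);
                }
            }
            cases = next;
        }
        return cases;
    }
}
```

Passing, say, two merge factors, three merge thread counts, and two reopen delays yields 2 x 3 x 2 = 12 cases, each of which could then be turned into one benchmark run.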
[jira] Commented: (LUCENE-1313) Realtime Search

[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697444#action_12697444 ]

Michael McCandless commented on LUCENE-1313:
--------------------------------------------

{quote}
The test we need to progress to is running the indexing side endlessly while also reopening every X seconds, then concurrently running searches
{quote}

Do you have a sense of what we'd need to add to contrib/benchmark to make this test possible? LUCENE-1516 takes the first baby step (adds a "NearRealtimeReaderTask").

{quote}
Usually there is a baseline QPS that is desired, where the reopen delay may be increased to accommodate a lack of QPS.
{quote}

Right -- that's the point I made on java-dev about the "freedom" we have wrt NRT's performance.

{quote}
The ram dir portion of the NRT indexing increases in speed when more threads are allocated but those compete with search threads, another issue to keep in mind.
{quote}

Well, single-threaded indexing speed is also improved by using RAM dir. I.e., the use of RAM dir is orthogonal to the app's use of threads for indexing?

{quote}
It might be good to add some default charting to contrib/benchmark?
{quote}

I've switched to Google's visualization API (http://code.google.com/apis/visualization/) which is a delight (they have a simple-to-use Python wrapper). It'd be awesome to somehow get simple charting folded into benchmark... maybe start w/ sheer data export (as a tab/comma delimited line file), and then have a separate step that slurps that data in and makes a [Google vis] chart.
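The two-step idea at the end of the comment above (export delimited data first, chart it in a separate step) might look something like this minimal sketch. Everything here is an assumption for illustration: the class `BenchmarkTsvSketch`, the `Sample` fields, and the column names are hypothetical and are not part of contrib/benchmark. The charting step itself is not shown; `fromTsv` only covers the "slurp the data back in" half.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical export/import of benchmark results as tab-delimited lines. */
class BenchmarkTsvSketch {

    /** One benchmark result row; the fields are illustrative only. */
    static final class Sample {
        final String caseName;
        final double docsPerSec;
        final double qps;

        Sample(String caseName, double docsPerSec, double qps) {
            this.caseName = caseName;
            this.docsPerSec = docsPerSec;
            this.qps = qps;
        }
    }

    /** Step 1: write a header line followed by one tab-delimited line per sample. */
    static String toTsv(List<Sample> samples) {
        StringBuilder sb = new StringBuilder("case\tdocsPerSec\tqps\n");
        for (Sample s : samples) {
            sb.append(s.caseName).append('\t')
              .append(s.docsPerSec).append('\t')
              .append(s.qps).append('\n');
        }
        return sb.toString();
    }

    /** Step 2 (input half): read the lines back so a separate tool can chart them. */
    static List<Sample> fromTsv(String tsv) {
        List<Sample> out = new ArrayList<>();
        String[] lines = tsv.split("\n");
        for (int i = 1; i < lines.length; i++) { // index 0 is the header line
            if (lines[i].isEmpty()) {
                continue;
            }
            String[] cols = lines[i].split("\t");
            out.add(new Sample(cols[0],
                    Double.parseDouble(cols[1]),
                    Double.parseDouble(cols[2])));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Sample> samples = new ArrayList<>();
        samples.add(new Sample("mergeFactor=10", 1200.0, 35.5));
        System.out.print(toTsv(samples));
    }
}
```

Keeping the export as plain delimited text means the charting side can be swapped out (a Python script, a spreadsheet, or a visualization API) without touching the benchmark run itself.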