|
View:
New views
20 Messages
—
Rating Filter:
Alert me
|
| < Prev | 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 | Next > |
|
|
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: ---------------------------------- Attachment: LUCENE-1693.patch Attached is a new patch, that implements the last idea: - There is no more copying of Tokens, so the API should have the same speed (almost) as before. - Per default, the chain of TokenStreams/TokenFilters can be mixed completely (test that explicitely tests this is still missing), the drawback is, that there is only *one* attribute instance called TokenWrapper (package private) that manages the exchange of the Token instance behind. - If the user knows, that all tokenizers in his JVM implement incrementToken and do not fallback to next(), he can increase speed by using the static setter setOnlyUseNewAPI(true). In this case, no single TokenWrapper is initialized and code will use the normal Attribute factory to generate the Attributes. If some old code is still available or your consumer calls next(), you will get an UOE during tokenization. The same happens, if you override initialize() and instantiate your attributes manually without super.initialize(). - When the old API is removed, TokenWrapper and large parts inside TokenStream can be removed and incrementToken() made abstract. This is identical to setting onlyUseNewAPI to true. - the api setting can only be static, because the attribute instances are generated during construction of the streams and so a later downgrade to TokenWrapper is not possible. Documentation inside this patch enforce, that at least all core tokenizers and consumers are conformant, so one must be able to set TokenStream.setOnlyUseNewAPI to true and then use StandardAnalyzer without any problem. When contrib is transformed, we can extend this to contrib. Because the code wraps the old API completely, all converted streams can be changed to only implement only incrementToken() using attributes. Super's TokenStream.next() and next(Token) manage the rest. There is no speed degradion by this, it is safe to remove (and all will be happy)! Uwe > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: ---------------------------------- Attachment: LUCENE-1693.patch Sorry, small bug in cloning inside next(): the POSToken-test was failing again. But now it works also correct. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: ---------------------------------- Attachment: (was: LUCENE-1693.patch) > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720676#action_12720676 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- Hi Michael, I did not do any performance tests until now, I think you have the better knowledge about measuring tokenization performance. Important would be to compare perf of: - Old API with useNewAPI=true - Old API with useNewAPI=false - My impl with defaults (onlyUseNewAPI=false) - My impl with onlyUseNewAPI=true For all tests, you should only use conformant streams (e.g. from core). An good additional test would be to create a chain that has completely implemented incrementToken() and one only suplying next() for some chain entries. Is this hard to do? > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720692#action_12720692 ] Shai Erera commented on LUCENE-1693: ------------------------------------ You can run tokenize.alg which invokes the ReadTokenTask, which iterates on a TokenStream. You'll probably need to modify the .alg file to create a different analyzer/token stream each time, and I think this can be done by the "rounds" syntax in benchmark. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720794#action_12720794 ] Michael Busch commented on LUCENE-1693: --------------------------------------- I'm looking at TokenStream.next(): {code:java} public Token next(final Token reusableToken) throws IOException { // We don't actually use reusableToken, but still add this assert assert reusableToken != null; checkTokenWrapper(); return next(); } /** Returns the next token in the stream, or null at EOS. * @deprecated The returned Token is a "full private copy" (not * re-used across calls to next()) but will be slower * than calling {@link #next(Token)} instead.. */ public Token next() throws IOException { checkTokenWrapper(); if (incrementToken()) { final Token token = (Token) tokenWrapper.delegate.clone(); Payload p = token.getPayload(); if (p != null) { token.setPayload((Payload) p.clone()); } return token; } return null; } {code} This seems like a big performance hit for users of the old API, no? Now every single Token will be cloned, even if they implement next(Token), as soon as the users have one filter in the chain that doesn't implement the new API yet. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720809#action_12720809 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- The code is almost identical to before, the old code also copied the token to make it a full private copy. There are three modes of operation: - if incrementToken is implemented, docinverter will use it (the code always calls incrementToken, so no indirection) - if next(Token) is implemented, the docinverterwill call incrementToken which is forwarded to next(Token), which is cheap - if only next() is implemented, the docinverter will call incrementTojen, which forwards to next(Token) and this forwards to next(). But this is identical to before, only one indirection more: the old code got useNewAPI(false) and called next(Token) which forwarded to next() So for indexing using the normal indexing components (docinverter), the code is never cloning more that with your code. There is one other case: if you have an old consumer calling nextToken(Token), the tokenizer only implemented incrementToken, then you will get a performance degradion. But this is not the indexing case, it is e.g. reusing the tokenizer in a very old e.g. QueryParser. I did not find a good way to pass directly for this special case to incrementToken(). The problem is also, that incrementToken uses the internal buffer and not the supplied buffer. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720811#action_12720811 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- Ah I understand the problem: As I told, if a consumer (like a filter() calls next(Token) on the underlying filter), which does not implement this or implements the new API, he will get a performance decrease because of cloning. I think, we should simply test this with the benchmarker. Mixing old and new API is always a performance decrease. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720820#action_12720820 ] Michael Busch commented on LUCENE-1693: --------------------------------------- {quote} Ah I understand the problem: As I told, if a consumer (like a filter() calls next(Token) on the underlying filter), which does not implement this or implements the new API, he will get a performance decrease because of cloning. I think, we should simply test this with the benchmarker. Mixing old and new API is always a performance decrease. {quote} Yes that's what I mean. But I think this will almost be the most common use case: I would think most users have chains that mix core streams/filters with custom filters. Also I assume most users who need high performance switched from next() to next(Token) by now. These users will see a performance degradation, which I predict will be similar or worse as going back to using next(), unless they implement the new API in their filters right away. So those users will see a performance hit if they just do a drop-in replacement of the lucene jar. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720822#action_12720822 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- I could change the calling chain: incrementToken() calls next() calls next(Token), would this be better. next(Token) would per default set the delegate to the reuseable token. hmhm - thinking about it. Where is then the degradion? > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720846#action_12720846 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- I have a solution to build in some shortcuts: in initialize I use reflection (see the earlier patch) to find out, which of the three methods is implemented (check if this.getClass().getMethod(name,params).getDeclaringClass() == TokenStream.class, when this is true, the method was *not* overridden). in incrementToken() the method checks if either next(Token) or next() is implemented and calls direct. The same in the other classes. next() should be ideally never called then. I will post a patch later. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720849#action_12720849 ] Mark Miller commented on LUCENE-1693: ------------------------------------- Should I wait to put in the Highlighter update till you guys are done here? > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720854#action_12720854 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- bq. Should I wait to put in the Highlighter update till you guys are done here? You can start with highlighter, if this patch goes through, we can remove the next() methods from all tokenizers. For consumers like the highlighter, there will be no need anymore to switch between old/new api. Just use the new API, it will also work with old tokenizers. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720887#action_12720887 ] Michael Busch commented on LUCENE-1693: --------------------------------------- I'm not convinced yet that we will be able to remove the implementations of next() and next(Token). Mark, I'm not familiar with what changes you need to make to the highlighter, but you should not rely yet on the fact that next() and nextToken() won't have to be implemented anymore. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693: ---------------------------------- Attachment: LUCENE-1693.patch Here my solution: The three default methods are now optimized to use the shortest path to the by subclasses implemented iteration method. The implemented iteration methods are determined by reflection in initialize(). Cloning now only done, if next() is directly called by a consumer, in all other cases the reuseableToken is used for passing the attributes around. The new TokenStream also checks in initialize, that one of the "abstract" methods is overridden. Because of this TestIndexWriter and the inverter singleton state was updated to at least have an empty incrementToken(). Because of this check, nobody can create a TokenStream, that loops indefinite after calling next() because no pseuso-abstract method was overridden. As incrementToken will be abstract in future, it must always be implemented, and this is what I have done. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-1693: ---------------------------------- Attachment: TestCompatibility.java Slightly changes tool yields on 2.4 and identically on trunk + my patch: {noformat} new tokenstream --> proper noun api new tokenstream --> proper noun api new tokenstream api {noformat} On trunk + your latest patch: {noformat} new tokenstream --> proper noun api new tokenstream api Exception in thread "main" java.lang.ClassCastException: org.apache.lucene.util.AttributeSource$State at org.apache.lucene.analysis.SinkTokenizer.next(SinkTokenizer.java:97) at org.apache.lucene.analysis.TestCompatibility.consumeStream(TestCompatibility.java:97) at org.apache.lucene.analysis.TestCompatibility.main(TestCompatibility.java:90) {noformat} It runs three tests. The first is good with your patch; the second doesn;t seem to preserve the right Token subclass; the third throws a ClassCastException. I haven't debugged why... > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720913#action_12720913 ] Michael Busch commented on LUCENE-1693: --------------------------------------- You can probably fix CachingTokenFilter and tee/sink to behave correctly. But please remember that a user might have their own implementations of something like a CachingTokenFilter or tee/sink, which must keep working. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720919#action_12720919 ] Michael Busch commented on LUCENE-1693: --------------------------------------- Btw: SinkTokenizer in my patch has a small bug too. I need to throw a UOE in incrementToken() if it was filled using the old API. It should probably also throw a UOE when someone tries to fill it with both, old and new API streams. And that this is not allowed must be made clear in the javadocs. > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720926#action_12720926 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- Exactly: The problem is in SinkTokenizer. when calling next(Token) the result is casted to Token, which does not work (the iterator only contains either Tokens or States, dependent on what was added. As SinkTokenizer and TeeTokenFilter may use different APIs it crahes. The problem with the test is, that depending on chaining with old/new APIs the iter may conatin wron type. This can be fixed by removing next(Token) (preferred) or incrementToken() . The problem is that dependent on chaining it is not clear which method is called and the new/old API should not share the same state information. Because the problem is related new/old API, we should simply remove the old API from both filters, so they share the same instances in all cases! Then we do not need UOE. I will look into and check, why the Token in the second test is not preserverd > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720952#action_12720952 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- The second test does not work, because it always uses per default incrementToken. By the way, the APIdocs and behaviour changed with these three classes, TeeTokenFilter, SinkTokenizer and CachingTokenFilter: e.g. getTokens() does not return what is noted. For backwards-compatiblility we should deprecate the current versions of these class [and only let them implement next(Token)]. They can then be used even together with the new API, but they always work on Token instances. When I remove incrementToken from them your test passes complete. For the new API there should be new classes, that use attributesource and restorestate to cache and so on. But for current backwards compatibility (you mentioned, somebody have written a similar thing): If the user's class only uses next(Token) it will work as before. The problem is mixed implementations of old/new API and different cache contents. This is not a problem of my proposal! Again: We should remove the double implementations everywhere. In these special cases with caches, where the cache should contain a specific class (Tokens or AttributeSource.State), two classes are needed, one deprecated. But: what do you think about my latest patch in general? > AttributeSource/TokenStream API improvements > -------------------------------------------- > > Key: LUCENE-1693 > URL: https://issues.apache.org/jira/browse/LUCENE-1693 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestCompatibility.java, TestCompatibility.java > > > This patch makes the following improvements to AttributeSource and > TokenStream/Filter: > - removes the set/getUseNewAPI() methods (including the standard > ones). Instead by default incrementToken() throws a subclass of > UnsupportedOperationException. The indexer tries to call > incrementToken() initially once to see if the exception is thrown; > if so, it falls back to the old API. > - introduces interfaces for all Attributes. The corresponding > implementations have the postfix 'Impl', e.g. TermAttribute and > TermAttributeImpl. AttributeSource now has a factory for creating > the Attribute instances; the default implementation looks for > implementing classes with the postfix 'Impl'. Token now implements > all 6 TokenAttribute interfaces. > - new method added to AttributeSource: > addAttributeImpl(AttributeImpl). Using reflection it walks up in the > class hierarchy of the passed in object and finds all interfaces > that the class or superclasses implement and that extend the > Attribute interface. It then adds the interface->instance mappings > to the attribute map for each of the found interfaces. > - AttributeImpl now has a default implementation of toString that uses > reflection to print out the values of the attributes in a default > formatting. This makes it a bit easier to implement AttributeImpl, > because toString() was declared abstract before. > - Cloning is now done much more efficiently in > captureState. The method figures out which unique AttributeImpl > instances are contained as values in the attributes map, because > those are the ones that need to be cloned. It creates a single > linked list that supports deep cloning (in the inner class > AttributeSource.State). AttributeSource keeps track of when this > state changes, i.e. whenever new attributes are added to the > AttributeSource. Only in that case will captureState recompute the > state, otherwise it will simply clone the precomputed state and > return the clone. restoreState(AttributeSource.State) walks the > linked list and uses the copyTo() method of AttributeImpl to copy > all values over into the attribute that the source stream > (e.g. SinkTokenizer) uses. > The cloning performance can be greatly improved if not multiple > AttributeImpl instances are used in one TokenStream. A user can > e.g. simply add a Token instance to the stream instead of the individual > attributes. Or the user could implement a subclass of AttributeImpl that > implements exactly the Attribute interfaces needed. I think this > should be considered an expert API (addAttributeImpl), as this manual > optimization is only needed if cloning performance is crucial. I ran > some quick performance tests using Tee/Sink tokenizers (which do > cloning) and the performance was roughly 20% faster with the new > API. I'll run some more performance tests and post more numbers then. > Note also that when we add serialization to the Attributes, e.g. for > supporting storing serialized TokenStreams in the index, then the > serialization should benefit even significantly more from the new API > than cloning. > Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
| < Prev | 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 | Next > |
| Free embeddable forum powered by Nabble | Forum Help |