|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731332#action_12731332 ]

Michael Busch commented on LUCENE-1693:
---------------------------------------

{quote}
There is still the problem with a TokenStream overriding a deprecated method of a core filter that will never be called anymore (see LUCENE-1678, which faces the same problem). I will try to fix this here using the same mechanism.

I tested with mixing contrib tokenfilters and core filters. I have seen no problems.
{quote}

Yeah, that is a good fix for overriding the non-final methods of the core filters. I guess what I meant here is that my invalid use case could happen in the field: let's say something like tee/sink lived in a third-party jar, and the user upgrades to Lucene 2.9 and also upgrades their own streams/filters, but a version of the third-party jar that has the new implementations is not available yet. The user couldn't simply implement both the new and the old API and then use such a filter with the not-updated third-party jar, unless there was an only-old-API switch.

But I'm not sure how realistic this scenario is.
I guess we'll find out sooner or later :)

> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
>                 Key: LUCENE-1693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1693
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer tries to call
>   incrementToken() initially once to see if the exception is thrown;
>   if so, it falls back to the old API.
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a single
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses.
> The cloning performance can be greatly improved if not multiple
> AttributeImpl instances are used in one TokenStream. A user can
> e.g. simply add a Token instance to the stream instead of the individual
> attributes. Or the user could implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, then the
> serialization should benefit even significantly more from the new API
> than cloning.
> Also, the TokenStream API does not change, except for the removal
> of the set/getUseNewAPI methods. So the patches in LUCENE-1460
> should still work.
> All core tests pass, however, I need to update all the documentation
> and also add some unit tests for the new AttributeSource
> functionality. So this patch is not ready to commit yet, but I wanted
> to post it already for some feedback.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...
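The captureState/restoreState item in the issue description quoted above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the code from the patch: `Impl`, `TermAndType`, `Source` and `recomputeCount` are hypothetical stand-ins for AttributeImpl, a multi-interface implementation such as Token, and AttributeSource, and attribute interfaces are keyed by plain strings to keep the example self-contained.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class StateCaptureSketch {

    /** Hypothetical stand-in for AttributeImpl: cloneable, and able to copy its values into a target. */
    public abstract static class Impl implements Cloneable {
        public abstract void copyTo(Impl target);

        @Override
        public Impl clone() {
            try {
                return (Impl) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    /** One implementation object that backs two attribute "interfaces" at once. */
    public static final class TermAndType extends Impl {
        public String term = "";
        public String type = "word";

        @Override
        public void copyTo(Impl target) {
            TermAndType t = (TermAndType) target;
            t.term = term;
            t.type = type;
        }
    }

    /** Singly linked list of implementation instances; deepClone() copies every node. */
    public static final class State {
        public Impl impl;
        public State next;

        public State deepClone() {
            State copy = new State();
            copy.impl = impl.clone();
            if (next != null) {
                copy.next = next.deepClone();
            }
            return copy;
        }
    }

    /** Hypothetical stand-in for AttributeSource. */
    public static final class Source {
        private final Map<String, Impl> attributes = new LinkedHashMap<>();
        private State cached;            // precomputed list over the live instances
        public int recomputeCount = 0;   // only here to make the caching observable

        public void add(String attributeName, Impl impl) {
            attributes.put(attributeName, impl);
            cached = null;               // the state layout changed, so recompute lazily
        }

        private State current() {
            if (cached == null && !attributes.isEmpty()) {
                recomputeCount++;
                Map<Impl, Boolean> seen = new IdentityHashMap<>();
                State head = null;
                State tail = null;
                for (Impl impl : attributes.values()) {
                    if (seen.put(impl, Boolean.TRUE) != null) {
                        continue;        // same instance registered under another name
                    }
                    State node = new State();
                    node.impl = impl;
                    if (head == null) {
                        head = node;
                    } else {
                        tail.next = node;
                    }
                    tail = node;
                }
                cached = head;
            }
            return cached;
        }

        /** Returns a deep copy of the unique implementation instances. */
        public State captureState() {
            State head = current();
            return head == null ? null : head.deepClone();
        }

        /** Copies the saved values back; assumes the state was captured from the same layout. */
        public void restoreState(State saved) {
            State live = current();
            for (State s = saved; s != null && live != null; s = s.next, live = live.next) {
                s.impl.copyTo(live.impl);
            }
        }
    }

    /** Number of nodes in a captured state. */
    public static int length(State state) {
        int n = 0;
        for (State s = state; s != null; s = s.next) {
            n++;
        }
        return n;
    }
}
```

When a single instance is registered under two names, the captured state contains one node, so one clone per capture instead of two. That is the effect the description attributes to adding a single Token instead of individual attributes.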
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731334#action_12731334 ]

Michael Busch commented on LUCENE-1693:
---------------------------------------

{quote}
Have you seen my backwards compatibility test, too?
{quote}

Oh cool... I guess I should have checked that before... :)

I think if I remove the invalid tests it looks pretty similar to yours, so let's keep the one you have in the patch.

I'm going to bed now... will add the new tee/sink stuff tomorrow.
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731337#action_12731337 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

bq. But I'm not sure how realistic this scenario is. I guess we'll find it out sooner or later

I think in 2.9 we have a lot of BW breaks (see the long list at the beginning of CHANGES.txt). This not-so-realistic case is just one more :-) There should be a BW note in CHANGES.txt about the new TokenStream API and possible traps (something like: "We did our best to make it BW-compatible, but there may be the following problems: <list>. You should upgrade all your TokenStreams as soon as possible, especially if strange behaviour occurs.")

Do you have an example of TeeSink and CachingAttributesFilter?

bq. Yeah that is a good fix for overriding the non-final methods of the core filters

I will try to do this now.
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731555#action_12731555 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

After a whole day of thinking about a solution for overriding deprecated methods, I came to one conclusion/solution that would create a "visible" backwards break (to be noted in CHANGES.txt).

Mike's idea from LUCENE-1678 is good, but very complicated for this issue and may lead to unpredictable behavior. What makes me think that this will not be a problem for developers is the fact that there is no JIRA issue about a similar break.

When Lucene switched from next() to next(reusableToken), we also had a compatibility method in TokenStream that delegates to next(new Token()). Core streams did *not* implement the old method, and the indexer code only called next(Token). If somebody had overridden only the old next() method of a core tokenstream, this method would never have been called -> boom, we had a break, but nobody realized it.

With the new patch, we have the same in 2.9 for incrementToken vs. next(Token) and also next(). In principle it is the same issue as in LUCENE-1678.

The good thing is that most TokenStreams in core are final, except the following ones:
- ISOLatin1Filter
- KeywordTokenizer
- StandardTokenizer

and, last but not least, the whole structure of subclasses of CharTokenizer. The good thing (and thanks to the developer) is that they are correctly implemented, making their methods incrementToken and next(Token) *final*. Haha, nobody could override them: the class is not final, but the affected methods are. So all subclasses of CharTokenizer are also not affected.
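The silent break described above can be reproduced with a small sketch. It is a minimal illustration only, assuming hypothetical classes `BaseStream` and `UserStream` and plain strings in place of Token; it is not Lucene code. The old no-argument method delegates to the newer one, the consumer calls only the newer one, and a subclass that overrides only the old method is never consulted.

```java
public class SilentBreakSketch {

    /** Hypothetical core stream: the old API delegates to the newer, reusable one. */
    public static class BaseStream {
        private int remaining = 2;

        /** Newer API. The consumer below calls only this method. */
        public String next(StringBuilder reusable) {
            if (remaining-- <= 0) {
                return null;
            }
            reusable.setLength(0);
            return reusable.append("core").toString();
        }

        /** Older API, kept only as a compatibility shim. */
        public String next() {
            return next(new StringBuilder());
        }
    }

    /** User subclass written against the old API only. */
    public static class UserStream extends BaseStream {
        @Override
        public String next() {
            String token = super.next();
            return token == null ? null : token.toUpperCase();
        }
    }

    /** Stand-in for the indexer: it only ever calls the newer method. */
    public static String consume(BaseStream stream) {
        StringBuilder reusable = new StringBuilder();
        StringBuilder out = new StringBuilder();
        for (String t = stream.next(reusable); t != null; t = stream.next(reusable)) {
            out.append(t).append(' ');
        }
        return out.toString().trim();
    }
}
```

Here `consume(new UserStream())` yields lower-case tokens: the upper-casing override compiles and works when called directly, but the consumer never reaches it, and nothing reports the problem.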
My latest patch also includes this *final* modifier for the abstract CharTokenizer:

{code}
public final Token next(final Token reusableToken) throws IOException {
  // Overriding this method to make it final as before has no effect for the reflection-wrapper in TokenStream.
  // TokenStream.hasReusableNext is true because of this, but it is never used, as incrementToken() has preference.
  return super.next(reusableToken);
}
{code}

So it is not overridable and is still compatible (code calling next(Token) will be delegated to incrementToken() by the superclass). For complete correctness, next() should be overridden in the same way. In both cases the superclass method always prefers delegating to incrementToken(), so even though a subclass of TokenStream overrides these methods (and so hasNext == true and hasReusableNext == true), incrementToken() is still preferred, and everything works.

This prevents users from overriding next() or next(Token) of core or contrib tokenstreams (which in my opinion nobody has ever done, because if somebody had, we would have a bug report regarding the last transition). For those people who really have done it (they used one of the three classes above as the superclass of their own class), the error would not be detectable. Their TokenStream would simply not work, as next()/next(Token) is never called. To produce a compile error for them (or a runtime error when they instantiate such a class), I suggest including a backwards break (which is better than failing silently).

All non-final TokenStreams/Tokenizers/TokenFilters should simply include the code snippet above to redeclare next() *and* next(Token) as final (only delegating to super). Instead of failing silently, users will get runtime linker errors (when they replace the Lucene jar) or compile errors.

We have done a similar change in TokenFilter, because we made the delegate stream final to prevent disturbing the attributes (Mike has done this in LUCENE-1636).
CHANGES.txt would contain this as a BW break together with the other breaks.

Any comments? Michael, what do you think?
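The comment above refers to flags such as hasNext and hasReusableNext and to a reflection wrapper in TokenStream. One way such flags could be computed is sketched below. This is an assumption for illustration, not the implementation from the patch: `Base`, `CoreLike`, `OldOnly`, `overrides` and `preferredApi` are hypothetical names, and StringBuilder stands in for Token. The sketch also shows the proposed final redeclaration, which turns a silent failure into a compile error.

```java
public class OverrideDetectionSketch {

    /** Hypothetical base class offering all three generations of the API. */
    public static class Base {
        public boolean incrementToken() { return false; }
        public String next(StringBuilder reusable) { return null; }
        public String next() { return next(new StringBuilder()); }
    }

    /** Core-style stream: implements the newest API and seals the two older methods. */
    public static class CoreLike extends Base {
        @Override public boolean incrementToken() { return false; }
        @Override public final String next(StringBuilder reusable) { return super.next(reusable); }
        @Override public final String next() { return super.next(); }
        // A subclass declaring "public String next()" here would be rejected by the compiler,
        // because the method is final. That is the visible break proposed in the comment.
    }

    /** Stream written against the oldest API only. */
    public static class OldOnly extends Base {
        @Override public String next() { return null; }
    }

    /** True if the given type, or one of its superclasses below Base, overrides the method. */
    public static boolean overrides(Class<? extends Base> type, String name, Class<?>... params) {
        try {
            return type.getMethod(name, params).getDeclaringClass() != Base.class;
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Picks the API to call, always preferring the newest one that is actually implemented. */
    public static String preferredApi(Class<? extends Base> type) {
        if (overrides(type, "incrementToken")) {
            return "incrementToken()";
        }
        if (overrides(type, "next", StringBuilder.class)) {
            return "next(reusable)";
        }
        if (overrides(type, "next")) {
            return "next()";
        }
        return "none";
    }
}
```

For `CoreLike`, the check for the reusable method is true because of the final redeclaration, yet `preferredApi` still picks `incrementToken()`. That matches the observation in the code comment that the flag is set but never used.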
|
|
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731555#action_12731555 ]

Uwe Schindler edited comment on LUCENE-1693 at 7/15/09 10:04 AM:
-----------------------------------------------------------------

After a whole day of thinking about a solution for overriding deprecated methods, I came to one conclusion/solution: I would create a "visible" backwards break (to be noted in CHANGES.txt).

Mike's idea from LUCENE-1678 is good, but very complicated for this issue and may lead to unpredictable behavior. What makes me think that this will not be a problem for developers is the fact that there is no JIRA issue about a similar break in the past:

When Lucene switched from next() to next(reusableToken), we also had a compatibility method in TokenStream that delegates to next(new Token()). Core streams did *not* implement the old method, and the indexer code only called next(Token). If somebody had overridden only the old next() method of a core tokenstream, this method would never have been called -> boom, we have a break, but nobody realized it.

With the new patch, we have the same in 2.9 for incrementToken vs. next(Token) and also next(). In principle it is the same issue as in LUCENE-1678.

The good thing is that most TokenStreams in core are final, except the following ones:
- ISOLatin1Filter
- KeywordTokenizer
- StandardTokenizer

...and, last but not least, the whole structure of subclasses of CharTokenizer. The good thing is (and thanks to the developer!) that they are correctly implemented, making their methods incrementToken and next(Token) *final*. Haha, nobody could override them: the class is not final, but the affected methods are. So all subclasses of CharTokenizer are also not affected.
My latest patch also includes this *final* modifier for the abstract CharTokenizer:

{code}
public final Token next(final Token reusableToken) throws IOException {
  // Overriding this method to make it final as before has no effect for the reflection-wrapper in TokenStream.
  // TokenStream.hasReusableNext is true because of this, but it is never used, as incrementToken() has preference.
  return super.next(reusableToken);
}
{code}

So it is not overridable and is still compatible (code calling next(Token) will be delegated to incrementToken() by the superclass). For complete correctness, next() should be overridden in the same way. In both cases the superclass method always prefers delegating to incrementToken(), so even though a subclass of TokenStream overrides these methods (and so hasNext == true and hasReusableNext == true), incrementToken() is still preferred, and everything works.

This prevents users from overriding next() or next(Token) of core or contrib tokenstreams (which in my opinion nobody has ever done, because if somebody had, we would have a bug report regarding the last transition). For those people who really have done it (they used one of the three classes above as the superclass of their own class), the error would not be detectable. Their TokenStream would simply not work, as next()/next(Token) is never called. To produce a compile error for them (or a runtime error when they instantiate such a class), I suggest including this backwards break (which is better than failing silently).

All non-final TokenStreams/Tokenizers/TokenFilters should simply include the code snippet above to redeclare next() *and* next(Token) as final (only delegating to super) in the first subclass that implements incrementToken(). Instead of failing silently, users will get runtime linker errors (when they replace the Lucene jar) or compile errors.
We have done a similar change in TokenFilter, because we made the delegate stream final to prevent disturbing the attributes (Mike has done this in LUCENE-1636).

CHANGES.txt would contain this as a BW break together with the other breaks.

Any comments? Michael, what do you think?
> Also, the TokenStream API does not change, except for the removal > of the set/getUseNewAPI methods. So the patches in LUCENE-1460 > should still work. > All core tests pass, however, I need to update all the documentation > and also add some unit tests for the new AttributeSource > functionality. So this patch is not ready to commit yet, but I wanted > to post it already for some feedback. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731859#action_12731859 ]

Michael Busch commented on LUCENE-1693:
---------------------------------------

I like the cleanup you did! Good that initialize() is gone now.

The only small performance improvement we should probably make is to avoid checking which method in TokenStream is overridden when onlyUseNewAPI==true.

{quote}
To produce a compile error for them (or a runtime error, when they instantiate such a class), I suggest to include this backwards-break (which is better than failing silently). All non-final TokenStreams/Tokenizers/TokenFilters should simply include the code snippet above to redeclare next() and next(Token) as final (only delegating to super) in the first subclass that implements incrementToken().
{quote}

+1. I think this backwards-compatibility break is acceptable and makes sense. Most likely the final was just forgotten in these three classes in the first place; all the other core classes correctly declare these methods as final. So we can more or less consider this a bug fix. And I like that users will get a compile or link error instead of seeing undefined behavior.
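For readers following the thread, the "redeclare as final, only delegating to super" idea can be sketched roughly as follows. This is a simplified illustration in present-day Java, not code from the patch: `SketchStream` and `SketchTokenizer` are hypothetical stand-ins for TokenStream and a non-final core tokenizer, tokens are plain strings, and checked exceptions are left out.

```java
// Hypothetical stand-in for TokenStream: the old-API method delegates to the new one.
abstract class SketchStream {
    /** New API: advance to the next token; returns false at end of stream. */
    abstract boolean incrementToken();

    /** Term of the token the stream is currently positioned on. */
    abstract String currentTerm();

    /** Old API, kept for compatibility; it only delegates to incrementToken(). */
    String next() {
        return incrementToken() ? currentTerm() : null;
    }
}

// Hypothetical non-final class that implements the new API and seals the old method.
class SketchTokenizer extends SketchStream {
    private final String[] terms;
    private int pos = -1;

    SketchTokenizer(String... terms) {
        this.terms = terms;
    }

    @Override
    boolean incrementToken() {
        if (pos + 1 >= terms.length) {
            return false;
        }
        pos++;
        return true;
    }

    @Override
    String currentTerm() {
        return terms[pos];
    }

    // Redeclared only to make it final. A subclass that tries to override next()
    // is now rejected by the compiler instead of being silently ignored.
    @Override
    final String next() {
        return super.next();
    }
}
```

With this in place, a subclass of `SketchTokenizer` that declares its own `next()` no longer compiles, which is the visible break the comment argues for.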
|
|
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1693:
----------------------------------

Attachment: lucene-1693.patch

This is basically your last patch with these changes:
- I removed AttributeSource.setAttributeFactory(factory). Since we now have the constructor that takes the factory as an arg, there should be no need to ever change the factory after a TokenStream was created. It would also lead to problems regarding e.g. Tee/Sink: a user could add attributes to the Tee, then change the factory, then create the sink. How could we then create the same attribute impls for the sink? So I think the right thing to do is to not allow changing the factory after the stream is instantiated.
- I added the initial (untested) version of TeeSinkTokenFilter to demonstrate how I think it should work now. I'll finish it tomorrow or Friday (add more javadocs and a unit test). I'll also add the CachingAttributeTokenFilter, which is essentially the same as the new inner class of TeeSinkTokenFilter. Once I have CATF, the inner class can probably just extend it.
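The argument for fixing the factory at construction time can be illustrated with a small sketch. It is an assumption-laden toy, not the AttributeSource code: `SketchAttributeSource` is a hypothetical class, attributes are keyed by name instead of by interface, and the factory is a plain `java.util.function.Function`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for an attribute source whose factory cannot change.
class SketchAttributeSource {
    // Maps an attribute name to a freshly created implementation object.
    private final Function<String, Object> factory;
    private final Map<String, Object> attributes = new LinkedHashMap<>();

    SketchAttributeSource(Function<String, Object> factory) {
        this.factory = factory; // set once; there is deliberately no setter
    }

    /** Returns the existing attribute or creates one through the fixed factory. */
    Object addAttribute(String name) {
        return attributes.computeIfAbsent(name, factory);
    }

    /** Creates a second source with the same attribute names, built by the same factory. */
    SketchAttributeSource newSink() {
        SketchAttributeSource sink = new SketchAttributeSource(factory);
        for (String name : attributes.keySet()) {
            sink.addAttribute(name);
        }
        return sink;
    }
}
```

Because the factory can never be swapped between adding attributes to the tee and creating the sink, both sides end up with implementations of the same classes, which is the property the Tee/Sink scenario above depends on.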
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731893#action_12731893 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

Ok, looks good. I think you will go to bed now, so our work will not collide. When you start to program again, ask me, so that I can post a patch first (which makes merging simpler). TortoiseSVN has a problem with merging added files, so when applying your patch I have to remove them first :-(

Some comments:
- TeeSinkTokenFilter looks good. I think we should also add a test for it (in principle the version of TestTeeTokenFilter from current trunk, not the one reverted to the old API in the current patch).
- I do not completely understand why this WeakReference is needed between Tee and Sink. If it is needed, the code may fail with an NPE when Reference.get() returns null. Is the idea that one can create a Sink for the Tee and throw the Sink away, and the Tee would then simply no longer pass the attributes to the sink? If this is the case, the check for Reference.get()==null is really missing.
- Should I implement CachingAttributesFilter as a replacement for CachingTokenFilter, or will you do it together with TeeSink?

I will now start to add all the finals to the missing core analyzers.

bq. The only small performance improvement we should probably make is to avoid checking which method in TokenStream is overridden when onlyUseNewAPI==true

I could disable this for next() and next(Token). In the case of incrementToken, it should really check that it is enabled, because not doing so would fail hard and create endless loops. So the check should be there in all cases. But if onlyUseNewAPI is enabled, I could simply define hasNext and hasReusableNext=false. I will do this.
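The missing null check mentioned in the WeakReference comment could look roughly like the sketch below. The classes `SketchTee` and `SketchSink` are hypothetical and far simpler than TeeSinkTokenFilter (states are plain strings); only `java.lang.ref.WeakReference` and its `get()` method are real JDK API.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sink: it just records the states it is given.
class SketchSink {
    final List<String> states = new ArrayList<>();

    void addState(String state) {
        states.add(state);
    }
}

// Hypothetical tee that references its sinks only weakly, so that a sink the
// caller has thrown away can be garbage collected.
class SketchTee {
    private final List<WeakReference<SketchSink>> sinks = new ArrayList<>();

    SketchSink newSink() {
        SketchSink sink = new SketchSink();
        sinks.add(new WeakReference<>(sink));
        return sink;
    }

    /** Forwards a state to every sink that is still reachable; returns how many received it. */
    int publish(String state) {
        int delivered = 0;
        for (Iterator<WeakReference<SketchSink>> it = sinks.iterator(); it.hasNext();) {
            SketchSink sink = it.next().get();
            if (sink == null) {
                // The sink was collected: drop the stale reference instead of
                // dereferencing it, which would throw a NullPointerException.
                it.remove();
                continue;
            }
            sink.addState(state);
            delivered++;
        }
        return delivered;
    }
}
```

As long as the caller keeps a strong reference to a sink it keeps receiving states; once the sink is unreachable, the tee simply stops forwarding to it.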
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731896#action_12731896 ]

Grant Ingersoll commented on LUCENE-1693:
-----------------------------------------

Favor to ask, when this is ready to commit, can you give a few days notice so that the rest of us can look at it before committing? I've been keeping up with the comments, but not the patches.
|
|
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1693:
----------------------------------

Attachment: LUCENE-1693.patch

New patch with some more work. First the fantastic news: as CachingTokenFilter has no API to access the cached attributes/tokens directly, it does not need to be deprecated; it just switched its internal, hidden implementation to incrementToken() and attributes. I also added an additional test to the BW test case that checks whether the caching also works for your strange POSTokens. And it works! You can even mix the consumers, e.g. first use the new API to cache tokens and then replay using the old API. Really cool.

The reason the POSToken was not preserved in the past was an error in TokenWrapper.copyTo(). This method created a new Token and copied the contents into it using reinit(). Now it simply creates a clone and lets the delegate point to it (this is how the caching worked before).

In principle Tee/SinkTokenizer could also work like this; the only problem with this class is that it has a public API that exposes the Token instances to the outside. Because of that, there is no way around deprecating it.

Your new TeeSinkTokenFilter looks good; it only had one problem: it used addAttributeImpl to add the attributes of the Tee to the newly created Sink. Because of this, the sink got the same instances as the parent. With useOnlyNewAPI, this does not have an effect for the standard attributes, as the ctor already created a Token instance as implementation and added it to the stream, so addAttributeImpl had no effect. I changed this to use getAttributeClassesIterator and to add a new attribute instance for each attribute to the sink using addAttribute. As the factory is the same, the attributes are generated in the same way. TeeSinkTokenizer would only *not* work correctly if somebody adds a custom instance using addAttributeImpl in the ctor of another filter in the chain. In this case, the factory would create another impl and restoreState throws IAE. In backwards compatibility mode (default) the newly created sink and also the tee always have the default TokenWrapper implementation, so state restoring also works. You only have a problem if you change useOnlyNewAPI in between (which would always create corrupt chains). Another idea would be to clone all attribute impls and then add them to the sink; the factory would then not be used?

I started to create a test for the new TeeSinkTokenFilter, but there is one thing missing: the original test created a subclass of SinkTokenizer, overriding add() to filter the tokens added to the sink. This functionality is missing with the new API. Would the correct workaround be to plug a filter around the sink and filter the tokens there? The problem then is that the cache always also contains unneeded tokens (the old impl would not store them in the sink). Maybe we add the filter to the TeeSinkTokenFilter (getting a State, which would not work, as the contents of State are package-private?). Somehow else? Or leave it as it is and let the user plug the filter on top of the sink (I prefer this)?
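The trade-off in the last paragraph (filtering on top of the sink versus filtering before caching) can be made concrete with a toy sketch. `SketchCachingSink` and `SketchSinkFilter` are hypothetical names, tokens are plain strings, and nothing here reflects the actual TeeSinkTokenFilter API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sink that caches every token it receives, wanted or not.
class SketchCachingSink {
    private final List<String> cache = new ArrayList<>();

    void add(String token) {
        cache.add(token);
    }

    Iterator<String> replay() {
        return cache.iterator();
    }

    int cachedCount() {
        return cache.size();
    }
}

// Hypothetical filter plugged on top of the sink: it selects tokens at replay time.
class SketchSinkFilter {
    static List<String> consume(SketchCachingSink sink, Predicate<String> accept) {
        List<String> out = new ArrayList<>();
        for (Iterator<String> it = sink.replay(); it.hasNext();) {
            String token = it.next();
            if (accept.test(token)) {
                out.add(token);
            }
        }
        return out;
    }
}
```

The consumer only sees the accepted tokens, but `cachedCount()` still reports everything that was added, which is the extra memory cost the comment points out compared with the old add() override.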
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731947#action_12731947 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- I forgot: I also implemented the final next() methods in all non-final classes. |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731949#action_12731949 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- bq. Favor to ask, when this is ready to commit, can you give a few days' notice so that the rest of us can look at it before committing? I've been keeping up with the comments, but not the patches. No problem. I want to finish this by the weekend, and then you have time to review it. My holidays start next week on Monday, so I have only limited time after that. |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732407#action_12732407 ] Michael Busch commented on LUCENE-1693: --------------------------------------- {quote} As CachingTokenFilter has no API to access the cached attributes/tokens directly, {quote} Oh true :) well, scratch it from the TODO list... We knew it'd work conceptually the same for AttributeSource.State; unlike Tee/Sink, which wouldn't even be safe to use with the new API if it hadn't the getTokens() method for the reasons I explained above. {quote} Another idea would be to clone all attribute impls and then add them to the sink - the factory would then not be used? {quote} Yes, I thought about this for a while. It would be nice to have this (i.e. cloning an AttributeSource) in general: you could reduce the costs for initializing the TokenStreams with onlyUseNewAPI=false. We just need to keep a static AttributeSource around that contains the wrapper and the mappings from the 6 default interfaces. Then, instead of constructing it every time, we just clone that AttributeSource for new TokenStreams. The query parser could do the same to keep initialization costs of the TokenStreams minimal, because it always needs the same attributes. I think it should be easy? We just need to implement clone() for AttributeSource. |
|
|
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732407#action_12732407 ] Michael Busch edited comment on LUCENE-1693 at 7/17/09 1:37 AM: ---------------------------------------------------------------- {quote} As CachingTokenFilter has no API to access the cached attributes/tokens directly, {quote} Oh true :) well, scratch it from the TODO list... We knew it'd work conceptually the same for AttributeSource.State; unlike Tee/Sink, which wouldn't even be safe to use with the new API if it hadn't the getTokens() method for the reasons I explained above. {quote} Another idea would be to clone all attribute impls and then add them to the sink - the factory would then not be used? {quote} Yes, I thought about this for a while. It would be nice to have this (i.e. cloning an AttributeSource) in general: you could reduce the costs for initializing the TokenStreams with onlyUseNewAPI=false. We just need to keep a static AttributeSource around that contains the wrapper and the mappings from the 6 default interfaces. Then, instead of constructing it every time, we just clone that AttributeSource for new TokenStreams. The query parser could do the same to keep initialization costs of the TokenStreams minimal, because it always needs the same attributes. I think it should be easy? We just need to implement clone() for AttributeSource. |
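The "keep a static AttributeSource around and clone it for new TokenStreams" idea from the comment above can be sketched roughly as follows. This is a toy model and not the patch itself: `Impl`, `Source`, `PROTOTYPE` and `newStreamAttributes()` are hypothetical names, and the string keys stand in for the attribute interfaces. The one detail taken from the issue description is that several interfaces may map to the same implementation instance (the wrapper), so each unique instance should be copied only once.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class PrototypeSourceSketch {
    /** Hypothetical attribute implementation holding one mutable value. */
    static final class Impl {
        String value = "";

        Impl copy() {
            Impl c = new Impl();
            c.value = value;
            return c;
        }
    }

    /** Toy attribute source: interface name -> implementation instance. */
    static final class Source {
        final Map<String, Impl> attributes = new LinkedHashMap<>();

        /** Copies the mappings; each unique Impl is copied once so sharing is preserved. */
        Source cloneAttributes() {
            Source copy = new Source();
            Map<Impl, Impl> cloned = new IdentityHashMap<>();
            for (Map.Entry<String, Impl> e : attributes.entrySet()) {
                Impl c = cloned.computeIfAbsent(e.getValue(), Impl::copy);
                copy.attributes.put(e.getKey(), c);
            }
            return copy;
        }
    }

    /** Built once; two interface names share a single wrapper-like instance. */
    static final Source PROTOTYPE = new Source();
    static {
        Impl wrapper = new Impl();
        PROTOTYPE.attributes.put("TermAttribute", wrapper);
        PROTOTYPE.attributes.put("OffsetAttribute", wrapper);
    }

    /** What a new stream would call instead of rebuilding the mappings. */
    static Source newStreamAttributes() {
        return PROTOTYPE.cloneAttributes();
    }
}
```

Each call to `newStreamAttributes()` returns an independent copy, so writing into one stream's attributes does not leak into the prototype or into other streams, while the expensive setup (reflection in the real code, per the later comments) would run only once.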
|
|
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-1693: ---------------------------------- Attachment: lucene-1693.patch I made these changes: - added clone() to AttributeSource and changed TeeSinkTokenFilter to use it. - added a SinkFilter as inner interface of TeeSinkTokenFilter that adds the missing functionality you mentioned, Uwe. |
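To make the SinkFilter change a bit more concrete, here is a small sketch of the behaviour being restored: the tee passes every token downstream, but only tokens accepted by the filter are cached in the sink. Plain strings stand in for token state, `java.util.function.Predicate` stands in for the SinkFilter interface, and `TeeSinkSketch.tee` is a hypothetical helper; the signature of the real SinkFilter in the patch is not shown in this thread, so none of this should be read as its actual API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

final class TeeSinkSketch {
    /**
     * Consumes the input, passes every token on, and caches only the
     * tokens the sink filter accepts.
     */
    static List<String> tee(Iterator<String> input,
                            Predicate<String> sinkFilter,
                            List<String> sinkCache) {
        List<String> passedThrough = new ArrayList<>();
        while (input.hasNext()) {
            String token = input.next();
            if (sinkFilter.test(token)) {
                sinkCache.add(token); // filtered at add time, so the cache stays small
            }
            passedThrough.add(token); // the main stream still sees everything
        }
        return passedThrough;
    }
}
```

Compared with plugging a filter on top of the sink, filtering at add time means the cache never holds the unneeded tokens, which is the drawback raised earlier in the thread.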
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732450#action_12732450 ] Uwe Schindler commented on LUCENE-1693: --------------------------------------- Patch looks good. Only one thing: If you clone a TokenStream, you will not get a TokenStream, only an AttributeSource instance (if TokenStream does not override). For our use case this is OK, because we only want to have the attributes and impls cloned, but it is strange. A real clone() method should call super.clone() and then create new maps and copy the old maps into them. Not sure. Or we do not name the method clone() but cloneAttributes, returning AttributeSource instead of Object. E.g. {code}public AttributeSource cloneAttributes(){code} I will now rewrite the TeeSink test to use the new interface, and the test should then pass as before with Tee/Sink separate (but I will keep both tests available, one that tests the old classes and one that tests the new class). I will also add a test for the cloning to TestAttributeSource. The cloning also speeds up the case with useOnlyNewAPI=true, because the addAttribute call also uses reflection to find out what interfaces are implemented. In my opinion this cost (the while loop with getSuperclass() and so on) is much higher than the simple check of the declaring class for a method. The patch is still missing some javadocs. Once we finish this, are there any other things to do? The optimizations in QueryParser to clone and so on are not really part of this issue, so they could be done separately. 
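The difference between the two options in this comment can be shown with a small toy example. The `Source` and `Stream` classes below are hypothetical stand-ins for AttributeSource and TokenStream, and the attribute map holds plain strings, so copying the map is enough here; real attribute impls would need to be cloned as well.

```java
import java.util.HashMap;
import java.util.Map;

final class CloneVsCopySketch {
    /** Hypothetical stand-in for AttributeSource. */
    static class Source implements Cloneable {
        Map<String, String> attributes = new HashMap<>();

        /** "Real" clone: super.clone() keeps the runtime class, then the map is copied. */
        @Override
        public Source clone() {
            try {
                Source copy = (Source) super.clone();
                copy.attributes = new HashMap<>(attributes);
                return copy;
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // cannot happen, Source is Cloneable
            }
        }

        /** Alternative: always returns a plain Source, whatever the subclass is. */
        Source cloneAttributes() {
            Source copy = new Source();
            copy.attributes.putAll(attributes);
            return copy;
        }
    }

    /** Hypothetical stand-in for TokenStream. */
    static class Stream extends Source {}
}
```

With this shape, `new Stream().clone()` yields another `Stream`, while `new Stream().cloneAttributes()` yields only a `Source`. The method name then says what the caller actually gets, which seems to be the point of the renaming suggestion.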
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732458#action_12732458 ]
Michael Busch commented on LUCENE-1693:
---------------------------------------
{quote} Or we do not call the method clone() and call it cloneAttributes not returning Object but AttributeSource. {quote}
+1. Let's do that.
{quote} If we finished this, are there any other things to do? The optimizations in QueryParser to clone and so on are not really part of this issue, so could be done separately. {quote}
I agree. I don't think these optimizations are critical at this point. I think updating the javadocs should be the only remaining thing here (given that everyone else is OK with this patch).
The other related issues I think will be straightforward... except LUCENE-1448, I have the feeling this will cause some headaches too.... not sure if you read the discussions in 1448 yet, Uwe?
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732478#action_12732478 ]
Uwe Schindler commented on LUCENE-1693:
---------------------------------------
bq. +1. Let's do that.
OK, I will change this locally and remove the Cloneable interface again. In principle, cloneAttributes() should only be used to create a new TokenStream that uses the same attributes but needs different instances. TeeSink is currently the only example of this, but more may follow.
About the TokenStream clones in QueryParser: I think this will not work, because the TokenStream would then need to be really cloned, including setting a new Reader for the input. In my opinion, the reusableTokenStream method of Analyzer should handle this, not QueryParser.
bq. The other related issues I think will be straightforward... except LUCENE-1448, I have the feeling this will cause some headaches too.... not sure if you read the discussions in 1448 yet, Uwe?
Wrrrrr, this one is hard. The final offset does not really fit well into the current attributes API. It could be a new extra attribute that is only updated at the end of the stream (but the problem is that this needs to be done when incrementToken returns false).
...and Mike said all others are trivial :(
I will now update as noted before.
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732483#action_12732483 ]
Mark Miller commented on LUCENE-1693:
-------------------------------------
bq. ...and Mike said all others are trivial
He also said he's willing to skip that one for 2.9 though. I'd rather not if we can help it though - I started looking at it last night, but I got sidetracked before I got very far thinking about it.
|
|
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated LUCENE-1693:
----------------------------------
Attachment: LUCENE-1693.patch
Here is my latest patch before I go to bed. I did not have much time today, but I implemented parts of the TeeSinkTokenFilter test. The first test and the performance test are implemented. The performance test is almost as fast as the old Tee/Sink combination (good news). I found a small bug in the new Sink (it did not lazily create the iterator), but it is fixed and it works as expected (without calling reset() on the sink first).
The second test is not implementable with TeeSinkTokenizer, and this is a limitation: you are not able to combine different sources into one sink (and this is what the second test does). I am not sure how you could implement this at all with the new API. It would only work if both tee streams have exactly the same attributes, so they could feed their attributes into the same sink.
Michael, any idea? The functionality to feed tokens from two streams into one sink is nice, but how to do it? Or is it just a useless theoretical demonstration in the test?
Good night, Uwe