[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734963#action_12734963 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

bq. if you don't want to deal with multiple Attributes you can simply add a Token to the AttributeSource and cache the Token reference locally in your stream/filter, because Token now implements all core token attributes.

See the test case for AttributeSource, which tests this. But I think it is not as easy if you have a chain of TokenFilters. Only the first one can add the Token impl to the AttributeSource (when the attributes are not yet added). So if one TokenStream adds a TermAttribute and a Token impl is added later, the Token will handle all attributes except the TermAttribute.

To force a whole chain to use Token as its AttributeImpl, the first TokenStream created (normally the Tokenizer) should set an AttributeFactory that creates a Token. All filters will then get it from the parent.

So in general you can add a Token instance to the AttributeSource, but you should still reference the attributes by their interfaces.

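To make the ordering issue from the comment above concrete, here is a minimal sketch of a "first registration wins" attribute map. Everything in it (`ToyAttribute`, `ToyTermAttr`, `ToyOffsetAttr`, `ToyToken`, `ToyAttributeSource`) is a hypothetical stand-in written for this illustration, not Lucene's actual classes, and it assumes the behaviour described in the comment: an interface that already has an implementation keeps it when a later impl is added.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Lucene's classes.
interface ToyAttribute {}
interface ToyTermAttr extends ToyAttribute {}
interface ToyOffsetAttr extends ToyAttribute {}

/** A separate implementation of a single attribute interface. */
class ToyTermAttrImpl implements ToyTermAttr {}

/** Plays the role of Token: one object implementing several attribute interfaces. */
class ToyToken implements ToyTermAttr, ToyOffsetAttr {}

class ToyAttributeSource {
    private final Map<Class<? extends ToyAttribute>, ToyAttribute> attributes =
            new LinkedHashMap<>();

    /**
     * Walks up the class hierarchy of impl and maps each directly declared
     * interface that extends ToyAttribute to impl, unless that interface
     * is already mapped (assumption: the first registration wins).
     */
    void addAttributeImpl(ToyAttribute impl) {
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (iface != ToyAttribute.class && ToyAttribute.class.isAssignableFrom(iface)) {
                    attributes.putIfAbsent(iface.asSubclass(ToyAttribute.class), impl);
                }
            }
        }
    }

    /** Returns the instance registered for the given interface, or null. */
    <T extends ToyAttribute> T getAttribute(Class<T> type) {
        return type.cast(attributes.get(type));
    }
}

class FirstWinsDemo {
    public static void main(String[] args) {
        ToyAttributeSource source = new ToyAttributeSource();
        source.addAttributeImpl(new ToyTermAttrImpl()); // an earlier stream adds a term attribute
        source.addAttributeImpl(new ToyToken());        // a later filter adds the Token-like impl
        System.out.println(source.getAttribute(ToyTermAttr.class).getClass().getSimpleName());
        System.out.println(source.getAttribute(ToyOffsetAttr.class).getClass().getSimpleName());
    }
}
```

In this toy model the Token-like object ends up serving only the interfaces nobody registered before it, which is why the comment recommends referencing attributes through their interfaces rather than through a cached Token reference.
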
> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
> Key: LUCENE-1693
> URL: https://issues.apache.org/jira/browse/LUCENE-1693
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 2.9
>
> Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, PerfTest3.java, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - introduces interfaces for all Attributes. The corresponding
> implementations have the postfix 'Impl', e.g. TermAttribute and
> TermAttributeImpl. AttributeSource now has a factory for creating
> the Attribute instances; the default implementation looks for
> implementing classes with the postfix 'Impl'. Token now implements
> all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
> addAttributeImpl(AttributeImpl). Using reflection, it walks up the
> class hierarchy of the passed-in object and finds all interfaces
> that the class or its superclasses implement and that extend the
> Attribute interface. It then adds the interface->instance mappings
> to the attribute map for each of the found interfaces.
> - removes the set/getUseNewAPI() methods (including the standard
> ones). Instead, it is now enough to implement only the new API;
> if an old TokenStream still implements the old API (next()/next(Token)),
> it is wrapped automatically. The delegation path is determined via
> reflection (the patch determines which of the three methods was
> overridden).
> - Token is no longer deprecated; instead it implements all 6 standard
> token interfaces (see above). The wrapper for next() and next(Token)
> uses this to automatically map all attribute interfaces to one
> TokenWrapper instance (implementing all 6 interfaces) that contains
> a Token instance. next() and next(Token) exchange the inner Token
> instance as needed. For the new incrementToken(), only one
> TokenWrapper instance is visible, delegating to the current reusable
> Token. This API also preserves custom Token subclasses that may be
> created by very special token streams (see example in Backwards-Test).
> - AttributeImpl now has a default implementation of toString that uses
> reflection to print out the values of the attributes in a default
> formatting. This makes it a bit easier to implement AttributeImpl,
> because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
> captureState. The method figures out which unique AttributeImpl
> instances are contained as values in the attributes map, because
> those are the ones that need to be cloned. It creates a singly
> linked list that supports deep cloning (in the inner class
> AttributeSource.State). AttributeSource keeps track of when this
> state changes, i.e. whenever new attributes are added to the
> AttributeSource. Only in that case will captureState recompute the
> state; otherwise it will simply clone the precomputed state and
> return the clone. restoreState(AttributeSource.State) walks the
> linked list and uses the copyTo() method of AttributeImpl to copy
> all values over into the attribute that the source stream
> (e.g. SinkTokenizer) uses.
> - Tee- and SinkTokenizer were deprecated, because they use
> Token instances for caching. This is not compatible with the new API
> using AttributeSource.State objects. You can still use the old
> deprecated ones, but new features provided by new Attribute types
> may get lost in the chain. A replacement is a new TeeSinkTokenFilter,
> which has a factory to create new Sink instances that have compatible
> attributes. Sink instances created by one Tee can also be added to
> another Tee, as long as the attribute implementations are compatible
> (it is not possible to add a sink from a tee using one Token instance
> to a tee using the six separate attribute impls). In this case UOE is thrown.
> Cloning performance can be greatly improved if a TokenStream does
> not use multiple AttributeImpl instances. A user can
> e.g. simply add a Token instance to the stream instead of the individual
> attributes. Or the user could implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, then the
> serialization should benefit even more significantly from the new API
> than cloning.
> This issue contains one backwards-compatibility break:
> TokenStreams/Filters/Tokenizers should normally be final
> (see LUCENE-1753 for the explanation). Some of these core classes are
> not final and so one could override the next() or next(Token) methods.
> In this case, the backwards-wrapper would automatically use
> incrementToken(), because it is implemented, so the overridden
> method is never called. To protect users from errors that are not visible
> during compilation or testing (the streams just behave incorrectly),
> this patch makes all implementation methods final
> (next(), next(Token), incrementToken()) whenever the class
> itself is not final. This is a BW break, but users will clearly see
> that they have done something unsupported and should instead
> create a custom TokenFilter with their additional implementation
> (instead of extending a core implementation).
> When converting further contrib token streams, the following procedure should be used:
> * rewrite and replace next(Token)/next() implementations by new API
> * if the class is final, no next(Token)/next() methods needed (must be removed!!!)
> * if the class is non-final add the following methods to the class:
> {code:java}
> /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
>  * not be overridden. Delegates to the backwards compatibility layer. */
> public final Token next(final Token reusableToken) throws java.io.IOException {
>   return super.next(reusableToken);
> }
> /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
>  * not be overridden. Delegates to the backwards compatibility layer. */
> public final Token next() throws java.io.IOException {
>   return super.next();
> }
> {code}
> Also the incrementToken() method must be final in this case
> (and the new method end() of LUCENE-1448)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...

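The issue description says the delegation path is chosen by using reflection to find out which of the three methods a subclass overrides. A rough sketch of that kind of check follows. `ToyStream`, its `String`-based method signatures and the `overrides` helper are assumptions made up for this example; the patch itself may detect overrides differently.

```java
import java.lang.reflect.Method;

// Hypothetical base class with two old-style methods and one new-style method.
// It only mimics the shape of the problem; it is not Lucene's TokenStream.
abstract class ToyStream {
    /** New-style API. */
    public boolean incrementToken() { return false; }

    /** Old-style API without a reusable argument. */
    public String next() { return null; }

    /** Old-style API with a reusable argument (a String stands in for Token). */
    public String next(String reusable) { return next(); }

    /**
     * Returns true if cls, or an ancestor of cls below ToyStream, declares its own
     * version of the public method with the given name and parameter types.
     */
    static boolean overrides(Class<? extends ToyStream> cls, String name, Class<?>... params) {
        try {
            Method m = cls.getMethod(name, params);
            // getMethod resolves to the most specific declaration; if that is
            // still ToyStream, nothing in the subclass chain overrode it.
            return m.getDeclaringClass() != ToyStream.class;
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such method: " + name, e);
        }
    }
}

/** A stream that only implements the old API. */
class OldStyleStream extends ToyStream {
    @Override public String next() { return "token"; }
}

/** A stream that only implements the new API. */
class NewStyleStream extends ToyStream {
    @Override public boolean incrementToken() { return true; }
}

class OverrideDemo {
    public static void main(String[] args) {
        System.out.println(ToyStream.overrides(OldStyleStream.class, "next"));
        System.out.println(ToyStream.overrides(OldStyleStream.class, "incrementToken"));
        System.out.println(ToyStream.overrides(NewStyleStream.class, "incrementToken"));
    }
}
```

A wrapper built on such a check could route calls to whichever method the subclass actually supplies. It also shows why a non-final core class is risky: if a core class already implements incrementToken(), a user override of next() would be detected but never called.
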
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734982#action_12734982 ]

Michael Busch commented on LUCENE-1693:
---------------------------------------

{quote}
So in general you can add an Token instance to the AttributeSource but should still reference the attributes by the interfaces.
{quote}

I completely agree. We should discourage users from referencing Token and encourage them to use the Attribute interfaces instead. That's the whole beauty and flexibility of this new API. However, using Token as the actual implementing instance can be convenient for optimizing caching or serialization performance.

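To illustrate why a single implementing instance helps caching, here is a small model of capturing state by cloning only the unique implementation objects. `ToyImpl` and `ToySource` are invented for this sketch, attribute interfaces are represented by plain strings, and an `ArrayList` replaces the singly linked list mentioned in the issue description, so treat it as a simplification and not as the patch's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical attribute implementation holding a single value. */
class ToyImpl implements Cloneable {
    String value = "";

    /** Copies this object's value into target. */
    void copyTo(ToyImpl target) { target.value = value; }

    @Override public ToyImpl clone() {
        try {
            return (ToyImpl) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: ToyImpl is Cloneable
        }
    }
}

/** Hypothetical attribute holder; several attribute names may share one instance. */
class ToySource {
    private final Map<String, ToyImpl> attributes = new LinkedHashMap<>();

    void put(String attributeName, ToyImpl impl) { attributes.put(attributeName, impl); }

    /** Distinct implementation objects in registration order, compared by identity. */
    private List<ToyImpl> uniqueImpls() {
        Set<ToyImpl> seen = Collections.newSetFromMap(new IdentityHashMap<ToyImpl, Boolean>());
        List<ToyImpl> unique = new ArrayList<>();
        for (ToyImpl impl : attributes.values()) {
            if (seen.add(impl)) {
                unique.add(impl);
            }
        }
        return unique;
    }

    /** Clones each distinct implementation object exactly once. */
    List<ToyImpl> captureState() {
        List<ToyImpl> state = new ArrayList<>();
        for (ToyImpl impl : uniqueImpls()) {
            state.add(impl.clone());
        }
        return state;
    }

    /** Copies captured values back into the live objects, position by position. */
    void restoreState(List<ToyImpl> state) {
        List<ToyImpl> live = uniqueImpls();
        for (int i = 0; i < state.size(); i++) {
            state.get(i).copyTo(live.get(i));
        }
    }
}

class CaptureDemo {
    public static void main(String[] args) {
        ToySource shared = new ToySource();
        ToyImpl tokenLike = new ToyImpl();
        shared.put("term", tokenLike);
        shared.put("offset", tokenLike);
        shared.put("type", tokenLike);
        System.out.println("clones per capture: " + shared.captureState().size());
    }
}
```

With one shared instance, each capture makes one clone no matter how many interfaces map to it. With a separate object per interface, the number of clones grows with the number of attributes.
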
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735114#action_12735114 ]

Michael Busch commented on LUCENE-1693:
---------------------------------------

Grant, are you still reviewing? I was going to commit this today... shall I wait?

[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735119#action_12735119 ]

Grant Ingersoll commented on LUCENE-1693:
-----------------------------------------

Go ahead, I'm satisfied.

|
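Editorial sketch: the backwards-compatibility trap from the issue description (an overridden next() that is silently never called) can be reproduced with a small stand-alone model. Everything here is hypothetical and simplified: `Stream`, `CoreStream`, `UserStream` and `consume` are illustrative names, and the bridge from next() to incrementToken() is reduced to one line rather than the reflection-based wrapper the patch describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Toy model of the override trap; not Lucene code. */
public class OverrideTrapSketch {

    static abstract class Stream {
        String current;                     // stands in for the term attribute
        abstract boolean incrementToken();  // new-style API
        String next() {                     // old-style API, bridged onto the new one
            return incrementToken() ? current : null;
        }
    }

    /** Non-final "core" class that already implements the new API. */
    static class CoreStream extends Stream {
        private final Iterator<String> it;
        CoreStream(String... terms) { it = Arrays.asList(terms).iterator(); }
        @Override boolean incrementToken() {
            if (!it.hasNext()) return false;
            current = it.next();
            return true;
        }
    }

    /** User subclass that customizes only the old API. */
    static class UserStream extends CoreStream {
        UserStream(String... terms) { super(terms); }
        @Override String next() {
            String t = super.next();
            return t == null ? null : t.toUpperCase();
        }
    }

    /** Consumer that, like the new indexing code in this model, calls only incrementToken(). */
    static List<String> consume(Stream s) {
        List<String> out = new ArrayList<String>();
        while (s.incrementToken()) out.add(s.current);
        return out;
    }

    public static void main(String[] args) {
        // The upper-casing in UserStream.next() never runs.
        System.out.println(consume(new UserStream("a", "b")));
    }
}
```

The code compiles and runs without any warning, yet the user's customization has no effect. Declaring `next()` as `final` in `CoreStream` would turn `UserStream` into a compile error, which is the trade-off the description argues for.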
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735123#action_12735123 ]

Michael Busch commented on LUCENE-1693:
---------------------------------------
Cool thanks for reviewing. I'll commit later this afternoon.
|
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735124#action_12735124 ]

Robert Muir commented on LUCENE-1693:
-------------------------------------
{quote}
I completely agree. We should discourage users to reference Token and rather use the Attribute interfaces. That's the whole beauty and flexibility about this new API.
{quote}
Has there been any thought into reconsidering the new API's "experimental" status then? I don't think the WARNING: encourages users to use these interfaces!
|
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735125#action_12735125 ]

Grant Ingersoll commented on LUCENE-1693:
-----------------------------------------
Actually, one thing I still don't get: What happens to the attributes that have traditionally been thrown away during indexing? ie offset, type? How would one add them into the index like other attributes? Or, for that matter, exclude them. I seem to recall there being a loop over attributes somewhere in the posting process, but I can no longer find that code.
|
|
|
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735124#action_12735124 ]

Robert Muir edited comment on LUCENE-1693 at 7/24/09 12:08 PM:
---------------------------------------------------------------
{quote}
I completely agree. We should discourage users to reference Token and rather use the Attribute interfaces. That's the whole beauty and flexibility about this new API.
{quote}
Has there been any thought into reconsidering the new API's "experimental" status then? I don't think the WARNING: encourages users to use these interfaces!

or maybe a compromise: maybe modify the javadocs to be a little less scary: does this text have to be FF0000 (red) ?

was (Author: rcmuir):
{quote}
I completely agree. We should discourage users to reference Token and rather use the Attribute interfaces. That's the whole beauty and flexibility about this new API.
{quote}
Has there been any thought into reconsidering the new API's "experimental" status then? I don't think the WARNING: encourages users to use these interfaces!
|
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735133#action_12735133 ]
Michael Busch commented on LUCENE-1693:
---------------------------------------
{quote}
I don't think the WARNING: encourages users to use these interfaces!
{quote}
Yeah we should improve that. Let me commit the patch as is and open a new issue for improving the warnings. I don't want to touch this patch anymore, it's so big. |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735134#action_12735134 ]
Michael Busch commented on LUCENE-1693:
---------------------------------------
{quote}
What happens to the attributes that have traditionally been thrown away during indexing? ie offset, type? How would one add them into the index like other attributes? Or, for that matter, exclude them.
{quote}
The default index format does not make use of some attributes, e.g. type, just as before. Flexible indexing will allow to customize the format; then you will be able to store whatever attribute you like in the index. |
|
|
[jira] Resolved: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Busch resolved LUCENE-1693.
-----------------------------------
Resolution: Fixed
Committed revision 797665. Thanks, Uwe, for all your hard work!! And thanks to everyone else who helped reviewing here. |
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735210#action_12735210 ]
Mark Miller commented on LUCENE-1693:
-------------------------------------
Not sure what issue it stems from, but Token has a bunch of constructors that are deprecated, but that don't point you to something new. |
|
|
[jira] Issue Comment Edited: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735210#action_12735210 ]

Mark Miller edited comment on LUCENE-1693 at 7/24/09 4:59 PM:
--------------------------------------------------------------

Not sure what issue it stems from, but Token has a bunch of constructors that are deprecated, but that don't point you to something new.

edit

must have come from the setBuffer stuff

was (Author: markrmiller@...):
Not sure what issue it stems from, but Token has a bunch of constructors that are deprecated, but that don't point you to something new.

> AttributeSource/TokenStream API improvements
> --------------------------------------------
>
> Key: LUCENE-1693
> URL: https://issues.apache.org/jira/browse/LUCENE-1693
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 2.9
>
> Attachments: lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, lucene-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch, PerfTest3.java, TestAPIBackwardsCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java, TestCompatibility.java
>
>
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - introduces interfaces for all Attributes. The corresponding
> implementations have the postfix 'Impl', e.g. TermAttribute and
> TermAttributeImpl. AttributeSource now has a factory for creating
> the Attribute instances; the default implementation looks for
> implementing classes with the postfix 'Impl'. Token now implements
> all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
> addAttributeImpl(AttributeImpl). Using reflection it walks up in the
> class hierarchy of the passed in object and finds all interfaces
> that the class or superclasses implement and that extend the
> Attribute interface. It then adds the interface->instance mappings
> to the attribute map for each of the found interfaces.
> - removes the set/getUseNewAPI() methods (including the standard
> ones). Instead it is now enough to only implement the new API;
> if an old TokenStream still implements the old API (next()/next(Token)),
> it is wrapped automatically. The delegation path is determined via
> reflection (the patch determines which of the three methods was
> overridden).
> - Token is no longer deprecated; instead it implements all 6 standard
> token interfaces (see above). The wrapper for next() and next(Token)
> uses this to automatically map all attribute interfaces to one
> TokenWrapper instance (implementing all 6 interfaces) that contains
> a Token instance. next() and next(Token) exchange the inner Token
> instance as needed. For the new incrementToken(), only one
> TokenWrapper instance is visible, delegating to the current reusable
> Token. This API also preserves custom Token subclasses that may be
> created by very special token streams (see example in Backwards-Test).
> - AttributeImpl now has a default implementation of toString that uses
> reflection to print out the values of the attributes in a default
> formatting. This makes it a bit easier to implement AttributeImpl,
> because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
> captureState. The method figures out which unique AttributeImpl
> instances are contained as values in the attributes map, because
> those are the ones that need to be cloned. It creates a singly
> linked list that supports deep cloning (in the inner class
> AttributeSource.State). AttributeSource keeps track of when this
> state changes, i.e. whenever new attributes are added to the
> AttributeSource. Only in that case will captureState recompute the
> state, otherwise it will simply clone the precomputed state and
> return the clone. restoreState(AttributeSource.State) walks the
> linked list and uses the copyTo() method of AttributeImpl to copy
> all values over into the attribute that the source stream
> (e.g. SinkTokenizer) uses.
> - Tee- and SinkTokenizer were deprecated, because they use
> Token instances for caching. This is not compatible with the new API
> using AttributeSource.State objects. You can still use the old
> deprecated ones, but new features provided by new Attribute types
> may get lost in the chain. A replacement is a new TeeSinkTokenFilter,
> which has a factory to create new Sink instances that have compatible
> attributes. Sink instances created by one Tee can also be added to
> another Tee, as long as the attribute implementations are compatible
> (it is not possible to add a sink from a tee using one Token instance
> to a tee using the six separate attribute impls). In this case UOE is thrown.
> The cloning performance can be greatly improved if a TokenStream
> does not use multiple AttributeImpl instances. A user can
> e.g. simply add a Token instance to the stream instead of the individual
> attributes. Or the user could implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, then the
> serialization should benefit even more significantly from the new API
> than cloning.
> This issue contains one backwards-compatibility break:
> TokenStreams/Filters/Tokenizers should normally be final
> (see LUCENE-1753 for the explanation). Some of these core classes are
> not final and so one could override the next() or next(Token) methods.
> In this case, the backwards-wrapper would automatically use
> incrementToken(), because it is implemented, so the overridden
> method is never called. To protect users from errors that are not visible
> during compilation or testing (the streams just behave wrongly),
> this patch makes all implementation methods final
> (next(), next(Token), incrementToken()), whenever the class
> itself is not final. This is a BW break, but users will clearly see
> that they have done something unsupported and should instead
> create a custom TokenFilter with their additional implementation
> (instead of extending a core implementation).
> For further changes to contrib token streams the following procedure should be used:
> * rewrite and replace next(Token)/next() implementations by new API
> * if the class is final, no next(Token)/next() methods needed (must be removed!!!)
> * if the class is non-final add the following methods to the class:
> {code:java}
> /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
>  * not be overridden. Delegates to the backwards compatibility layer. */
> public final Token next(final Token reusableToken) throws java.io.IOException {
>   return super.next(reusableToken);
> }
> /** @deprecated Will be removed in Lucene 3.0. This method is final, as it should
>  * not be overridden. Delegates to the backwards compatibility layer. */
> public final Token next() throws java.io.IOException {
>   return super.next();
> }
> {code}
> Also the incrementToken() method must be final in this case
> (and the new method end() of LUCENE-1448)
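The reflection walk that the description attributes to addAttributeImpl(AttributeImpl) can be sketched roughly as follows. This is a toy model and not the Lucene implementation: Attribute, TermAttr, OffsetAttr, TokenLike, CustomToken and TermOnly are hypothetical stand-ins, and the rule that an already registered interface keeps its first implementation is an assumption taken from the earlier comment in this thread that a later-added Token handles all attributes except the ones already present.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model, not Lucene's AttributeSource: all names here are hypothetical. */
final class AttributeMapSketch {

  /** Marker interface playing the role of Attribute. */
  interface Attribute {}
  interface TermAttr extends Attribute { String term(); }
  interface OffsetAttr extends Attribute { int startOffset(); }
  interface Unrelated {}

  /** One object implementing several attribute interfaces, like Token does. */
  static class TokenLike implements TermAttr, OffsetAttr, Unrelated {
    public String term() { return "foo"; }
    public int startOffset() { return 0; }
  }

  /** Subclass that declares no interfaces itself; they are found on the superclass. */
  static class CustomToken extends TokenLike {}

  /** Implementation of a single attribute interface. */
  static class TermOnly implements TermAttr {
    public String term() { return "bar"; }
  }

  private final Map<Class<? extends Attribute>, Attribute> attributes = new LinkedHashMap<>();

  /** Walks up the class hierarchy and maps every Attribute sub-interface to impl. */
  void addAttributeImpl(Attribute impl) {
    for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
      for (Class<?> iface : c.getInterfaces()) {
        if (iface != Attribute.class && Attribute.class.isAssignableFrom(iface)) {
          // Assumption: an interface that is already mapped keeps its first impl.
          attributes.putIfAbsent(iface.asSubclass(Attribute.class), impl);
        }
      }
    }
  }

  /** Returns the instance registered for the interface, or null if none. */
  <A extends Attribute> A getAttribute(Class<A> type) {
    return type.cast(attributes.get(type));
  }

  /** Number of interface->instance mappings. */
  int size() {
    return attributes.size();
  }
}
```

Adding a CustomToken to an empty map registers it for both TermAttr and OffsetAttr and skips Unrelated. Adding a TermOnly first and a TokenLike afterwards leaves TermAttr pointing at the TermOnly instance, which is why consumers should keep referencing attributes through their interfaces.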
|
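The captureState/restoreState mechanism from the issue description (a cached singly linked list of the unique attribute instances, cloned on capture and copied back with copyTo()) might look roughly like this. It is a simplified sketch and not the AttributeSource code: Impl, TermImpl, State and addImpl are hypothetical names, and restoreState here matches attributes by position, which is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model, not Lucene code: all names here are hypothetical. */
final class CaptureStateSketch {

  /** Stand-in for AttributeImpl: cloneable and able to copy its values to a target. */
  abstract static class Impl implements Cloneable {
    abstract void copyTo(Impl target);

    @Override public Impl clone() {
      try {
        return (Impl) super.clone();
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e);
      }
    }
  }

  static final class TermImpl extends Impl {
    String term = "";
    @Override void copyTo(Impl target) { ((TermImpl) target).term = term; }
  }

  /** Singly linked list node holding one attribute instance. */
  static final class State {
    Impl attribute;
    State next;

    State deepClone() {
      State copy = new State();
      copy.attribute = attribute.clone();
      if (next != null) copy.next = next.deepClone();
      return copy;
    }
  }

  private final List<Impl> impls = new ArrayList<>(); // unique instances only
  private State cached;                               // rebuilt only after addImpl

  void addImpl(Impl impl) {
    impls.add(impl);
    cached = null; // the set of attributes changed, so the list must be recomputed
  }

  /** Clones the precomputed list; returns null if there are no attributes. */
  State captureState() {
    if (impls.isEmpty()) return null;
    if (cached == null) {
      State head = null, tail = null;
      for (Impl impl : impls) {
        State node = new State();
        node.attribute = impl;
        if (head == null) { head = node; } else { tail.next = node; }
        tail = node;
      }
      cached = head;
    }
    return cached.deepClone();
  }

  /** Copies captured values back; assumes the same attributes in the same order. */
  void restoreState(State state) {
    Iterator<Impl> targets = impls.iterator();
    for (State s = state; s != null; s = s.next) {
      s.attribute.copyTo(targets.next());
    }
  }
}
```

With a single instance implementing all interfaces the list has one node, so each capture costs one clone. That is the cloning benefit the description mentions for adding one Token instead of six separate attribute impls.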
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735215#action_12735215 ]

Michael Busch commented on LUCENE-1693:
---------------------------------------

{quote}
Not sure what issue it stems from, but Token has a bunch of constructors that are deprecated, but that don't point you to something new.
{quote}

This is already the case in Lucene 2.4; unrelated to this issue.
|
|
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1693:
----------------------------------

Attachment: LUCENE-1693-TokenizerAttrFactory.patch

This is a small improvement, related to Grant's comments:

The TokenStream ctor can take an AttributeFactory, so you can create a subclass of TokenStream that uses a specific AttributeFactory (e.g. using Token instances). Filters do not need this (as they use the factory of the input stream). The factory must therefore be set on the root stream. This is normally a subclass of Tokenizer.

The problem: Tokenizer does not have ctors that take an AttributeFactory, so you are not able to create any Tokenizer using a custom factory, e.g. for using Token as impl.

I will commit this patch shortly.
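Why the factory has to be set on the root stream can be shown with a small model of a tokenizer/filter chain. This is a sketch only: Source is a hypothetical stand-in for AttributeSource, attributes are keyed by name rather than by interface class, and java.util.function.Function plays the role of the AttributeFactory. None of this is the API added by the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model, not Lucene code: all names here are hypothetical. */
final class FactoryChainSketch {

  /** Stand-in for AttributeSource; attributes are keyed by name for brevity. */
  static class Source {
    private final Function<String, Object> factory;
    private final Map<String, Object> attributes;

    /** Root stream (the Tokenizer role): the only place a factory can be chosen. */
    Source(Function<String, Object> factory) {
      this.factory = factory;
      this.attributes = new HashMap<>();
    }

    /** Filter: shares the factory and the attribute map of its input. */
    Source(Source input) {
      this.factory = input.factory;
      this.attributes = input.attributes;
    }

    /** Returns the existing instance, or asks the factory to create one. */
    Object addAttribute(String name) {
      return attributes.computeIfAbsent(name, factory);
    }
  }
}
```

If the root is built with a factory that always hands out the same token object, every filter stacked on top receives that object for every attribute it adds. A root class without a factory-accepting constructor leaves no way to make that choice, which is the gap the attached patch addresses for Tokenizer.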
|
|
[jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements

[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735253#action_12735253 ]

Uwe Schindler commented on LUCENE-1693:
---------------------------------------

Committed revision: 797727