|
View:
New views
20 Messages
—
Rating Filter:
Alert me
|
| < Prev | 1 - 2 - 3 | Next > |
|
|
[jira] Created: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctorsMassive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
------------------------------------------------------------------------- Key: LUCENE-2034 URL: https://issues.apache.org/jira/browse/LUCENE-2034 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9 Reporter: Simon Willnauer Priority: Minor Fix For: 3.0 Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2034: ------------------------------------ Attachment: LUCENE-2034.patch Attached a patch with a base class that solves the code duplication issue and simplifies stopword loading for contrib analyzers. The patch contains 3 analyzers which use this new base class > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773723#action_12773723 ] Simon Willnauer commented on LUCENE-2034: ----------------------------------------- btw. if somebody comes up with a better name for the analyzer speak up! @robert: no super, fast or smart please :) simon > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2034: ------------------------------------ Attachment: LUCENE-2034.patch I have removed a sys.out I had accidentally in there. The new patch contains more converted analyzers but I guess we should first get this in before we start converting all analyzers in contrib. Pretty neat though stuff like ChineseAnalyzer only has one method in it after inheriting BaseAnalyzer > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2034: ------------------------------------ Attachment: LUCENE-2034.patch I changed the name from BaseAnalyzer to AbstractContribAnalyzer that might do a better job than BaseAnalyzer. > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774006#action_12774006 ] Uwe Schindler commented on LUCENE-2034: --------------------------------------- An then we also have an AbstractCoreAnalyzer? weird... I want to bring this to core, too. > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774016#action_12774016 ] Simon Willnauer commented on LUCENE-2034: ----------------------------------------- bq. An then we also have an AbstractCoreAnalyzer? weird... I want to bring this to core, too. I understand! Lets get this into contrib with either one name. Once we move this up in core it will be called Analyzer anyway so we can refactor it in contrib easily. The name AbstractContribAnalyzer would than again be ok as it would only contain the stopword convenience. > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774029#action_12774029 ] Uwe Schindler commented on LUCENE-2034: --------------------------------------- Even in core it will be a separate class, because it makes tokenStream() and reusableTokenStream() final, so users want to create an old style Analyzer cannot do this. So we need a good name even for core. > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774032#action_12774032 ] Simon Willnauer commented on LUCENE-2034: ----------------------------------------- I would be happy to get a better name for it - any suggestions - I'm having a hard time to find one. its your turn uwe :) > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2034: ------------------------------------ Attachment: LUCENE-2034.patch I externalized the stopword related code into another base analyzer and named the analyzer AbstractAnalyzer and StopawareAnalyzer. > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774816#action_12774816 ] Robert Muir commented on LUCENE-2034: ------------------------------------- Simon, i started looking at this, the testStemExclusionTable( for BrazilianAnalyzer is actually not related to stopwords and should not be changed. BrazilianAnalyzer has a .setStemExclusionTable() method that allows you to supply a set of words that should not be stemmed. This test is to ensure that if you change the stem exclusion table with this method, that reusableTokenStream will force the creation of a new BrazilianStemFilter with this modified exclusion table so that it will take effect immediately, the way it did with .tokenStream() before this analyzer supported reusableTokenStream() > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Issue Comment Edited: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774816#action_12774816 ] Robert Muir edited comment on LUCENE-2034 at 11/8/09 8:29 PM: -------------------------------------------------------------- Simon, i started looking at this, the testStemExclusionTable( for BrazilianAnalyzer is actually not related to stopwords and should not be changed. BrazilianAnalyzer has a .setStemExclusionTable() method that allows you to supply a set of words that should not be stemmed. This test is to ensure that if you change the stem exclusion table with this method, that reusableTokenStream will force the creation of a new BrazilianStemFilter with this modified exclusion table so that it will take effect immediately, the way it did with .tokenStream() before this analyzer supported reusableTokenStream() <edit, addition> also, i think this setStemExclusionTable stuff is really unrelated to your patch, but a reuse challenge in at least this analyzer. one way to solve it would be to: * add .setStemExclusionTable to BrazilianStemFilter so it can be changed without creating a new instance. * in Brazilian Analyzer's createComponents(), cache the BrazilianStemFilter and change .setStemExclusionTable() to pass along the new value to that. was (Author: rcmuir): Simon, i started looking at this, the testStemExclusionTable( for BrazilianAnalyzer is actually not related to stopwords and should not be changed. BrazilianAnalyzer has a .setStemExclusionTable() method that allows you to supply a set of words that should not be stemmed. This test is to ensure that if you change the stem exclusion table with this method, that reusableTokenStream will force the creation of a new BrazilianStemFilter with this modified exclusion table so that it will take effect immediately, the way it did with .tokenStream() before this analyzer supported reusableTokenStream() > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774940#action_12774940 ] Simon Willnauer commented on LUCENE-2034: ----------------------------------------- bq. the testStemExclusionTable( for BrazilianAnalyzer is actually not related to stopwords and should not be changed. I agree, I missed to extend the testcase instead I changed it to test the constructor only. I will extend it instead. This testcase is actually a duplicate of testExclusionTableReuse(), it should test tokenStream instead of reusableTokenStream() - will fix this too. bq. This test is to ensure that if you change the stem exclusion table with this method, that reusableTokenStream will force the creation of a new BrazilianStemFilter with this modified exclusion table so that it will take effect immediately, the way it did with .tokenStream() before this analyzer supported reusableTokenStream() that is actually what testExclusionTableReuse() does. bq. also, i think this setStemExclusionTable stuff is really unrelated to your patch, but a reuse challenge in at least this analyzer. one way to solve it would be to... I agree with your first point that this is kind of unrelated. I guess we should to that in a different issue while I think it is not that much of a deal as it does not change any functionality though. I disagree with the reuse challenge, in my opinion analyzers should be immutable thats why I deprecated those methods and added the set to the constructor. The problem with those setters is that you have to be in the same thread to change your set as this will only invalidate the cached version of a token stream hold in a ThreadLocal. The implementation is ambiguous and should go away. The analyzer itself can be shared but the behaviour is kind of unpredictable if you reset the set. If there is an instance of this analyzer around and you call the setter you would expect the analyzer to use the set from the very moment on you call the setter which is not always true. > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2034: ------------------------------------ Attachment: LUCENE-2034.txt Updated patch to current trunk (the massive patch with @Override broke the patch) - Moved AbstractAnalyzer to core - updated final Analyzers in core to use the AbstractAnalyzer - fixed TestBrazilianStemmer.java > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775011#action_12775011 ] Robert Muir commented on LUCENE-2034: ------------------------------------- simon, good solution. I agree we should deprecate these analyzer 'setter' methods, which just make things complicated for no good reason. i wonder if we should consider a different name for this AbstractAnalyzer, since it exists to support/encourage tokenstream reuse. I think when Shai Erera brought the idea up before he proposed ReusableAnalyzer or something like that? > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775014#action_12775014 ] Simon Willnauer commented on LUCENE-2034: ----------------------------------------- bq. i wonder if we should consider a different name for this AbstractAnalyzer, since it exists to support/encourage tokenstream reuse. I think when Shai Erera brought the idea up before he proposed ReusableAnalyzer or something like that? I agree that this is not a very good name as we all discussed during apacheCon. With ReusableAnalyzer I would guess people would expect this analyzer to be reusable which isn't the case or rather is not what this is analyzer is doing. What if we call it ComponentAnalyzer or NewStyleAnalyzer or SmartAnalyzer (ok just kidding) simon > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2034: ------------------------------------ Fix Version/s: (was: 3.0) 3.1 Pushed this to 3.1. We might want to change existing analyzers in core to use the new analyzer too and make then final. So we are better off with pushing it to 3.1 > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778363#action_12778363 ] Robert Muir commented on LUCENE-2034: ------------------------------------- Simon, here {quote} source.reset(reader); if(sink != source) sink.reset(); // only reset if the sink reference is different from source {quote} we had a discussion on the mailing list about this: http://www.lucidimagination.com/search/document/cd8a94ebc8a4ea99/bug_in_standardanalyzer_stopanalyzer I think we should consider removing the if, and unconditionally call sink.reset(). A bad consumer might not follow the rules, although it says in TokenStream javadoc that consumers should call reset().. > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2034: ------------------------------------ Attachment: LUCENE-2034,patch Updated the patch to the current trunk. I have not removed all the deprecated methods in contrib/analyzers yet - we should open another issue for that IMO. Yet this patch still brakes back compatibility as some of the none final contrib analyzers extend StopawareAnalyzer with makes the old tokenstream / reusableTokenstream methods final. IMO this should not block this issues for the following reasons: 1. its in contrib - different story for core 2. it is super easy to port them 3. it make the API cleaner and has less code 4. those analyzers might have to change anyway due to the deprecated methods simon > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2034,patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
|
|
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2034: ------------------------------------ Attachment: LUCENE-2034,patch set svn EOF property to native - missed that in the last patch > Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors > ------------------------------------------------------------------------- > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers > Affects Versions: 2.9 > Reporter: Simon Willnauer > Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt > > > Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary. > In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@... For additional commands, e-mail: java-dev-help@... |
| < Prev | 1 - 2 - 3 | Next > |
| Free embeddable forum powered by Nabble | Forum Help |