[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627990#action_12627990 ]

Grant Ingersoll commented on LUCENE-1373:
-----------------------------------------

I think you should mirror what is done in StandardAnalyzer. You could probably create an abstract class that all of them inherit from, so the common code is shared.

Of course, it's still a bit weird, because in your case the type value is going to be set to ACRONYM when your example is clearly not an acronym. This suggests to me that the grammar needs to be revisited, but I believe that can wait until 3.0.
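
To make the abstract-class suggestion concrete, here is a minimal, dependency-free sketch. It deliberately does not use any Lucene types: `AcronymFixSupport`, `FrenchSketch`, and `GermanSketch` are hypothetical names, and the `replaceInvalidAcronym` switch is only modeled on the opt-in fix discussed for StandardAnalyzer in LUCENE-1068. Treat it as an illustration of where the shared code could live, not as the eventual patch.

```java
// Hypothetical base class (not a Lucene type): it owns the single switch
// that every language-specific analyzer would otherwise have to duplicate.
abstract class AcronymFixSupport {
    // Assumption: default stays false so existing behaviour is unchanged.
    private boolean replaceInvalidAcronym = false;

    public boolean isReplaceInvalidAcronym() {
        return replaceInvalidAcronym;
    }

    public void setReplaceInvalidAcronym(boolean replaceInvalidAcronym) {
        this.replaceInvalidAcronym = replaceInvalidAcronym;
    }

    // Each subclass only supplies what is specific to its language.
    protected abstract String language();

    // Shared code path: in a real analyzer this is where the switch would be
    // handed to the tokenizer; here it is just reported for illustration.
    public String describe() {
        return language() + "[replaceInvalidAcronym=" + replaceInvalidAcronym + "]";
    }
}

// Hypothetical stand-ins for two of the contributed analyzers.
final class FrenchSketch extends AcronymFixSupport {
    @Override
    protected String language() {
        return "French";
    }
}

final class GermanSketch extends AcronymFixSupport {
    @Override
    protected String language() {
        return "German";
    }
}
```

With this shape, a caller opts in once per analyzer instance, for example `new FrenchSketch().setReplaceInvalidAcronym(true)`, and no subclass has to repeat the getter, setter, or default.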

> Most of the contributed Analyzers suffer from invalid recognition of acronyms.
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-1373
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1373
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis, contrib/analyzers
>    Affects Versions: 2.3.2
>            Reporter: Mark Lassau
>            Priority: Minor
>
> LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org." would be incorrectly tokenized as an acronym (note the dot at the end).
> Unfortunately, keeping the "backward compatibility" of a bug turns out to harm us.
> StandardTokenizer has a couple of ways to indicate "fix this bug", but unfortunately the default behaviour is still to be buggy.
> Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer, and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :(
> I refer to:
> * BrazilianAnalyzer
> * CzechAnalyzer
> * DutchAnalyzer
> * FrenchAnalyzer
> * GermanAnalyzer
> * GreekAnalyzer
> * ThaiAnalyzer
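
For readers unfamiliar with LUCENE-1068, the following self-contained sketch approximates the misclassification with plain regular expressions. It is not the StandardTokenizer grammar: the three patterns, the `AcronymSketch` class, and the `classify` method are simplified assumptions chosen only to show why a host name with a trailing dot can end up typed as ACRONYM, and how an opt-in switch can relabel it.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; the patterns only approximate the real grammar.
final class AcronymSketch {
    // Strict acronym: single letters, each followed by a dot (e.g. "U.S.A.").
    private static final Pattern STRICT_ACRONYM =
            Pattern.compile("(?:[A-Za-z]\\.){2,}");

    // Loose acronym: alphanumeric runs, each followed by a dot.
    // This also matches "www.apache.org." which is the source of the bug.
    private static final Pattern LOOSE_ACRONYM =
            Pattern.compile("(?:[A-Za-z0-9]+\\.){2,}");

    // Host: alphanumeric runs separated by dots, with no trailing dot.
    private static final Pattern HOST =
            Pattern.compile("[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)+");

    private AcronymSketch() {
    }

    // Returns a token type name. When replaceInvalidAcronym is false, the
    // loose match is still reported as ACRONYM (the backward-compatible,
    // buggy behaviour); when true, it is relabeled as HOST.
    static String classify(String text, boolean replaceInvalidAcronym) {
        if (STRICT_ACRONYM.matcher(text).matches()) {
            return "ACRONYM";
        }
        if (LOOSE_ACRONYM.matcher(text).matches()) {
            return replaceInvalidAcronym ? "HOST" : "ACRONYM";
        }
        if (HOST.matcher(text).matches()) {
            return "HOST";
        }
        return "ALPHANUM";
    }
}
```

In this sketch, `classify("www.apache.org.", false)` yields `ACRONYM` while `classify("www.apache.org.", true)` yields `HOST`; a genuine acronym such as `"U.S.A."` is typed `ACRONYM` either way. The analyzers listed above offer no way to pass the equivalent of `true`, which is the gap this issue reports.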

