Re: [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by Mark Lassau

Grant Ingersoll (JIRA) wrote:
> Of course, it's still a bit weird, b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one.  This suggests to me that the grammar needs to be revisited, but that can wait until 3.0 I believe.
>
>  
Grant, I'm not sure what you mean by "b/c in your case the type value is
going to be set to ACRONYM, when your example is clearly not one."
Once we set replaceInvalidAcronym=true, the type is set to HOST.

However, if you do revisit the grammar, I would be interested in joining
the discussion on the behaviour of <HOST>.
For instance, if you have a document like "visit www.apache.org", you
currently won't get a hit if you search for "apache", because the whole
host name is indexed as a single <HOST> token.
Similarly, in an issue tracker like JIRA, we want to be able to search for
"NullPointerException" and get a hit for the document "Application
threw java.lang.NullPointerException".
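
One possible way to get such hits, sketched here purely as an illustration, is to index the dot-separated parts of a <HOST> token alongside the full token. The class HostTokenExpander and its expand method are hypothetical names, not part of Lucene, and the sketch works on plain strings rather than on Lucene's token stream. It also does no lowercasing, which a real analyzer chain would normally add.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: expands a dotted HOST-style token. */
final class HostTokenExpander {

    private HostTokenExpander() {}

    /**
     * Returns the original token followed by its non-empty dot-separated parts.
     * A token without a dot is returned on its own.
     */
    static List<String> expand(String token) {
        List<String> terms = new ArrayList<>();
        terms.add(token); // keep the full token so exact host searches still match
        if (token.indexOf('.') < 0) {
            return terms;
        }
        for (String part : token.split("\\.")) {
            if (!part.isEmpty()) {
                terms.add(part); // each label becomes searchable by itself
            }
        }
        return terms;
    }
}
```

Under these assumptions, expand("www.apache.org") yields www.apache.org, www, apache and org, and expand("java.lang.NullPointerException") includes NullPointerException as a separate term. The cost is a larger index and more matches for common labels such as "www".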

Also note that the current implementation has problems if the document
is missing expected whitespace.
For example, "I like Apache.They rock" gets tokenized as follows:
I            <ALPHANUM>
like         <ALPHANUM>
Apache.They  <HOST>
rock         <ALPHANUM>
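
As a rough illustration of one possible heuristic (an assumption for discussion, not how StandardTokenizer works), a dotted token could be kept as a host only when its last label is a known domain suffix, and split into words otherwise. The class DottedTokenClassifier, its methods and the three-entry suffix list are all hypothetical; a usable list would be much longer and probably configurable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch, not part of Lucene: decides whether a dotted token is a host. */
final class DottedTokenClassifier {

    // Illustrative allow-list only; deliberately tiny.
    private static final Set<String> KNOWN_SUFFIXES =
            new HashSet<>(Arrays.asList("org", "com", "net"));

    private DottedTokenClassifier() {}

    /** True when the text after the last dot is one of the known suffixes. */
    static boolean looksLikeHost(String token) {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0 || lastDot == token.length() - 1) {
            return false; // no dot, or nothing after the final dot
        }
        String suffix = token.substring(lastDot + 1).toLowerCase(Locale.ROOT);
        return KNOWN_SUFFIXES.contains(suffix);
    }

    /** Keeps host-like tokens whole; splits everything else on dots. */
    static List<String> tokenize(String token) {
        if (looksLikeHost(token)) {
            return Collections.singletonList(token);
        }
        List<String> words = new ArrayList<>();
        for (String part : token.split("\\.")) {
            if (!part.isEmpty()) {
                words.add(part);
            }
        }
        return words;
    }
}
```

With this sketch, tokenize("Apache.They") returns Apache and They as two words, while tokenize("www.apache.org") keeps the host as one token. Note that it would also split "java.lang.NullPointerException" into its parts, which may or may not be wanted depending on the application.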

I don't think there is a simple one-size-fits-all answer to how this
should behave. It depends on the context of the application that is using
Lucene. The best answer may be to make some of the behaviour configurable,
or to provide a suite of specific analyzers.

Mark.

>> Most of the contributed Analyzers suffer from invalid recognition of acronyms.
>> ------------------------------------------------------------------------------
>>
>>                 Key: LUCENE-1373
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1373
>>             Project: Lucene - Java
>>          Issue Type: Bug
>>          Components: Analysis, contrib/analyzers
>>    Affects Versions: 2.3.2
>>            Reporter: Mark Lassau
>>            Priority: Minor
>>
>> LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org." would be incorrectly tokenized as an acronym (note the dot at the end).
>> Unfortunately, keeping the "backward compatibility" of a bug turns out to harm us.
>> StandardTokenizer has a couple of ways to indicate "fix this bug", but unfortunately the default behaviour is still to be buggy.
>> Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer, and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :(
>> I refer to:
>> * BrazilianAnalyzer
>> * CzechAnalyzer
>> * DutchAnalyzer
>> * FrenchAnalyzer
>> * GermanAnalyzer
>> * GreekAnalyzer
>> * ThaiAnalyzer