[jira] Created: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

[jira] Created: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org

Most of the contributed Analyzers suffer from invalid recognition of acronyms.
------------------------------------------------------------------------------

                 Key: LUCENE-1373
                 URL: https://issues.apache.org/jira/browse/LUCENE-1373
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis, contrib/analyzers
    Affects Versions: 2.3.2
            Reporter: Mark Lassau
            Priority: Minor


LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org." would be incorrectly tokenized as an acronym (note the dot at the end).
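
To make the failure mode concrete, here is a minimal sketch in plain Java that imitates the token rules involved using regular expressions. It is not Lucene's JFlex grammar: the class AcronymSketch, the three patterns and the classify method are all assumptions made for illustration, and dropping the final dot on the corrected path is my assumption about what "non-buggy" means here.

```java
import java.util.regex.Pattern;

class AcronymSketch {
    // Assumed regex approximations of the token rules; NOT the real JFlex grammar.
    // A genuine acronym: single letters, each followed by a dot, e.g. "I.B.M."
    private static final Pattern TRUE_ACRONYM = Pattern.compile("(\\p{Alpha}\\.){2,}");
    // Dotted parts of any length WITH a trailing dot, e.g. "www.apache.org."
    private static final Pattern DOTTED_WITH_TRAILING_DOT =
            Pattern.compile("\\p{Alnum}+\\.(\\p{Alnum}+\\.)+");
    // Dotted parts of any length with no trailing dot, e.g. "www.apache.org"
    private static final Pattern DOTTED = Pattern.compile("\\p{Alnum}+(\\.\\p{Alnum}+)+");

    /** Returns "text <TYPE>" for a single whitespace-delimited token. */
    static String classify(String token, boolean replaceInvalidAcronym) {
        if (TRUE_ACRONYM.matcher(token).matches()) {
            return token + " <ACRONYM>";
        }
        if (DOTTED_WITH_TRAILING_DOT.matcher(token).matches()) {
            if (replaceInvalidAcronym) {
                // Assumption: the corrected path retypes the token as HOST
                // and drops the final dot.
                return token.substring(0, token.length() - 1) + " <HOST>";
            }
            // Legacy behaviour: the host-like token is mislabelled as an acronym.
            return token + " <ACRONYM>";
        }
        if (DOTTED.matcher(token).matches()) {
            return token + " <HOST>";
        }
        return token + " <ALPHANUM>";
    }

    public static void main(String[] args) {
        System.out.println(classify("I.B.M.", false));
        System.out.println(classify("www.apache.org.", false));
        System.out.println(classify("www.apache.org.", true));
    }
}
```

With the flag off, the sentence-final "www.apache.org." gets the same type as "I.B.M."; with the flag on, it is typed as a host.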

Unfortunately, keeping the "backward compatibility" of a bug turns out to harm us.
StandardTokenizer has a couple of ways to indicate "fix this bug", but unfortunately the default behaviour is still to be buggy.
Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer, and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :(

I refer to:
* BrazilianAnalyzer
* CzechAnalyzer
* DutchAnalyzer
* FrenchAnalyzer
* GermanAnalyzer
* GreekAnalyzer
* ThaiAnalyzer

I would be willing to contribute a patch to make these Analyzers work in the next point release.

I see two ways to do this:
1) Introduce a static method to StandardTokenizerImpl, whereby you could set the "default" value of the replaceInvalidAcronym flag.
    One could then call setDefaultForReplaceInvalidAcronym(true) once from your code, and then whenever anyone uses the old constructor, it would set replaceInvalidAcronym=true
2) Add the replaceInvalidAcronym flag to all of the above Analyzers.
    Some of these have multiple constructors already, so I would probably just add a setter/getter to them.

The question is, which of the above would be preferred?
Personally, I think the first is the least amount of work to do, and also the easiest to back out when you move on to v3.x and the "deprecated" behaviour is removed.
However, doing 2) means the least disruption to core code.
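
For discussion, option 1 could look roughly like the sketch below. This is plain Java, not a patch against Lucene: TokenizerSketch is a hypothetical stand-in for the tokenizer, and only the method name setDefaultForReplaceInvalidAcronym and the flag name replaceInvalidAcronym come from the proposal above.

```java
class TokenizerSketch {
    // Process-wide default consulted by the legacy constructor (hypothetical field).
    private static volatile boolean defaultReplaceInvalidAcronym = false;

    /** Call once at application start-up to change what the old constructor does. */
    static void setDefaultForReplaceInvalidAcronym(boolean value) {
        defaultReplaceInvalidAcronym = value;
    }

    private final boolean replaceInvalidAcronym;

    /** Legacy constructor: picks up the static default at construction time. */
    TokenizerSketch() {
        this(defaultReplaceInvalidAcronym);
    }

    /** Explicit constructor: the caller decides and the static default is ignored. */
    TokenizerSketch(boolean replaceInvalidAcronym) {
        this.replaceInvalidAcronym = replaceInvalidAcronym;
    }

    boolean isReplaceInvalidAcronym() {
        return replaceInvalidAcronym;
    }

    public static void main(String[] args) {
        TokenizerSketch before = new TokenizerSketch();
        setDefaultForReplaceInvalidAcronym(true);
        TokenizerSketch after = new TokenizerSketch();
        System.out.println(before.isReplaceInvalidAcronym() + " " + after.isReplaceInvalidAcronym());
    }
}
```

One consequence visible even in this toy version: a tokenizer constructed before the static call keeps the value it was built with, so any code that caches or reuses tokenizer instances would need attention.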

Also, judging by the "Fix Version/s" field above, I am guessing that a v2.3.3 release is planned, so I guess I should provide a patch for the 2.3 branch as well as for trunk, which will end up as 2.4?


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...


[jira] Updated: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Lassau updated LUCENE-1373:
--------------------------------

    Description:
LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org." would be incorrectly tokenized as an acronym (note the dot at the end).

Unfortunately, keeping the "backward compatibility" of a bug turns out to harm us.
StandardTokenizer has a couple of ways to indicate "fix this bug", but unfortunately the default behaviour is still to be buggy.
Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer, and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :(

I refer to:
* BrazilianAnalyzer
* CzechAnalyzer
* DutchAnalyzer
* FrenchAnalyzer
* GermanAnalyzer
* GreekAnalyzer
* ThaiAnalyzer







[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627873#action_12627873 ]

Mark Lassau commented on LUCENE-1373:
-------------------------------------

I would be willing to contribute a patch to make these Analyzers work in the next point release.

I see two ways to do this:
1) Introduce a static method to StandardTokenizerImpl, whereby you could set the "default" value of the replaceInvalidAcronym flag.
One could then call setDefaultForReplaceInvalidAcronym(true) one time from your code, and then whenever anyone uses the old Constructor, it would set replaceInvalidAcronym=true
2) Add the replaceInvalidAcronym flag to all of the above Analyzers.
Some of these have multiple constructors already, so I would probably just add a setter/getter to them.

The question is, which of the above would be preferred?
Personally, I think the first is the least amount of work to do, and also the easiest to back out when you move on to v3.x and the "deprecated" behaviour is removed.
However, doing 2) means the least disruption to core code.

Also, judging by the "Fix Version/s" field above, I am guessing that a v2.3.3 release is planned, so I guess I should provide a patch for the 2.3 branch as well as for trunk, which will end up as 2.4?



[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627875#action_12627875 ]

Mark Lassau commented on LUCENE-1373:
-------------------------------------

Causes JIRA issue [JRA-15484|http://jira.atlassian.com/browse/JRA-15484].



[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627903#action_12627903 ]

Mark Lassau commented on LUCENE-1373:
-------------------------------------

Had a closer look at the code, including changes in {{StandardAnalyzer}}.
The static default idea would need a reworking of {{StandardAnalyzer.reusableTokenStream()}}, and so I think it is safer to just add the {{replaceInvalidAcronym}} flag to the affected Analyzers.




[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627990#action_12627990 ]

Grant Ingersoll commented on LUCENE-1373:
-----------------------------------------

I think you should mirror what is done in StandardAnalyzer.  You probably could create an abstract class that all of them inherit to share the common code.

Of course, it's still a bit weird, b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one.  This suggests to me that the grammar needs to be revisited, but that can wait until 3.0 I believe.
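
The shared-base-class idea could be sketched as follows. This is plain Java with hypothetical names (AcronymAwareAnalyzerSketch, FrenchSketch, DutchSketch); it deliberately leaves out the Lucene Analyzer types and only shows where the replaceInvalidAcronym flag and its getter/setter would live so that each language analyzer does not repeat them.

```java
abstract class AcronymAwareAnalyzerSketch {
    // Single home for the flag; subclasses inherit the accessors.
    private boolean replaceInvalidAcronym;

    protected AcronymAwareAnalyzerSketch(boolean replaceInvalidAcronym) {
        this.replaceInvalidAcronym = replaceInvalidAcronym;
    }

    boolean isReplaceInvalidAcronym() {
        return replaceInvalidAcronym;
    }

    void setReplaceInvalidAcronym(boolean replaceInvalidAcronym) {
        this.replaceInvalidAcronym = replaceInvalidAcronym;
    }

    /** Language name; present only so the sketch has something to report. */
    abstract String language();

    String describe() {
        return language() + " analyzer, replaceInvalidAcronym=" + replaceInvalidAcronym;
    }
}

class FrenchSketch extends AcronymAwareAnalyzerSketch {
    FrenchSketch() {
        super(false); // keep the legacy default unless the caller opts in
    }

    @Override
    String language() {
        return "French";
    }
}

class DutchSketch extends AcronymAwareAnalyzerSketch {
    DutchSketch() {
        super(false);
    }

    @Override
    String language() {
        return "Dutch";
    }
}
```

A caller would opt in per analyzer, for example by calling setReplaceInvalidAcronym(true) on a FrenchSketch before using it.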



Re: [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by Mark Lassau

Grant Ingersoll (JIRA) wrote:
> Of course, it's still a bit weird, b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one.  This suggests to me that the grammar needs to be revisited, but that can wait until 3.0 I believe.
>
>  
Grant, not sure what you mean by "b/c in your case the type value is
going to be set to ACRONYM, when your example is clearly not one."
Once we set replaceInvalidAcronym=true, then the type is set to HOST.

However, if you were to revisit the grammar, then I would be interested
to get in on the discussion on the behaviour of <HOST>.
For instance, if you have a document like "visit www.apache.org", you
currently won't get a hit if you search for "apache".
In an issue tracker like JIRA, we want to be able to search for
"NullPointerException", and get a hit for the document "Application
threw java.lang.NullPointerException".

Also note that the current implementation has problems if the document
doesn't contain expected whitespace.
eg "I like Apache.They rock"
Will get tokenized to the following:
I            <ALPHANUM>
like         <ALPHANUM>
Apache.They  <HOST>
rock         <ALPHANUM>

I don't think there is a simple one-size-fits-all answer to how this
should behave. It depends on the context of the app that is using Lucene.
The best answer may be to make some of the behaviour configurable, or
have a suite of specific analyzers?

Mark.



Re: [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by Shai Erera

I think we should distinguish between what is a bug and what is an attempt by the tokenizer to produce a meaningful token. When the tokenizer outputs a HOST or ACRONYM token type, nothing prevents you from putting a filter after the tokenizer that uses a UIMA Annotator (for example) to verify that the output token type is indeed correct.

For example, in the case of java.lang.NullPointerException we all understand it's not a HOST, but unfortunately our logic hasn't been translated well into computer instructions, yet :-). However you treat this token now is up to you:

- If you want to be able to search for the individual parts of the host, but still find the full host, I'd put a TokenFilter after the tokenizer that breaks the HOST into its parts and returns the parts along with the full host name. During query time I'd then remove that filter (i.e. create an Analyzer w/o that filter) and thus I'd be able to search for either "apache" or "www.apache.org".

- If you want to actually verify the output HOST is indeed a host, again, put a TokenFilter after the tokenizer and either apply your own simple heuristics (for example if there's a ".com", ".org", ".net" it's a HOST, otherwise it's not - I know these don't cover all HOST types, it's just an example), or validate that with an external tool, like a UIMA Annotator.

- You can also decide that a two-part HOST is not really a host; that way you solve the "I like Apache.They rock" problem, but miss a whole handful of hosts like "ibm.com", "apache.org", "google.com".
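
The first two options above can be sketched in plain Java as below. In Lucene this logic would sit in a TokenFilter placed after the tokenizer; the sketch leaves the Lucene types out entirely, and the class HostTokenSketch, its method names and the three-entry suffix list are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class HostTokenSketch {
    // Illustrative suffix list only; it does not cover all host types.
    private static final Set<String> KNOWN_SUFFIXES =
            new HashSet<>(Arrays.asList("com", "org", "net"));

    /** Index-time expansion: the full token first, then each dot-separated part. */
    static List<String> expand(String hostToken) {
        List<String> out = new ArrayList<>();
        out.add(hostToken);
        for (String part : hostToken.split("\\.")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    /** Heuristic check: accept the token as a host only if its last part is a known suffix. */
    static boolean looksLikeHost(String hostToken) {
        int lastDot = hostToken.lastIndexOf('.');
        if (lastDot < 0 || lastDot == hostToken.length() - 1) {
            return false;
        }
        String suffix = hostToken.substring(lastDot + 1).toLowerCase(Locale.ROOT);
        return KNOWN_SUFFIXES.contains(suffix);
    }

    public static void main(String[] args) {
        System.out.println(expand("www.apache.org"));
        System.out.println(expand("java.lang.NullPointerException"));
        System.out.println(looksLikeHost("apache.org"));
        System.out.println(looksLikeHost("Apache.They"));
    }
}
```

With the expansion applied at index time only, a query for "apache" or for "NullPointerException" would match the indexed parts, while a query for the full "www.apache.org" would still match the original token.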

Again, IMO, the logic in the tokenizer today for HOSTs and ACRONYMs is a "best effort" attempt to produce a meaningful token. If we removed those rules, it'd be impossible to detect them, because the tokenizer is set to discard any stand-alone "&", "." or "@".

I'm going to send out another email to the list about a bug or inconsistency I recently found in the COMPANY rule. I don't want to mix this thread with a different issue.

On Thu, Sep 4, 2008 at 5:17 AM, Mark Lassau <mlassau@...> wrote:
Grant Ingersoll (JIRA) wrote:
Of course, it's still a bit weird, b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one.  This suggests to me that the grammar needs to be revisited, but that can wait until 3.0 I believe.

 
Grant, not sure what you mean by "b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one."
Once we set replaceInvalidAcronym=true, then the type is set to HOST.

However, if you were to revisit the grammar, then I would be interested to get in on the discussion on the behaviour of <HOST>.
For instance, if you have a document like "visit www.apache.org", you currently won't get a hit if you search for "apache".
In an issue tracker like JIRA, we want to be able to search for "NullPointerException", and get a hit for the document "Application threw java.lang.NullPointerException".

Also note that the current implementation has problems if the document doesn't contain expected whitespace.
eg "I like Apache.They rock"
Will get tokenized to the following:
I            <ALPHANUM>
like         <ALPHANUM>
Apache.They  <HOST>
rock         <ALPHANUM>

I don't think there is a simple one-size-fits-all answer to how this should behave. It depends on the context of the app that is using Lucene.
The best answer may be to make some of the behaviour configurable, or have a suite of specific analyzers?

Mark.



[jira] Issue Comment Edited: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628563#action_12628563 ]

marklassau edited comment on LUCENE-1373 at 9/4/08 11:13 PM:
--------------------------------------------------------------

Just discovered LUCENE-1151, which attempts to make StandardAnalyzer NOT be buggy by default.
I think if the changes made to StandardAnalyzer there were moved to StandardTokenizer instead, then we would fix this issue.

 



[jira] Updated: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Lassau updated LUCENE-1373:
--------------------------------

    Attachment: LUCENE-1373.patch

Added a draft patch to fix the default behaviour of StandardTokenizer.
This basically involved moving the logic of LUCENE-1151 from StandardAnalyzer to StandardTokenizer.

I added a unit test for StandardTokenizer, but unfortunately don't have time to add tests for the language analyzers listed above (FrenchAnalyzer etc...).

I will be away for 3 weeks, so if anyone else wants to pick up this issue, that would be great ;) ... otherwise I will come back and look at it then.



[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725972#action_12725972 ]

Rob ten Hove commented on LUCENE-1373:
--------------------------------------

Is it possible that, due to the same bug, a property whose value ends in "Type" (like "InputFileType") is not indexed when the OS language is Dutch? I have two installations of Alfresco 3 Labs with Lucene 2.1.0 autoinstalled, both with exactly the same installation options (English as the language for Alfresco). Apart from the hardware, the main difference is the OS language: both run XP with SP2, but one is English and the other Dutch. In the installation on the Dutch OS, three properties with values ending in "Type" could not be found, whereas they are present in the English version.




[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727390#action_12727390 ]

Mark Lassau commented on LUCENE-1373:
-------------------------------------

@Rob
This issue is about how Lucene parses ACRONYM tokens, which must contain a dot (e.g. "I.B.M."), so your problem is certainly not exactly the same.

Whether it is related to some other issue with Lucene analysers for different languages is not clear.
It depends on the workings of your application, and I would suggest you contact the Alfresco developers with this question.
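
To illustrate the distinction, here is a minimal sketch in plain Java. It is not Lucene code: the class name AcronymSketch, the two regular expressions and the classify helper are all hypothetical, and the patterns only approximate the idea of a loose versus a strict acronym rule. They are not StandardTokenizer's actual grammar.

```java
import java.util.regex.Pattern;

public class AcronymSketch {
    // Hypothetical, simplified patterns; NOT Lucene's actual tokenizer grammar.
    // Strict rule: single letters, each followed by a dot, e.g. "I.B.M."
    static final Pattern STRICT_ACRONYM = Pattern.compile("(?:\\p{L}\\.){2,}");
    // Loose rule: multi-character runs, each followed by a dot. This also
    // accepts a host name with a trailing dot such as "www.apache.org."
    static final Pattern LOOSE_ACRONYM = Pattern.compile("(?:[\\p{L}\\p{N}]+\\.){2,}");

    // Classify a single token under the strict or the loose rule.
    static String classify(String token, boolean strict) {
        Pattern p = strict ? STRICT_ACRONYM : LOOSE_ACRONYM;
        if (p.matcher(token).matches()) {
            return "ACRONYM";
        }
        // Anything else with a dot is lumped together; a token without a dot
        // can never be an acronym under either rule.
        return token.contains(".") ? "HOST_OR_OTHER" : "WORD";
    }

    public static void main(String[] args) {
        String[] tokens = {"I.B.M.", "www.apache.org.", "InputFileType"};
        for (String t : tokens) {
            System.out.println(t + " loose=" + classify(t, false)
                    + " strict=" + classify(t, true));
        }
    }
}
```

Under these assumed rules, a value such as "InputFileType" contains no dot and is never classified as an acronym, while "www.apache.org." is an acronym only under the loose rule.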




[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729181#action_12729181 ]

Rob ten Hove commented on LUCENE-1373:
--------------------------------------

@Mark, thanks for your reply to my question. So far, the developers who worked on the application I was talking about have been able to find a workaround. One thing is certain: the token analyzer mistreats the content, whether that content is an acronym or just plain text. It seems to try to interpret the content of database elements a bit too much rather than just treating it as plain content.




[jira] Resolved: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1373.
----------------------------------------

    Resolution: Duplicate

Duplicate of LUCENE-2002.

