[JIRA] Created: (DSY-467) Support language specific Analyzer (for example DutchAnalyzer) and fall back on StandardAnalyzer

by JIRA issues@cocoondev.org

Support language specific Analyzer (for example DutchAnalyzer) and fall back on StandardAnalyzer
------------------------------------------------------------------------------------------------

         Key: DSY-467
         URL: http://issues.cocoondev.org//browse/DSY-467
     Project: Daisy
        Type: Feature Wish
  Components: Querying and indexing (repository)
    Versions: 2.0.1
    Reporter: Geoffrey De Smet


In this file:
  http://svn.cocoondev.org/repos/daisy/tags/RELEASE_2_0_0/daisy/repository/server/src/java/org/outerj/daisy/ftindex/FullTextIndexImpl.java

we find this code:
    private IndexWriter constructIndexWriter() throws IOException {
        return new IndexWriter(indexDirectory, new StandardAnalyzer());
    }

This shows that Lucene always indexes with StandardAnalyzer. StandardAnalyzer does no stemming, so when we index documents containing words like "werken" and "gewerkt" and then search for "werkte", nothing is found.
For our case the solution is to use a DutchAnalyzer instead.

A more general solution might be: look up the language code of the Daisy branch,
check whether there is an Analyzer for it (for example "nl" => DutchAnalyzer),
and fall back on StandardAnalyzer if there is none.

The problem is that the current FullTextIndexImpl probably uses a single analyzer for everything, independent of the branch in whose scope a document is processed.
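
The lookup-with-fallback idea could be sketched roughly as below. This is only an illustration: AnalyzerResolver is a hypothetical helper that exists in neither Daisy nor Lucene, and the type parameter A stands in for Lucene's Analyzer class so that the sketch compiles with the JDK alone.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper, not part of Daisy or Lucene: maps a language code to
// an analyzer factory and falls back to a default when none is registered.
// A is a stand-in for org.apache.lucene.analysis.Analyzer.
final class AnalyzerResolver<A> {
    private final Map<String, Supplier<A>> byLanguage = new HashMap<>();
    private final Supplier<A> fallback;

    AnalyzerResolver(Supplier<A> fallback) {
        this.fallback = fallback;
    }

    // Register a factory for a language code such as "nl".
    AnalyzerResolver<A> register(String languageCode, Supplier<A> factory) {
        byLanguage.put(normalize(languageCode), factory);
        return this;
    }

    // Return the analyzer for the code, or the fallback for unknown or null codes.
    A resolve(String languageCode) {
        if (languageCode == null) {
            return fallback.get();
        }
        return byLanguage.getOrDefault(normalize(languageCode), fallback).get();
    }

    private static String normalize(String code) {
        return code.trim().toLowerCase(Locale.ROOT);
    }
}
```

With Lucene on the classpath, one would presumably construct it with StandardAnalyzer::new as the fallback and register "nl" with DutchAnalyzer::new. Where the language code comes from in Daisy is left open here.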


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.cocoondev.org//secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
daisy community mailing list
Professional Daisy support: http://outerthought.org/site/services/daisy/daisysupport.html
mail to: daisy@...
list information: http://lists.cocoondev.org/mailman/listinfo/daisy

[JIRA] Commented: (DSY-467) Support language specific Analyzer (for example DutchAnalyzer) and fall back on StandardAnalyzer

by JIRA issues@cocoondev.org

    [ http://issues.cocoondev.org//browse/DSY-467?page=comments#action_13166 ]

Geoffrey De Smet commented on DSY-467:
--------------------------------------

Meanwhile I've asked whether Lucene has some sort of "AutomaticAnalyzerResolver based on Locale":
http://thread.gmane.org/gmane.comp.jakarta.lucene.user/26709

[JIRA] Commented: (DSY-467) Support language specific Analyzer (for example DutchAnalyzer) and fall back on StandardAnalyzer

by JIRA issues@cocoondev.org

    [ http://issues.cocoondev.org//browse/DSY-467?page=comments#action_13167 ]

Bruno Dumon commented on DSY-467:
---------------------------------

Making this adjustment wouldn't be too difficult, but the harder part is that one should typically use the same analyzer for searching as for indexing. Any smart ideas about how to handle this would be helpful.

One approach might be to keep two indexes: one language-neutral one and one to be used when the language is known (i.e. when the search is limited to a certain language variant).
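
A rough sketch of how such a two-index setup might route writes and searches. Everything here is hypothetical: IndexRouter, the index names, and the choice of one index per language are assumptions made for illustration, not existing Daisy code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two-index idea; all names are illustrative.
final class IndexRouter {
    static final String NEUTRAL = "neutral";

    // Indexes a document variant is written to: always the neutral index,
    // plus a language index when the variant's language is known.
    static List<String> indexesForWrite(String languageCode) {
        List<String> targets = new ArrayList<>();
        targets.add(NEUTRAL);
        if (isKnown(languageCode)) {
            targets.add("lang-" + languageCode);
        }
        return targets;
    }

    // Index a search reads from: the language index when the search is
    // limited to one language, otherwise the neutral index.
    static String indexForSearch(String languageCode) {
        return isKnown(languageCode) ? "lang-" + languageCode : NEUTRAL;
    }

    private static boolean isKnown(String languageCode) {
        return languageCode != null && !languageCode.isEmpty();
    }
}
```

The point of routing both sides through the same rule is that a language index is then always searched with the analyzer it was built with.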

[JIRA] Commented: (DSY-467) Support language specific Analyzer (for example DutchAnalyzer) and fall back on StandardAnalyzer

by JIRA issues@cocoondev.org

    [ http://issues.cocoondev.org//browse/DSY-467?page=comments#action_13168 ]

Geoffrey De Smet commented on DSY-467:
--------------------------------------

In our use case everything is Dutch, so simply using the locale of the server would hack-fix it, but that would be pretty fragile (as it depends on the VM locale).

In the bigger use case, however, it's indeed pretty hard. There are a couple of issues:
- Creating a new language doesn't verify the language code, but that wouldn't be needed with the fallback on StandardAnalyzer.
- Each language might need its own index, although some could share an index.
- If a language is modified, its documents would need to be re-indexed.
- To be able to do a general search (in every language), there indeed still needs to be a language-neutral index.
- What impact will indexing each document variant into at most 2 indexes (instead of 1) have?

I was thinking of implementing it as a quick patch on FullTextIndexImpl, but it looks like it's a bit out of my league for now :)
Maybe something for Daisy 2.2 or later? We can recommend the use of Lucene fuzzy searches ("werk~") for now.

[JIRA] Commented: (DSY-467) Support language specific Analyzer (for example DutchAnalyzer) and fall back on StandardAnalyzer

by JIRA issues@cocoondev.org

    [ http://issues.cocoondev.org//browse/DSY-467?page=comments#action_13734 ]

Karel Vervaeke commented on DSY-467:
------------------------------------

Would it be possible to specify multiple analyzers for searching? I think the impact would be limited (assuming analyzers usually produce the same terms).
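
One way to picture searching with multiple analyzers: run the search term through each analyzer and OR the distinct results together. The sketch below is hypothetical; MultiAnalyzerQuery is not Daisy or Lucene code, and plain String functions stand in for real Lucene analyzers.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;
import java.util.function.UnaryOperator;

// Hypothetical sketch: each "analyzer" is a String -> String function that
// stands in for a real Lucene Analyzer.
final class MultiAnalyzerQuery {

    // Run the raw term through every analyzer and keep the distinct results
    // in the order the analyzers were given.
    static Set<String> expand(String term, List<UnaryOperator<String>> analyzers) {
        Set<String> terms = new LinkedHashSet<>();
        for (UnaryOperator<String> analyzer : analyzers) {
            terms.add(analyzer.apply(term));
        }
        return terms;
    }

    // OR the expanded terms together in Lucene query-parser syntax.
    static String toQuery(String field, String term, List<UnaryOperator<String>> analyzers) {
        StringJoiner query = new StringJoiner(" OR ");
        for (String expanded : expand(term, analyzers)) {
            query.add(field + ":" + expanded);
        }
        return query.toString();
    }
}
```

When the analyzers agree on a term, the query collapses to a single clause, which matches the assumption that the impact is limited. A stemmed term only finds something if the index itself contains stemmed terms, so this does not remove the need to index with a matching analyzer.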

[JIRA] Updated: (DSY-467) Support language specific Analyzer (for example DutchAnalyzer) and fall back on StandardAnalyzer

by JIRA issues@cocoondev.org

     [ http://issues.cocoondev.org//browse/DSY-467?page=all ]

Aurélien Pelletier updated DSY-467:
-----------------------------------

    Attachment: luceneAnalyser.patch

This patch was made on the 2.3 branch of the daisy-repository-server-impl project.
It lets you configure the Lucene analyzer used for indexing and searching.

To change the analyzer, specify it under configuration in the /daisy/repository/fullTextIndex section of the myconfig file:

  <target path="/daisy/repository/fullTextIndex">
    <configuration>
      <!-- directory where index files should be stored -->
      <indexDirectory>${daisy.datadir}/indexstore</indexDirectory>
      <luceneAnalyzerClass>org.apache.lucene.analysis.fr.FrenchAnalyzer</luceneAnalyzerClass>
    </configuration>
  </target>

lucene-analyzers-2.2.0.jar should be in your Maven repository:
http://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers/2.2.0/lucene-analyzers-2.2.0.jar
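
For readers without the patch at hand, a setting like luceneAnalyzerClass is typically turned into an object via reflection. The sketch below is not taken from the patch: AnalyzerLoader is a hypothetical helper, and it assumes the configured class has a public no-argument constructor.

```java
// Hypothetical helper, not the contents of luceneAnalyser.patch.
final class AnalyzerLoader {

    // Instantiate className through its no-argument constructor and check that
    // it is of the expected type; return the fallback if the name is missing,
    // the class cannot be loaded or constructed, or the type does not match.
    static <T> T loadOrDefault(String className, Class<T> type, T fallback) {
        if (className == null || className.trim().isEmpty()) {
            return fallback;
        }
        try {
            Object instance = Class.forName(className.trim())
                    .getDeclaredConstructor()
                    .newInstance();
            return type.cast(instance);
        } catch (ReflectiveOperationException | ClassCastException | LinkageError e) {
            return fallback;
        }
    }
}
```

With Lucene on the classpath, the call would presumably look like loadOrDefault(configuredName, Analyzer.class, new StandardAnalyzer()), which also gives the fallback on StandardAnalyzer that the issue title asks for.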

