Accent-insensitive searches

View: New views
9 Messages — Rating Filter:   Alert me  

Accent-insensitive searches

by josu :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I'm working on an database application. Items in the database are all in spanish language. It's mandatory that searches are accent-insensitive, meaning that, for example, a search for the word 'electrico' (no accent) must return entrances containing 'eléctrico' (with accent).

Searching the web for a solution, I find I must set these two properties when creating the database:

territory=es_ES
collation=TERRITORY_BASED

But it still doesn't work this way. Looks like the default collation for es_ES is still accent-sensitive.

So I try to use a custom collator that will behave as I need to. I find some instructions for this in the following blog:

 http://blogs.sun.com/kah/entry/user_defined_collation_in_apache

 In brief, I define a new CollatorProvider and register it with the JVM. Here's the code for this class:


public class IgnoraAcentosCollatorProvider extends java.text.spi.CollatorProvider {

    @Override
    public Collator getInstance(Locale locale) {
        if (!locale.equals(new Locale("es","ES","accentinsensitive"))){
            throw new IllegalArgumentException("Solo acepta es_ES_accentinsensitive");
        }
        Collator c=Collator.getInstance(new Locale("es","ES"));
        c.setStrength(Collator.PRIMARY);
        return c;
    }

    @Override
    public Locale[] getAvailableLocales() {
        return new Locale[]{
            new Locale("es","ES","accentinsensitive")
        };
    }

}


 It simply takes the default es_ES Collator and changes strength to PRIMARY. This makes the collator return 0 when comparing 'electrico' and 'eléctrico'.

After making sure this new Collator is available for the JVM, I re-start Derby and make a new database, now setting territory=es_ES_accentinsensitive

The database is created without errors (meaning Derby reaches my Collator), but searches are still accent-sensitive (no matter if I use = or LIKE operators).

Any clue? I made intensive searches about this issue but I found no solution. I can avoid the problem simply using MySQL (the default spanish configuration has already the desired behaviour) but I would like to keep on using Derby if possible.

I'm using JavaDB-Derby 10.4.2.1

Thanks.

Re: Accent-insensitive searches

by Sylvain Leroux :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Josu,

And sorry for the late reply.


I've tried the exact same thing as you yesterday - except I have used
Locale.FRENCH as /base/ locale:
 >         Collator c=Collator.getInstance(Locale.FRENCH);

It works like a charm (both with = and LIKE).
I am using Apache Derby 10.5.1.1 and Sun JDK 1.6.0_12.


Maybe you should try with a more recent version of Derby (?)


Otherwise, check that you've specified your custom collator at DB
creation time. Not when booting the DB: since even if you specify an
other collator when you boot an existing DB, it's still the collator
specified at creation time that is used.


Finally, the most doubtful one: accent difference might be a PRIMARY
difference for the default es_ES collator (??). But, that would be
really surprising...


Let us know if you find the answer!

Best regards,
Sylvain


josu a écrit :

>
> I'm working on an database application. Items in the database are all in
> spanish language. It's mandatory that searches are accent-insensitive,
> meaning that, for example, a search for the word 'electrico' (no accent)
> must return entrances containing 'eléctrico' (with accent).
>
> Searching the web for a solution, I find I must set these two properties
> when creating the database:
>
> territory=es_ES
> collation=TERRITORY_BASED
>
> But it still doesn't work this way. Looks like the default collation for
> es_ES is still accent-sensitive.
>
> So I try to use a custom collator that will behave as I need to. I find some
> instructions for this in the following blog:
>
>   http://blogs.sun.com/kah/entry/user_defined_collation_in_apache
> http://blogs.sun.com/kah/entry/user_defined_collation_in_apache 
>
>  In brief, I define a new CollatorProvider and register it with the JVM.
> Here's the code for this class:
>
>
> public class IgnoraAcentosCollatorProvider extends
> java.text.spi.CollatorProvider {
>
>     @Override
>     public Collator getInstance(Locale locale) {
>         if (!locale.equals(new Locale("es","ES","accentinsensitive"))){
>             throw new IllegalArgumentException("Solo acepta
> es_ES_accentinsensitive");
>         }
>         Collator c=Collator.getInstance(new Locale("es","ES"));
>         c.setStrength(Collator.PRIMARY);
>         return c;
>     }
>
>     @Override
>     public Locale[] getAvailableLocales() {
>         return new Locale[]{
>             new Locale("es","ES","accentinsensitive")
>         };
>     }
>
> }
>
>
>  It simply takes the default es_ES Collator and changes strength to PRIMARY.
> This makes the collator return 0 when comparing 'electrico' and 'eléctrico'.
>
> After making sure this new Collator is available for the JVM, I re-start
> Derby and make a new database, now setting territory=es_ES_accentinsensitive
>
> The database is created without errors (meaning Derby reaches my Collator),
> but searches are still accent-sensitive (no matter if I use = or LIKE
> operators).
>
> Any clue? I made intensive searches about this issue but I found no
> solution. I can avoid the problem simply using MySQL (the default spanish
> configuration has already the desired behaviour) but I would like to keep on
> using Derby if possible.
>
> I'm using JavaDB-Derby 10.4.2.1
>
> Thanks.


--
Website: http://www.chicoree.fr



Re: Accent-insensitive searches

by Knut Anders Hatlen :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

josu <josu.eh.mondo@...> writes:

> I'm working on an database application. Items in the database are all in
> spanish language. It's mandatory that searches are accent-insensitive,
> meaning that, for example, a search for the word 'electrico' (no accent)
> must return entrances containing 'eléctrico' (with accent).
>
> Searching the web for a solution, I find I must set these two properties
> when creating the database:
>
> territory=es_ES
> collation=TERRITORY_BASED
>
> But it still doesn't work this way. Looks like the default collation for
> es_ES is still accent-sensitive.
>
> So I try to use a custom collator that will behave as I need to. I find some
> instructions for this in the following blog:
>
>   http://blogs.sun.com/kah/entry/user_defined_collation_in_apache
> http://blogs.sun.com/kah/entry/user_defined_collation_in_apache 
>
>  In brief, I define a new CollatorProvider and register it with the JVM.
> Here's the code for this class:
>
>
> public class IgnoraAcentosCollatorProvider extends
> java.text.spi.CollatorProvider {
>
>     @Override
>     public Collator getInstance(Locale locale) {
>         if (!locale.equals(new Locale("es","ES","accentinsensitive"))){
>             throw new IllegalArgumentException("Solo acepta
> es_ES_accentinsensitive");
>         }
>         Collator c=Collator.getInstance(new Locale("es","ES"));
>         c.setStrength(Collator.PRIMARY);
>         return c;
>     }
>
>     @Override
>     public Locale[] getAvailableLocales() {
>         return new Locale[]{
>             new Locale("es","ES","accentinsensitive")
>         };
>     }
>
> }
>
>
>  It simply takes the default es_ES Collator and changes strength to PRIMARY.
> This makes the collator return 0 when comparing 'electrico' and 'eléctrico'.
>
> After making sure this new Collator is available for the JVM, I re-start
> Derby and make a new database, now setting territory=es_ES_accentinsensitive
>
> The database is created without errors (meaning Derby reaches my Collator),
> but searches are still accent-sensitive (no matter if I use = or LIKE
> operators).
>
> Any clue? I made intensive searches about this issue but I found no
> solution. I can avoid the problem simply using MySQL (the default spanish
> configuration has already the desired behaviour) but I would like to keep on
> using Derby if possible.
>
> I'm using JavaDB-Derby 10.4.2.1

Hi,

It seems to work in my environment. I used the collator provider class
that you posted and performed the steps below. (Note that when you set
the strength of the collator to PRIMARY, it will be case-insensitive in
addition to being accent-insensitive. Making it accent-insensitive and
at the same time case-sensitive is probably more work, but I think it
should be doable.)

kah@tecra:/tmp/coll % javac IgnoraAcentosCollatorProvider.java
kah@tecra:/tmp/coll % mkdir -p META-INF/services            
kah@tecra:/tmp/coll % echo IgnoraAcentosCollatorProvider > META-INF/services/java.text.spi.CollatorProvider                                                     kah@tecra:/tmp/coll % jar cf coll.jar IgnoraAcentosCollatorProvider* META-INF
kah@tecra:/tmp/coll % java -version      
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) Server VM (build 14.1-b02, mixed mode)
kah@tecra:/tmp/coll % java -Djava.ext.dirs=. -jar /code/derby/oldreleases/10.4.2.0/derbyrun.jar ij
ij version 10.4
ij> connect 'jdbc:derby:mydb;create=true;territory=es_ES_accentinsensitive;collation=TERRITORY_BASED';
ij> create table t(x varchar(20));
0 rows inserted/updated/deleted
ij> insert into t values 'electrico','eléctrico','Electrico';
3 rows inserted/updated/deleted
ij> select * from t where x='electrico';
X                  
--------------------
electrico          
eléctrico          
Electrico          

3 rows selected
ij>


You can verify that the database was created with the correct territory
and collation by evaluating this in IJ and see that "TERRITORY_BASED" is
returned:

ij> values syscs_util.syscs_get_database_property('derby.database.collation');
1                                                                                                                              
-------------------------
TERRITORY_BASED

1 row selected

And also check that service.properties in the database directory
contains the following line:

derby.serviceLocale=es_ES_accentinsensitive

Hope this helps,

--
Knut Anders

Re: Accent-insensitive searches

by dev@xx :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Bonsoir Sylvain,

I was just starting to look for the way to implement case insensitive and
accent insensitive SELECT in Derby when I received this mail.

I tried what you suggest but I can't make it working. I kown, it is not a
purely Derby question but I guess that those who are not familiar with spi
may have the same problem, so I post on this mailing list.

I have defined my jar with the META-INF file and ServiceLoader.load(
java.text.spi.CollatorProvider.class ) call is returning my class correctly.
Anyway Derby throw ERROR XBM04 as it can't find my provider.
I search into the java API source and found that
ServiceLoader.loadInstalled() was called by
sun.util.LocaleServiceProviderPool (Locale.getAvailableLocales())
And javadoc says : "This method is intended for use when only installed
providers are desired. The resulting service will only find and load
providers that have been installed into the current Java virtual machine;
providers on the application's class path will be ignored."

Does that mean I have to deploy my jar CollatorProvider somewhere else than
in the application class path ? It should be a deployment problem for me...
What means "installed into the current Java virtual machine" ?

It is a shame that DERBY doesn't implement an easier way to do such frequent
action. A case insensitive, accent insensitive behaviour option **should
be** a standard DERBY feature (as SQL Server has and ORACLE don't have) !

Thanks

jylaxx


----- Original Message -----
From: "Sylvain Leroux" <sl20@...>
To: "Derby Discussion" <derby-user@...>
Sent: Friday, September 11, 2009 10:33 AM
Subject: Re: Accent-insensitive searches


> Hi Josu,
>
> And sorry for the late reply.
>
>
> I've tried the exact same thing as you yesterday - except I have used
> Locale.FRENCH as /base/ locale:
> >         Collator c=Collator.getInstance(Locale.FRENCH);
>
> It works like a charm (both with = and LIKE).
> I am using Apache Derby 10.5.1.1 and Sun JDK 1.6.0_12.
>
>
> Maybe you should try with a more recent version of Derby (?)
>
>
> Otherwise, check that you've specified your custom collator at DB creation
> time. Not when booting the DB: since even if you specify an other collator
> when you boot an existing DB, it's still the collator specified at
> creation time that is used.
>
>
> Finally, the most doubtful one: accent difference might be a PRIMARY
> difference for the default es_ES collator (??). But, that would be really
> surprising...
>
>
> Let us know if you find the answer!
>
> Best regards,
> Sylvain
>
>
> josu a écrit :
>>
>> I'm working on an database application. Items in the database are all in
>> spanish language. It's mandatory that searches are accent-insensitive,
>> meaning that, for example, a search for the word 'electrico' (no accent)
>> must return entrances containing 'eléctrico' (with accent).
>>
>> Searching the web for a solution, I find I must set these two properties
>> when creating the database: territory=es_ES
>> collation=TERRITORY_BASED
>>
>> But it still doesn't work this way. Looks like the default collation for
>> es_ES is still accent-sensitive.
>>
>> So I try to use a custom collator that will behave as I need to. I find
>> some
>> instructions for this in the following blog:
>>
>>   http://blogs.sun.com/kah/entry/user_defined_collation_in_apache
>> http://blogs.sun.com/kah/entry/user_defined_collation_in_apache In brief,
>> I define a new CollatorProvider and register it with the JVM.
>> Here's the code for this class:
>>
>>
>> public class IgnoraAcentosCollatorProvider extends
>> java.text.spi.CollatorProvider {
>>
>>     @Override
>>     public Collator getInstance(Locale locale) {
>>         if (!locale.equals(new Locale("es","ES","accentinsensitive"))){
>>             throw new IllegalArgumentException("Solo acepta
>> es_ES_accentinsensitive");
>>         }
>>         Collator c=Collator.getInstance(new Locale("es","ES"));
>>         c.setStrength(Collator.PRIMARY);
>>         return c;
>>     }
>>
>>     @Override
>>     public Locale[] getAvailableLocales() {
>>         return new Locale[]{
>>             new Locale("es","ES","accentinsensitive")
>>         };
>>     }
>>
>> }
>>
>>
>>  It simply takes the default es_ES Collator and changes strength to
>> PRIMARY.
>> This makes the collator return 0 when comparing 'electrico' and
>> 'eléctrico'.
>>
>> After making sure this new Collator is available for the JVM, I re-start
>> Derby and make a new database, now setting
>> territory=es_ES_accentinsensitive
>>
>> The database is created without errors (meaning Derby reaches my
>> Collator),
>> but searches are still accent-sensitive (no matter if I use = or LIKE
>> operators).
>>
>> Any clue? I made intensive searches about this issue but I found no
>> solution. I can avoid the problem simply using MySQL (the default spanish
>> configuration has already the desired behaviour) but I would like to keep
>> on
>> using Derby if possible.
>>
>> I'm using JavaDB-Derby 10.4.2.1
>>
>> Thanks.
>
>
> --
> Website: http://www.chicoree.fr
>
>
>


Re: Accent-insensitive searches

by dev@xx :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Well I found part of the answer by my own.
So the jar need to be in the jre lib/ext directory.
Is that compulsory ? I deploy all my application in only one directory, it
is really a problem to have to put something into the jre/lib/ext  !!

jylaxx

----- Original Message -----
From: "dev@xx" <dev@...>
To: "Derby Discussion" <derby-user@...>
Sent: Friday, September 11, 2009 6:57 PM
Subject: Re: Accent-insensitive searches


> Bonsoir Sylvain,
>
> I was just starting to look for the way to implement case insensitive and
> accent insensitive SELECT in Derby when I received this mail.
>
> I tried what you suggest but I can't make it working. I kown, it is not a
> purely Derby question but I guess that those who are not familiar with spi
> may have the same problem, so I post on this mailing list.
>
> I have defined my jar with the META-INF file and ServiceLoader.load(
> java.text.spi.CollatorProvider.class ) call is returning my class
> correctly.
> Anyway Derby throw ERROR XBM04 as it can't find my provider.
> I search into the java API source and found that
> ServiceLoader.loadInstalled() was called by
> sun.util.LocaleServiceProviderPool (Locale.getAvailableLocales())
> And javadoc says : "This method is intended for use when only installed
> providers are desired. The resulting service will only find and load
> providers that have been installed into the current Java virtual machine;
> providers on the application's class path will be ignored."
>
> Does that mean I have to deploy my jar CollatorProvider somewhere else
> than in the application class path ? It should be a deployment problem for
> me...
> What means "installed into the current Java virtual machine" ?
>
> It is a shame that DERBY doesn't implement an easier way to do such
> frequent action. A case insensitive, accent insensitive behaviour option
> **should be** a standard DERBY feature (as SQL Server has and ORACLE don't
> have) !
>
> Thanks
>
> jylaxx
>
>
> ----- Original Message -----
> From: "Sylvain Leroux" <sl20@...>
> To: "Derby Discussion" <derby-user@...>
> Sent: Friday, September 11, 2009 10:33 AM
> Subject: Re: Accent-insensitive searches
>
>
>> Hi Josu,
>>
>> And sorry for the late reply.
>>
>>
>> I've tried the exact same thing as you yesterday - except I have used
>> Locale.FRENCH as /base/ locale:
>> >         Collator c=Collator.getInstance(Locale.FRENCH);
>>
>> It works like a charm (both with = and LIKE).
>> I am using Apache Derby 10.5.1.1 and Sun JDK 1.6.0_12.
>>
>>
>> Maybe you should try with a more recent version of Derby (?)
>>
>>
>> Otherwise, check that you've specified your custom collator at DB
>> creation time. Not when booting the DB: since even if you specify an
>> other collator when you boot an existing DB, it's still the collator
>> specified at creation time that is used.
>>
>>
>> Finally, the most doubtful one: accent difference might be a PRIMARY
>> difference for the default es_ES collator (??). But, that would be really
>> surprising...
>>
>>
>> Let us know if you find the answer!
>>
>> Best regards,
>> Sylvain
>>
>>
>> josu a écrit :
>>>
>>> I'm working on an database application. Items in the database are all in
>>> spanish language. It's mandatory that searches are accent-insensitive,
>>> meaning that, for example, a search for the word 'electrico' (no accent)
>>> must return entrances containing 'eléctrico' (with accent).
>>>
>>> Searching the web for a solution, I find I must set these two properties
>>> when creating the database: territory=es_ES
>>> collation=TERRITORY_BASED
>>>
>>> But it still doesn't work this way. Looks like the default collation for
>>> es_ES is still accent-sensitive.
>>>
>>> So I try to use a custom collator that will behave as I need to. I find
>>> some
>>> instructions for this in the following blog:
>>>
>>>   http://blogs.sun.com/kah/entry/user_defined_collation_in_apache
>>> http://blogs.sun.com/kah/entry/user_defined_collation_in_apache In
>>> brief, I define a new CollatorProvider and register it with the JVM.
>>> Here's the code for this class:
>>>
>>>
>>> public class IgnoraAcentosCollatorProvider extends
>>> java.text.spi.CollatorProvider {
>>>
>>>     @Override
>>>     public Collator getInstance(Locale locale) {
>>>         if (!locale.equals(new Locale("es","ES","accentinsensitive"))){
>>>             throw new IllegalArgumentException("Solo acepta
>>> es_ES_accentinsensitive");
>>>         }
>>>         Collator c=Collator.getInstance(new Locale("es","ES"));
>>>         c.setStrength(Collator.PRIMARY);
>>>         return c;
>>>     }
>>>
>>>     @Override
>>>     public Locale[] getAvailableLocales() {
>>>         return new Locale[]{
>>>             new Locale("es","ES","accentinsensitive")
>>>         };
>>>     }
>>>
>>> }
>>>
>>>
>>>  It simply takes the default es_ES Collator and changes strength to
>>> PRIMARY.
>>> This makes the collator return 0 when comparing 'electrico' and
>>> 'eléctrico'.
>>>
>>> After making sure this new Collator is available for the JVM, I re-start
>>> Derby and make a new database, now setting
>>> territory=es_ES_accentinsensitive
>>>
>>> The database is created without errors (meaning Derby reaches my
>>> Collator),
>>> but searches are still accent-sensitive (no matter if I use = or LIKE
>>> operators).
>>>
>>> Any clue? I made intensive searches about this issue but I found no
>>> solution. I can avoid the problem simply using MySQL (the default
>>> spanish
>>> configuration has already the desired behaviour) but I would like to
>>> keep on
>>> using Derby if possible.
>>>
>>> I'm using JavaDB-Derby 10.4.2.1
>>>
>>> Thanks.
>>
>>
>> --
>> Website: http://www.chicoree.fr
>>
>>
>>
>
>


Re: Accent-insensitive searches

by Sylvain Leroux :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Salut Jylaxx,

dev@xx a écrit :
>
> Well I found part of the answer by my own.
> So the jar need to be in the jre lib/ext directory.
> Is that compulsory ? I deploy all my application in only one directory,
> it is really a problem to have to put something into the jre/lib/ext  !!
>
> jylaxx

You may override the standard Java property java.ext.dirs to specify an
other directory for your Java extensions:

java -Djava.ext.dirs=/path/to/my/ext ...

Starting with JDK6 (I think), you may even specify several directories:
java -Djava.ext.dirs=/path/to/my/ext:/usr/local/lib/java/ext ...

See:
http://java.sun.com/docs/books/tutorial/ext/basics/install.html

Kind regards,
Sylvain

--
Website: http://www.chicoree.fr



Re: Accent-insensitive searches

by josu :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Thanks Sylvain and Knut for your answers.

Sylvain Leroux wrote:
Maybe you should try with a more recent version of Derby (?)
Looks like Knut tried and succeeded using version 10.4.2.0...

Sylvain Leroux wrote:
Otherwise, check that you've specified your custom collator at DB
creation time. Not when booting the DB: since even if you specify an
other collator when you boot an existing DB, it's still the collator
specified at creation time that is used.
That's the way I'm doing it

Sylvain Leroux wrote:
Finally, the most doubtful one: accent difference might be a PRIMARY
difference for the default es_ES collator (??). But, that would be
really surprising...
I already checked that the collator works right when setting strength to PRIMARY

Knut Anders Hatlen wrote:
You can verify that the database was created with the correct territory
and collation by evaluating this in IJ and see that "TERRITORY_BASED" is
returned:
...
And also check that service.properties in the database directory
contains the following line:
 
derby.serviceLocale=es_ES_accentinsensitive
I checked both, and they're ok. But the result of the queries is still accent-sensitive. I'm working with Windows XP and it seems you're working with Linux, could that be the reason of the different behaviour?

Thanks again for your help,

Josu Larrea

Re: Accent-insensitive searches

by Knut Anders Hatlen :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

josu <josu.eh.mondo@...> writes:

> Knut Anders Hatlen wrote:
>>
>> You can verify that the database was created with the correct territory
>> and collation by evaluating this in IJ and see that "TERRITORY_BASED" is
>> returned:
>> ...
>> And also check that service.properties in the database directory
>> contains the following line:
>>  
>> derby.serviceLocale=es_ES_accentinsensitive
>>
>
> I checked both, and they're ok. But the result of the queries is still
> accent-sensitive. I'm working with Windows XP and it seems you're working
> with Linux, could that be the reason of the different behaviour?

Maybe. I've only tried it on Solaris, and I don't know if there are any
differences on other platforms.

To check that the Java VM picks up your collator correctly, you could
try this code and see what it prints:

    Locale locale = new Locale("es", "ES", "accentinsensitive");
    Collator c = Collator.getInstance(locale);
    System.out.println(c.compare("electrico", "eléctrico"));

"0" should be printed if your custom collator has been correctly
installed.

--
Knut Anders

Re: Accent-insensitive searches

by josu :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Thanks again. I already checked that, and it's properly working. So it seems to be Derby who isn't using my collator when making comparisons, even if I set it at the databases collator...

Knut Anders Hatlen wrote:
josu <josu.eh.mondo@gmail.com> writes:
> Knut Anders Hatlen wrote:
>>
>> You can verify that the database was created with the correct territory
>> and collation by evaluating this in IJ and see that "TERRITORY_BASED" is
>> returned:
>> ...
>> And also check that service.properties in the database directory
>> contains the following line:
>>  
>> derby.serviceLocale=es_ES_accentinsensitive
>>
>
> I checked both, and they're ok. But the result of the queries is still
> accent-sensitive. I'm working with Windows XP and it seems you're working
> with Linux, could that be the reason of the different behaviour?

Maybe. I've only tried it on Solaris, and I don't know if there are any
differences on other platforms.

To check that the Java VM picks up your collator correctly, you could
try this code and see what it prints:

    Locale locale = new Locale("es", "ES", "accentinsensitive");
    Collator c = Collator.getInstance(locale);
    System.out.println(c.compare("electrico", "eléctrico"));

"0" should be printed if your custom collator has been correctly
installed.

--
Knut Anders