Unsafe characters in a collation


Unsafe characters in a collation

by Doug Doole



I'm getting the list of unsafe characters in a collation using
ucol_getUnsafeSet() and looking at the properties of the characters. I can
divide the characters into several classes:

Surrogate characters - It makes sense that leading surrogates are unsafe,
but why are trailing surrogates unsafe?

Leading characters of contractions - These make sense.

Combining characters - These make sense.

Other - Close to 1000 characters fall into this bucket (even for the ROOT
collation). What other properties would make a character unsafe? Some
sample characters from this bucket:
    U+00C0 - U+00C5, U+00C7 - U+00CF (at a quick glance, it looks like
    most accented Latin characters are in this bucket)
    U+0200 - U+021B
    U+1E00 - U+1E99

When generating upper and lower bounds for a collation-based range, I trim
all trailing unsafe characters from the prefix for the bound. Currently I
am seeing occasional performance problems because the prefix is made
entirely of unsafe characters, which results in the range spanning the
entire data set. I'm hoping that I can be a little smarter and only trim
the last 1 or 2 unsafe characters based on which class they fall into. (For
example, in the Slovak collation, the letter C is unsafe because of the CH
contraction. Today, if I am given a pattern starting ABCC, I trim it to AB
(which can greatly expand my search range). However, knowing that C is
unsafe because of the contraction, I should be able to assert that the
first C is safe and therefore use ABC as my trimmed prefix.)
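
The two trimming strategies described above can be sketched roughly as
follows. This is only an illustration in Python, not ICU code: the function
names and the two input sets (contraction_starts, other_unsafe) are
hypothetical and would have to be derived from the collator's unsafe set.
It mirrors the ABCC example and has not been checked against real collation
behaviour.

```python
def trim_all_unsafe(prefix, unsafe):
    """Current approach: drop every trailing character found in `unsafe`."""
    end = len(prefix)
    while end > 0 and prefix[end - 1] in unsafe:
        end -= 1
    return prefix[:end]


def trim_by_class(prefix, contraction_starts, other_unsafe):
    """Class-aware approach (hypothetical).

    Characters in `other_unsafe` (for example combining marks) are trimmed
    for as long as they appear at the end. A character that is unsafe only
    because it can start a contraction is trimmed once, and trimming stops.
    A character present in both sets is treated as `other_unsafe`.
    """
    end = len(prefix)
    while end > 0 and prefix[end - 1] in other_unsafe:
        end -= 1
    if end > 0 and prefix[end - 1] in contraction_starts:
        end -= 1
    return prefix[:end]


# Slovak-style example: C is unsafe because of the CH contraction.
naive = trim_all_unsafe("ABCC", {"C"})
smart = trim_by_class("ABCC", {"C"}, set())
```

With these inputs the first function yields AB and the second yields ABC,
so a prefix made entirely of contraction starters no longer collapses to
the empty string.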

Any advice or suggestions would be very welcome.

(ucol_getBound() doesn't work for me because it's not collation specific,
and I need the prefix as characters, not in sort key form.)


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
icu-support mailing list - icu-support@...
To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-support

Re: Unsafe characters in a collation

by Vladimir Weinstein


These characters come from the set [[:^tccc=0:][:^lccc=0:]] (see
ucol_sit.cpp:1147). They are there because they have a non-zero
leading or trailing canonical combining class. This in turn means
that they can reorder with following combining characters (for
example, U+00C0 U+0325 has an NFD form of U+0041 U+0325 U+0300). This
affects ordering.
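
The reordering can be reproduced outside ICU. The sketch below uses
Python's standard unicodedata module rather than the ICU API; the helper
names lccc and tccc are hypothetical and follow the usual definition (the
combining class of the first and last code point of the NFD form).

```python
import unicodedata


def lccc(ch):
    """Combining class of the first code point of ch's NFD form."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])


def tccc(ch):
    """Combining class of the last code point of ch's NFD form."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[-1])


# U+00C0 followed by U+0325 (combining ring below, class 220): the ring
# moves in front of the grave accent (class 230) during normalization.
nfd = unicodedata.normalize("NFD", "\u00c0\u0325")
code_points = ["%04X" % ord(c) for c in nfd]

# Code points in U+00C0..U+00CF whose trailing combining class is zero.
zero_tccc = [cp for cp in range(0xC0, 0xD0) if tccc(chr(cp)) == 0]
```

Here code_points comes out as 0041, 0325, 0300, and the only code point in
U+00C0..U+00CF with a zero trailing class is U+00C6, which matches the gap
in the sample ranges from the original question.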

Shortening the prefix can be done in many ways. Perhaps having
different sets of unsafe characters based on the reason for their
unsafeness could help there. The different categories of unsafe
characters are clearly defined in the procedure's code.
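
As a rough illustration of splitting the characters by reason, the sketch
below tags a single character with the reasons discussed in this thread.
It is written in Python with the standard unicodedata module, not against
ICU; the function name, the reason labels, and the contraction_starts
argument are all hypothetical, and the contraction data would have to come
from the collator in question.

```python
import unicodedata


def unsafe_reasons(ch, contraction_starts=frozenset()):
    """Return a set of hypothetical labels explaining why ch may be unsafe."""
    reasons = set()
    # Surrogate code points are handled first and not normalized.
    if unicodedata.category(ch) == "Cs":
        reasons.add("surrogate")
        return reasons
    nfd = unicodedata.normalize("NFD", ch)
    if unicodedata.combining(nfd[0]) != 0:
        reasons.add("lccc")  # starts with a combining mark
    if unicodedata.combining(nfd[-1]) != 0:
        reasons.add("tccc")  # ends with a combining mark
    if ch in contraction_starts:
        reasons.add("contraction")  # can begin a contraction such as CH
    return reasons


slovak_starts = frozenset({"C", "c"})  # assumed input, for illustration only
a_grave = unsafe_reasons("\u00c0", slovak_starts)
letter_c = unsafe_reasons("C", slovak_starts)
```

A caller could then build one set per label and apply a different trimming
rule to each, instead of treating the whole unsafe set uniformly.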

Hope this helps.

Regards,
v.

(In reply to Doug Doole's message of May 3, 2007, 2:01 PM.)

