Unsafe characters in a collation

I'm getting the list of unsafe characters in a collation using ucol_getUnsafeSet() and looking at the properties of the characters. I can divide the characters into several classes:

Surrogate characters - It makes sense that leading surrogates are unsafe, but why are trailing surrogates unsafe?

Leading characters of contractions - These make sense.

Combining characters - These make sense.

Other - Close to 1000 characters fall into this bucket (even for the ROOT collation). What other properties would make a character unsafe? Some sample characters from this bucket:
    U+00C0 - U+00C5, U+00C7 - U+00CF (at a quick glance, it looks like most of the accented Latin characters are in this bucket)
    U+0200 - U+021B
    U+1E00 - U+1E99

When generating upper and lower bounds for a collation-based range, I trim all trailing unsafe characters from the prefix for the bound. Currently I am seeing occasional performance problems because the prefix is made entirely of unsafe characters, which results in the range spanning the entire data set. I'm hoping that I can be a little smarter and only trim the last 1 or 2 unsafe characters based on which class they fall into. (For example, in the Slovak collation, the letter C is unsafe because of the CH contraction. Today, if I am given a pattern starting ABCC, I trim it to AB, which can greatly expand my search range. However, knowing that C is unsafe because of the contraction, I should be able to assert that the first C is safe and therefore use ABC as my trimmed prefix.)

Any advice or suggestions would be very welcome.

(ucol_getBound() doesn't work for me because it's not collation specific, and I need the prefix as characters, not in sort key form.)

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
icu-support mailing list - icu-support@...
To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-support
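The two trimming strategies described in the question can be compared with a small sketch. This is only an illustration of the poster's idea, not ICU code: the function names, the plain Python sets standing in for the result of ucol_getUnsafeSet(), and the sample Slovak contraction list are all assumptions made for the example. Whether the contraction-aware rule is sufficient for a real collator is not confirmed anywhere in this thread.

```python
def trim_all_unsafe(prefix, unsafe):
    """Current approach: drop every trailing unsafe character."""
    end = len(prefix)
    while end > 0 and prefix[end - 1] in unsafe:
        end -= 1
    return prefix[:end]


def trim_class_aware(prefix, contractions, other_unsafe):
    """Proposed refinement (hypothetical): a contraction starter is only
    trimmed when the characters after it could still complete a contraction.
    Characters that are unsafe for any other reason are always trimmed."""
    end = len(prefix)
    while end > 0:
        ch = prefix[end - 1]
        tail = prefix[end - 1:]  # ch plus everything that follows it
        if ch in other_unsafe:
            end -= 1  # no refinement for this class: stay conservative
        elif any(c.startswith(tail[:len(c)]) for c in contractions):
            end -= 1  # tail matches, or could grow into, a contraction
        else:
            break  # ch cannot interact with what follows: stop trimming
    return prefix[:end]


# Assumed sample data, not read from an ICU collator.
SAMPLE_CONTRACTIONS = {"CH", "Ch", "ch"}
SAMPLE_UNSAFE = {"C", "c"}

print(trim_all_unsafe("ABCC", SAMPLE_UNSAFE))            # AB
print(trim_class_aware("ABCC", SAMPLE_CONTRACTIONS, set()))  # ABC
```

In this sketch the final C is still dropped, because the unknown next character could be an H, while the earlier C is kept because "CC" cannot begin any contraction in the sample list.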
Re: Unsafe characters in a collation

These characters come from the set [[:^tccc=0:][:^lccc=0:]] (see ucol_sit.cpp:1147). They are there because they have a non-zero leading or trailing canonical combining class. This in turn means that they can reorder with following combining characters (for example, 00c0 0325 has an NFD form of 0041 0325 0300). This affects ordering.

Shortening the prefix can be done in many ways. Perhaps having different sets of unsafe characters based on the reason for their unsafeness could help there. Different categories of unsafe characters are clearly defined in procedure code.

Hope this helps.

Regards,
v.

On May 3, 2007, at 2:01 PM, Doug Doole wrote:

> I'm getting the list of unsafe characters in a collation using
> ucol_getUnsafeSet() and looking at the properties of the characters.
> I can divide the characters into several classes:
>
> Surrogate characters - It makes sense that leading surrogates are
> unsafe, but why are trailing surrogates unsafe?
>
> Leading characters of contractions - These make sense.
>
> Combining characters - These make sense.
>
> Other - Close to 1000 characters fall into this bucket (even for the
> ROOT collation). What other properties would make a character unsafe?
> Some sample characters from this bucket:
>     U+00C0 - U+00C5, U+00C7 - U+00CF (at quick glance, it looks like
>     most of accented Latin characters are in this bucket)
>     U+0200 - U+021B
>     U+1E00 - U+1E99
>
> When generating upper and lower bounds for a collation based range,
> I trim all trailing unsafe characters from the prefix for the bound.
> Currently I am seeing occasional performance problems because the
> prefix is made entirely of unsafe characters which results in the
> range spanning the entire data set. I'm hoping that I can be a little
> smarter and only trim the last 1 or 2 unsafe characters based on
> which class they fall into. (For example, in the Slovak collation,
> the letter C is unsafe because of the CH contraction. Today, if I am
> given a pattern starting ABCC, I trim it to AB (which can greatly
> expand my search range). However, knowing that C is unsafe because of
> the contraction, I should be able to assert that the first C is safe
> and therefore use ABC as my trimmed prefix.)
>
> Any advice or suggestions would be very welcome.
>
> (ucol_getBound() doesn't work for me because it's not collation
> specific, and I need the prefix as characters, not in sort key form.)
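The reordering example in the reply can be checked outside ICU. The sketch below uses Python's standard unicodedata module as a stand-in, so it reflects the Unicode version bundled with the Python interpreter rather than ICU's data, and the helper names (lead_trail_ccc, in_ccc_set, code_points) are made up for the illustration. It takes lccc and tccc to be the canonical combining classes of the first and last code points of a character's NFD form, which approximates membership in [[:^tccc=0:][:^lccc=0:]].

```python
import unicodedata


def lead_trail_ccc(ch):
    """Return (lccc, tccc): the canonical combining classes of the first
    and last code points of the character's canonical decomposition."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])


def in_ccc_set(ch):
    """Approximate test for membership in [[:^tccc=0:][:^lccc=0:]]."""
    lccc, tccc = lead_trail_ccc(ch)
    return lccc != 0 or tccc != 0


def code_points(s):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


# U+00C0 followed by U+0325: the ring below (ccc 220) moves ahead of the
# grave accent (ccc 230) that comes out of decomposing U+00C0.
print(code_points(unicodedata.normalize("NFD", "\u00C0\u0325")))  # 0041 0325 0300

print(lead_trail_ccc("\u00C0"))  # (0, 230)
print(in_ccc_set("\u00C0"))      # True
print(in_ccc_set("\u00C6"))      # False: U+00C6 has no canonical decomposition
```

The last line matches the gap in the ranges listed in the question: U+00C6 has no canonical decomposition and a combining class of zero, so it falls outside the set while its accented neighbours fall inside it.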