On Mon, Apr 2, 2012 at 1:25 PM, Richard Hipp <drh@...> wrote:
> On Mon, Apr 2, 2012 at 2:03 PM, Simon Slavin <slavins@...> wrote:
>> I think ... a higher priority than that would be handling Unicode
>> correctly. And having Unicode support would be useful in writing the code
>> which handles dates.
> size of SQLite library: approx 500 KB
> size of ICU library: approx 21,919 KB
> The ICU library (needed to handle Unicode "correctly") is over 40x larger
> than SQLite. Can you understand then why we don't want to make SQLite
> dependent upon ICU?
> If you really need correct ICU support, SQLite will optionally link with
> ICU and use it. But *requiring* SQLite to link against ICU is a
> deal-breaker for many users.
Also, Unicode collation is typically orders of magnitude slower than
US-ASCII collation. This comes up a lot in other contexts,
particularly as the various OSes have begun defaulting to Unicode
locales. I've seen ls(1) on directories with millions of files run as
fast as the output device permits in the C locale (in less than 1
second on tmpfs), but take many minutes in a UTF-8 locale, and that's
without any use of normalization. But mostly this is a result of
Unicode collation in libc being awful.
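
To make the C-locale vs. UTF-8-locale difference concrete, here is a
minimal sketch in Python (my choice for brevity, not anything ls(1) or
SQLite actually runs). sort_bytewise() and sort_with_locale() are
hypothetical helper names, and "en_US.UTF-8" is only an assumed locale
name that may not be installed on a given system. The first helper
orders names by their raw UTF-8 bytes, which is what a C-locale
strcmp() does; the second goes through libc collation via
locale.strxfrm().

```python
import locale


def sort_bytewise(names):
    """Order names by raw UTF-8 bytes, as a C-locale strcmp() would."""
    return sorted(names, key=lambda s: s.encode("utf-8"))


def sort_with_locale(names, locale_name):
    """Order names using libc collation for locale_name.

    Returns None if the locale cannot be selected (assumption: the
    caller treats that as "locale not installed").
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm() builds a collation key per string via libc.
        return sorted(names, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


names = ["b.txt", "a.txt", "B.txt", "A.txt"]
print(sort_bytewise(names))
# The result below depends on the libc and on which locales are installed.
print(sort_with_locale(names, "en_US.UTF-8"))
```

The byte-wise path needs no tables at all, while the locale path has to
build a collation key for every name. How large the gap is depends
entirely on the libc in question.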
The OpenSolaris u8_textprep code is designed to make u8_str*cmp()
really fast, though not quite as fast as strcmp() in the C locale,
both when strings are mostly ASCII and when they are not. That is
because u8_textprep does no memory allocation for
normalization-insensitive comparison and has a fast path for comparing
substrings of two or more ASCII codepoints. This is the main reason
that I'd recommend