Issue with SIMILAR TO and UTF8 on 2.5 Beta 2 (and 1)

View: New views
2 Messages — Rating Filter:   Alert me  

Issue with SIMILAR TO and UTF8 on 2.5 Beta 2 (and 1)

by Mithandir :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Greetings,

I've encountered a problem using SIMILAR TO syntax on UTF-8 fields. I
searched through the tracker but couldn't find this issue.

What seems to be happening is that when using SIMILAR TO on UTF-8 fields
doesn't get the right pattern length.
I've made a testcase with a bit of data in it that you can get from:
http://www.fantasycomic.com/t/testsimilarto.zip (both 2.5beta2 database
as gbak backup are in there).

In that testcase you'll find a table created with:

CREATE TABLE TEST (
 ANSIFIELD Varchar(100),
 UTF8FIELD Varchar(100) CHARACTER SET UTF8,
 UNICODECIFIELD Varchar(100) CHARACTER SET UTF8 COLLATE UNICODE_CI
);

This table was then filled with a number of records, with the same value
going into each three fields.

The default character set for the database is ISO8859_1. I have done all
my tests using flamerobin 0.8.6.1652 Unicode, connection charset was
ISO8859_1.

Now on that table if you do

SELECT * FROM Test WHERE Utf8Field SIMILAR TO 'DELL %' ;

You'll get 16 results. Amongst those results are incorrect matches like
for example 'DE SINGEL'. All 16 matches start with 'DE'

Change the query to:
SELECT * FROM Test WHERE Utf8Field SIMILAR TO 'DE %' ;

and you'll get 39 results, all starting with "D" but some incorrect
matches like 'DHR. F. BERENDE'.

Now, these same queries on AnsieField (which has the same data but
different character set) will give 3 and 6 results respectively and all
correct.
If you change the query to use LIKE instead of SIMILAR TO on Utf8Field
then you get the correct results.

To me this looks like some of the substring matching in the regexp
engine is using the wrong character set to calculate string sizes, so
instead of 'DELL %' it sees 'DE%' (because the 5 character 'DELL '
literal string is seen as a 2-char one).

I've tested this issue in Firebird 2.5 Beta 1 and Beta 2 running as
superserver on windows vista, 32bit. If you wish I can run the same
tests on xp32 bit and xp 64 bit.

Regards,

Mathias Dellaert


------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference
Firebird-Devel mailing list, web interface at https://lists.sourceforge.net/lists/listinfo/firebird-devel

Re: Issue with SIMILAR TO and UTF8 on 2.5 Beta 2 (and 1)

by Adriano dos Santos Fernandes-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Mithandir wrote:
> Greetings,
>
> I've encountered a problem using SIMILAR TO syntax on UTF-8 fields. I
> searched through the tracker but couldn't find this issue.
>
Thanks.

I've committed a fix which is going to be in RC2.

Please test it on tomorrow snapshot.


Adriano


------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference
Firebird-Devel mailing list, web interface at https://lists.sourceforge.net/lists/listinfo/firebird-devel