Case and accent insensitive string search

Case and accent insensitive string search

by Doug Doole


I'm trying to set up a string search where a collation, text and search
string are passed in and the position of the match is passed out.

Simplifying the code, I have:

search(collName, text, search)
   ucol = ucol_openFromShortString(collName)
   locale = ucol_getLocaleByType(ucol)
   ubrk = ubrk_open(UBRK_CHARACTER, locale, text)
   usearch = usearch_openFromCollator(search, text, ucol, ubrk)
   return usearch_first(usearch)

When I first tried this, I didn't have the character break iterator and I
was getting incorrect results for accented characters and strength 3
collations. I saw the comments in the ICU documentation and added in the
character break iterator.

Now, however, I'm having trouble with combining characters and strength 1
collations.

If I pass in:
   collName = LEN_S1
   text = 0043 00D4 0054 00C9   (CÔTÉ)
   search = 004F   (O)
the function returns offset 1 as expected.

However if I pass in:
   collName = LEN_S1
   text = 0043 004F 0302 0054 00C9   (CO^TÉ)
   search = 004F   (O)
the search fails and returns -1.

So, can anyone tell me what I'm missing here?


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
icu-support mailing list - icu-support@...
To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-support

Re: Case and accent insensitive string search

by Doug Doole

I've been playing with this a bit more, but I'm still stumped. A couple
things I checked:
- The character break iterator is properly determining the character
boundaries
- The language for the collator doesn't seem to make any difference

So, if anyone has a suggestion on how I can get this working properly, I
would really appreciate it.

Re: Case and accent insensitive string search

by Andy Heninger

On 4/24/07, Doug Doole <doole@...> wrote:
> I've been playing with this a bit more, but I'm still stumped.

It is, I'm afraid, a bug.

There are several known problems with string search in this general
area.  See Trac tickets #5420, #5382, #5024, #4279, #4038, and #3536
at http://bugs.icu-project.org/trac.

The problems, unfortunately, do not have ready fixes.  You will need
to evaluate whether ICU string search can meet your needs in its
current form.

  -- Andy

Re: Case and accent insensitive string search

by Andy Heninger

On 4/25/07, Andy Heninger <andy.heninger@...> wrote:
> On 4/24/07, Doug Doole <doole@...> wrote:
> > I've been playing with this a bit more, but I'm still stumped.
>
> It is, I'm afraid, a bug.
>

Doug wrote
> However if I pass in:
>  collName = LEN_S1
>  text = 0043 004F 0302 0054 00C9   (CO^TÉ)
>  search = 004F   (O)
> the search fails and returns -1.

A question: does it still fail to find the match if you don't use the
character break iterator with strength 1 matching?  If it succeeds
without the iterator, the underlying match is probably failing to
include the combining U+0302 character.

If that's the case, a workaround might be to make use of the break
iterator dependent on strength > 1.

  -- Andy

Re: Case and accent insensitive string search

by Doug Doole

Andy wrote:

> Doug wrote
> > However if I pass in:
> >  collName = LEN_S1
> >  text = 0043 004F 0302 0054 00C9   (CO^TÉ)
> >  search = 004F   (O)
> > the search fails and returns -1.
>
> A question: does it still fail to find the match if you don't use the
> character break iterator with strength 1 matching?  If it succeeds
> without the iterator, the underlying match is probably failing to
> include the combining U+0302 character.
>
> If that's the case, a workaround might be to make use of the break
> iterator dependent on strength > 1.

If I drop the character break iterator, then it does find the matching
character. However it reports the match length as 1 (the combining accent
is not considered part of the match) which causes problems for subsequent
processing that my code needs to do.


Re: Case and accent insensitive string search

by Mark Davis-2

I think a workaround is to
  • use the search without the character break iterator.
  • then use the character break iterator to extend the start and end boundaries to include whole grapheme clusters,
  • then retest that result with the collator set to whatever your strength is,
  • then skip to the next if it doesn't match.
Let me know if that works.
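
For illustration, the four steps above can be sketched in self-contained
C++. Everything in this sketch is a hypothetical stand-in and not ICU API:
toyKey plays the role of the collator (it only understands ASCII letters
and the combining marks U+0300..U+036F), rawFind plays the role of the
search without a break iterator, and extendToClusters plays the role of
the character break iterator.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical stand-ins, NOT ICU APIs. The toy model only knows ASCII
// letters and the combining marks U+0300..U+036F.
struct Match { std::size_t start; std::size_t limit; };

inline bool isCombining(char16_t c) { return c >= 0x0300 && c <= 0x036F; }

// Toy "collation key": strength 1 drops combining marks and folds ASCII
// case, strength 2 keeps the marks, strength 3 and above also keeps case.
std::u16string toyKey(const std::u16string& s, int strength) {
    std::u16string key;
    for (char16_t c : s) {
        if (isCombining(c)) {
            if (strength > 1) key.push_back(c);
            continue;
        }
        if (strength < 3 && c >= u'a' && c <= u'z')
            c = static_cast<char16_t>(c - u'a' + u'A');
        key.push_back(c);
    }
    return key;
}

// Step 1: raw search that ignores character boundaries.
std::optional<Match> rawFind(const std::u16string& text,
                             const std::u16string& pattern,
                             int strength, std::size_t from) {
    const std::u16string want = toyKey(pattern, strength);
    if (want.empty()) return std::nullopt;
    for (std::size_t i = from; i < text.size(); ++i) {
        for (std::size_t len = 1; i + len <= text.size(); ++len) {
            const std::u16string got = toyKey(text.substr(i, len), strength);
            if (got == want) return Match{i, i + len};
            if (got.size() > want.size()) break;  // candidate is too long
        }
    }
    return std::nullopt;
}

// Step 2: widen the match so it covers whole (toy) grapheme clusters.
Match extendToClusters(const std::u16string& text, Match m) {
    while (m.start > 0 && isCombining(text[m.start])) --m.start;
    while (m.limit < text.size() && isCombining(text[m.limit])) ++m.limit;
    return m;
}

// Steps 3 and 4: retest the widened match, skip it if it no longer matches.
std::optional<Match> findWholeClusters(const std::u16string& text,
                                       const std::u16string& pattern,
                                       int strength) {
    const std::u16string want = toyKey(pattern, strength);
    std::size_t from = 0;
    while (auto raw = rawFind(text, pattern, strength, from)) {
        const Match m = extendToClusters(text, *raw);
        if (toyKey(text.substr(m.start, m.limit - m.start), strength) == want)
            return m;
        from = raw->start + 1;  // skip this candidate and keep looking
    }
    return std::nullopt;
}
```

In this toy model, searching 0043 004F 0302 0054 00C9 for 004F gives
start 1 and limit 3 at strength 1, and no match at strength 2, because
the retest rejects the widened candidate.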

Mark

UnicodeString comparison to char*

by Glenn-22

I've got the following code fragment (cut down here), and it works,
but I don't understand why.

    const std::string &DesiredTimezone;
    UnicodeString TimeZoneID;
    TimeZone *timezone = TimeZone::createTimeZone (DesiredTimezone.c_str());
    if (timezone->getID (TimeZoneID) != DesiredTimezone.c_str())
        {
        printf ("not the selected timezone\n");
        }

What I don't understand is how the returned value from TimeZone::getID(),
which is a reference to a UnicodeString, is correctly being compared to
the "char*" string (using the obvious string-comparison semantics).
Looking in the UnicodeString Class Reference page, I don't see operators
that would perform the string comparison I want and which is, in fact,
actually happening.  I could just accept this, but I want to know how
it works so I know I'm not making some serious mistake.  What am I missing?


Re: UnicodeString comparison to char*

by david_n_bertoni

icu-support-bounces@... wrote on 04/27/2007 10:39:24 AM:

> I've got the following code fragment (cut down here), and it works,
> but I don't understand why.
>
>     const std::string &DesiredTimezone;
>     UnicodeString TimeZoneID;
>     TimeZone *timezone = TimeZone::createTimeZone (DesiredTimezone.c_str());
>     if (timezone->getID (TimeZoneID) != DesiredTimezone.c_str())
>         {
>         printf ("not the selected timezone\n");
>         }
>
> What I don't understand is how the returned value from TimeZone::getID(),
> which is a reference to a UnicodeString, is correctly being compared to
> the "char*" string (using the obvious string-comparison semantics).
> Looking in the UnicodeString Class Reference page, I don't see operators
> that would perform the string comparison I want and which is, in fact,
> actually happening.  I could just accept this, but I want to know how
> it works so I know I'm not making some serious mistake.  What am I missing?

UnicodeString has a constructor that takes a const char*, and there's an
operator== for two UnicodeString instances, so the compiler constructs a
temporary to do the comparison.
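
A minimal sketch of that mechanism, using a hypothetical Text class as a
stand-in for UnicodeString (the class, its member names, and the
byte-for-byte widening are assumptions made for illustration; the real
constructor performs a codepage conversion):

```cpp
#include <string>

// Hypothetical stand-in for UnicodeString. Only the two features that
// matter for the comparison are modelled.
class Text {
public:
    // Not marked explicit, so the compiler may use this constructor for
    // an implicit conversion from const char*.
    // Assumption: each byte is widened one-to-one, which is only
    // reasonable for plain ASCII input.
    Text(const char* narrow) {
        for (const char* p = narrow; *p != '\0'; ++p)
            units_.push_back(
                static_cast<char16_t>(static_cast<unsigned char>(*p)));
    }

    bool operator==(const Text& other) const { return units_ == other.units_; }
    bool operator!=(const Text& other) const { return !(*this == other); }

private:
    std::u16string units_;
};

// Mirrors the shape of the original test: a Text compared with a char*.
bool sameId(const Text& id, const char* wanted) {
    // 'wanted' is converted to a temporary Text, then operator!= runs.
    return !(id != wanted);
}
```

Marking such a constructor explicit would turn the comparison into a
compile error, which is one way to surface these hidden conversions in
code you control.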

Dave

Re: UnicodeString comparison to char*

by Bob Eaton

From: <david_n_bertoni@...>
[...]

>> What I don't understand is how the returned value from TimeZone::getID(),
>> which is a reference to a UnicodeString, is correctly being compared to
>> the "char*" string (using the obvious string-comparison semantics).
>> Looking in the UnicodeString Class Reference page, I don't see operators
>> that would perform the string comparison I want and which is, in fact,
>> actually happening.  I could just accept this, but I want to know how
>> it works so I know I'm not making some serious mistake.  What am I missing?
>
> UnicodeString has a constructor that takes a const char*, and there's an
> operator== for two UnicodeString instances, so the compiler constructs a
> temporary to do the comparison.

But what does it do to the narrow const char* to make it 'wide'? Use the
default system code page? Sounds like a constructor that shouldn't exist to
avoid unintended side effects...

Bob


Re: UnicodeString comparison to char*

by George Rhoten

> But what does it do to the narrow const char* to make it 'wide'? Use the
> default system code page?

Yes, that is how char * strings are converted to UChar * strings in the
code snippet. The string is also converted to UTF-16 on the createTimeZone
line.

If you use "#define UCONFIG_NO_CONVERSION 1" in your application, you can
ensure that you're not using codepage conversion. This will also make it
more difficult to use ICU with the non-Unicode OS functions, but it will
help you find the lines that are using codepage conversion, so you can
reduce the number of times you reconvert between char * and UChar *, or
use alternate, faster ICU functions to do the conversion.


Re: UnicodeString comparison to char*

by david_n_bertoni

icu-support-bounces@... wrote on 04/27/2007 08:35:28 PM:

> From: <david_n_bertoni@...>
> [...]
> >> What I don't understand is how the returned value from TimeZone::getID(),
> >> which is a reference to a UnicodeString, is correctly being compared to
> >> the "char*" string (using the obvious string-comparison semantics).
> >> Looking in the UnicodeString Class Reference page, I don't see operators
> >> that would perform the string comparison I want and which is, in fact,
> >> actually happening.  I could just accept this, but I want to know how
> >> it works so I know I'm not making some serious mistake.  What am I missing?
> >
> > UnicodeString has a constructor that takes a const char*, and there's an
> > operator== for two UnicodeString instances, so the compiler constructs a
> > temporary to do the comparison.
>
> But what does it do to the narrow const char* to make it 'wide'? Use the
> default system code page? Sounds like a constructor that shouldn't exist to
> avoid unintended side effects...

Well, the documentation makes this pretty clear:

  /**
   * char* constructor.
   * @param codepageData an array of bytes, null-terminated
   * @param codepage the encoding of <TT>codepageData</TT>.  The special
   * value 0 for <TT>codepage</TT> indicates that the text is in the
   * platform's default codepage.
   *
   * If <code>codepage</code> is an empty string (<code>""</code>),
   * then a simple conversion is performed on the codepage-invariant
   * subset ("invariant characters") of the platform encoding. See utypes.h.
   * Recommendation: For invariant-character strings use the constructor
   * UnicodeString(const char *src, int32_t length, enum EInvariant inv)
   * because it avoids object code dependencies of UnicodeString on
   * the conversion code.
   *
   * @stable ICU 2.0
   */

And, as George said, you can avoid implicit conversions if you want to.

Dave

Re: Case and accent insensitive string search

by Andy Heninger

On 4/27/07, Mark Davis <mark.davis@...> wrote:
> I think a workaround is to
>
> use the search without the character break iterator.
> then use the character break iterator to extend the start and end boundaries
> to include whole grapheme clusters,
> then retest that result with the collator set to whatever your strength is,
> then skip to the next if it doesn't match.
> Let me know if that works.
>

Mark, a question about what StringSearch should really be doing (as
opposed to what it does now):

In what cases should a match not extend to a grapheme cluster boundary
anyhow, without the extra complication of a user-supplied break
iterator?  Something along the lines of: after finding a match,
continue to iterate collation elements in the string until a grapheme
cluster boundary is reached, and any CEs encountered along the way
had better be ignorable.

  -- Andy


Re: Case and accent insensitive string search

by Mark Davis-2

One may or may not want to test for that boundary.

Vladimir and I have been discussing some different approaches, and he'll see if we can mock something up to see if it is worth pursuing.

Mark

Re: Case and accent insensitive string search

by Andy Heninger

On 5/1/07, Mark Davis <mark.davis@...> wrote:
> One may or may not want to test for that boundary.
>
> Vladimir and I have been discussing some different approaches, and he'll see
> if we can mock something up to see if it is worth pursuing.
>

I've been poking around more in String Search - the  bugs are becoming
more of an issue here.

But, separate from the bugs, the choice of match options provided now
seems way too confusing.  I can't imagine ordinary developers sorting
them out without an excess of pain and annoyance.

Maybe something along these lines would produce usable, non-surprising
results without needing a bunch of options:

Strength 1 matches always absorb all combining stuff following the
match, up to the next base character.

Strength > 1 matches require a perfect collation element match through
all combining stuff at the end of the match in the text being
searched.  The final grapheme cluster of the match is considered as a
whole - it either matches the pattern entirely, or the match fails.
No more tearing it apart and reordering it looking for a match.

Get rid of the canonical vs exact match distinction that exists in the
current API.

Get rid of the break iterator option.

Make a pattern that starts with unattached combining marks either an
error or a warning.  It's probably not going to do what the user
expected - it won't match normal (attached) combining marks from the
text being searched.

If we could redo the API, I'd be inclined towards something much much
simpler - forward searching only, no retained index position, results
returned only as start and limit indexes (compatible with UText, could
be non-UTF-16), nothing explicit for iteration, overlapping vs.
non-overlapping and the like - just a starting index parameter on the
search function that the caller provides & manages.
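
As a rough sketch of what such a stateless interface might look like
(the names Range, findFrom and findAll are hypothetical, and the exact
code-unit comparison done by std::u16string::find is only a placeholder
for collation-aware matching):

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Result is just a pair of indexes: [start, limit).
struct Range { std::size_t start; std::size_t limit; };

// Hypothetical stateless, forward-only search. The caller supplies the
// starting index; nothing is remembered between calls.
// Placeholder matching: exact code-unit comparison, not collation.
std::optional<Range> findFrom(const std::u16string& text,
                              const std::u16string& pattern,
                              std::size_t startIndex) {
    if (pattern.empty()) return std::nullopt;
    const std::size_t pos = text.find(pattern, startIndex);
    if (pos == std::u16string::npos) return std::nullopt;
    return Range{pos, pos + pattern.size()};
}

// Iteration policy lives entirely in the caller: overlapping or not is
// decided by how the next starting index is chosen.
std::vector<Range> findAll(const std::u16string& text,
                           const std::u16string& pattern,
                           bool overlapping) {
    std::vector<Range> out;
    std::size_t from = 0;
    while (auto r = findFrom(text, pattern, from)) {
        out.push_back(*r);
        from = overlapping ? r->start + 1 : r->limit;
    }
    return out;
}
```

With this shape the overlapping vs. non-overlapping choice is a single
line in the caller's loop instead of an option on the search object.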

  -- Andy

Re: Case and accent insensitive string search

by Vladimir Weinstein

I agree with your analysis, Andy. It appears that the current API is
too complicated and that it should be redone. Another point is that
the only reason for backward iteration over CEs in collation is
string search's need to go back. I'm thinking about this in a
background thread and I hope to have some ideas figured out soon.

Regards,
v.

Re: Case and accent insensitive string search

by Doug Doole

Mark, I finally got a chance to try this and it seems to work. Thanks!

A bit of follow-up: Why is the last comparison needed? (I'm concerned about
performance.) Can you think of a case where your suggestion would fail
without the extra check?

> I think a workaround is to
> use the search without the character break iterator.
> then use the character break iterator to extend the start and end
> boundaries to include whole grapheme clusters,
> then retest that result with the collator set to whatever your strength is,
> then skip to the next if it doesn't match.
> Let me know if that works.
>
> Mark

> On 4/27/07, Doug Doole <doole@...> wrote:
> Andy wrote:
> > Doug wrote
> > > However if I pass in:
> > >  collName = LEN_S1
> > >  text = 0043 004F 0302 0054 00C9   (CO^TÉ)
> > >  search = 004F   (O)
> > > the search fails and returns -1.
> >
> > A question: does it still fail to find the match if you don't use the
> > character break iterator with strength 1 matching? That would be the
> > case if the underlying match failed to include the combining mark 0302
> > character.
> >
> > If that's the case, a workaround might be to make use of the break
> > iterator dependent on strength > 1.
>
> If I drop the character break iterator, then it does find the matching
> character. However it reports the match length as 1 (the combining accent
> is not considered part of the match) which causes problems for subsequent
> processing that my code needs to do.



Re: Case and accent insensitive string search

by Mark Davis-2

Suppose that you have primary+secondary on, and search for "abc". What you find is "...[abc]^..." (where the ^ is some accent, and [...] means what you found). When you extend the boundary to a grapheme cluster, you get [abc^], but that is not a match with the original. The same thing could happen with Korean. The cost of comparing the substring again should be a small percentage of the total, for this workaround.

Mark

(Also, some of us are now taking a look at this, and we should be able to come up with a more reliable (and higher performance) mechanism.)
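
A minimal sketch of the workaround and of the case the final retest guards against. It is written in Python with only the standard library, not ICU, so everything in it is an assumption made for illustration: `primary_key` and `secondary_key` are crude stand-ins for collation strengths 1 and 2, `raw_search` stands in for a search with no break iterator, and a grapheme cluster is approximated as a base character plus the combining marks that follow it.

```python
import unicodedata

def primary_key(s):
    # Stand-in for strength 1: drop combining marks and case (not ICU collation).
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c)).casefold()

def secondary_key(s):
    # Stand-in for strength 2: keep accents, drop case.
    return unicodedata.normalize("NFD", s).casefold()

def raw_search(text, target, key, begin):
    # Stand-in for searching with no break iterator: earliest start, then the
    # shortest span whose key equals the target, even if it splits a cluster.
    for start in range(begin, len(text)):
        for end in range(start + 1, len(text) + 1):
            if key(text[start:end]) == target:
                return start, end
    return None

def extend_to_clusters(text, start, end):
    # Widen the span so it neither starts nor ends on a combining mark.
    while start > 0 and unicodedata.combining(text[start]):
        start -= 1
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    return start, end

def search(text, pattern, key):
    target = key(pattern)
    pos = 0
    while True:
        raw = raw_search(text, target, key, pos)
        if raw is None:
            return None
        start, end = extend_to_clusters(text, *raw)
        # The retest: widening may have pulled in an accent that matters
        # at this strength, in which case the candidate is skipped.
        if key(text[start:end]) == target:
            return start, end
        pos = raw[0] + 1

print(search("CO\u0302T\u00c9", "O", primary_key))   # (1, 3)
print(search("abc\u0302", "abc", secondary_key))     # None
```

The second call is the "...[abc]^..." case: the raw match is [abc], widening gives [abc^], and the retest rejects it because the accent is significant when secondary differences are on.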


Re: Case and accent insensitive string search

by Doug Doole

> Suppose that you have primary+secondary on, and search for "abc".
> What you find is "...[abc]^..." (where the ^ is some accent, and
> [...] means what you found). When you extend the boundary to a
> grapheme cluster, you get [abc^], but that is not a match with the
> original. The same thing could happen with Korean. The cost of
> comparing the substring again should be a small percentage of the
> total, for this workaround.

Geez, that's obvious -  apparently my brain is on vacation without me
today.

> (Also, some of us are now taking a look at this, and we should be
> able to come up with a more reliable (and higher performance) mechanism.)

That's good to hear.

However I'm stuck with the broken code for now. You wouldn't happen to know
a good workaround for the bugs where leading/trailing characters cannot be
matched (see bugs 5024, 5420)? (It seems that adding a space at the beginning
and end of the text to be searched solves the problem, but that's a big
pain.)



Re: Case and accent insensitive string search

by Mark Davis-2

If the only remaining failure cases are at the beginning and end of the searched-in text, I'd suggest -- as a workaround --
  • if the searched-in text is reasonably small, then add a space at the front and end, as you say.
  • if it is large, then just take a chunk at the start, say 2x the size of the searched-for text, add a space to it and search. Do the same at the end.
Mark
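
The two cases above can be sketched as a wrapper around whatever search is in use. This is Python rather than ICU, and the names are hypothetical: `search` is any callable taking (text, pattern) and returning (start, end) or None, and `edge_blind_find` is only a stand-in that imitates a search which misses matches touching either end of the text.

```python
def edge_blind_find(text, pattern):
    # Stand-in for the buggy behaviour: ignore matches at the very start or end.
    i = text.find(pattern, 1)
    if i == -1 or i + len(pattern) >= len(text):
        return None
    return i, i + len(pattern)

def search_padded(text, pattern, search):
    # Small text: pad both ends with a space, search, then map offsets back.
    hit = search(" " + text + " ", pattern)
    if hit is None:
        return None
    start, end = hit
    # Shift by the one leading pad character and clamp to the original text.
    return max(start - 1, 0), min(end - 1, len(text))

def search_large(text, pattern, search):
    # Large text: pad only a chunk at each end (2x the pattern length) and
    # search the whole text unpadded, so the text itself is never copied.
    # Caveat: slicing at a fixed length can split a grapheme cluster.
    chunk = 2 * len(pattern)
    candidates = []
    head = search_padded(text[:chunk], pattern, search)
    if head is not None:
        candidates.append(head)
    body = search(text, pattern)
    if body is not None:
        candidates.append(body)
    offset = max(len(text) - chunk, 0)
    tail = search_padded(text[offset:], pattern, search)
    if tail is not None:
        candidates.append((tail[0] + offset, tail[1] + offset))
    # Return the earliest match found by any of the three searches.
    return min(candidates) if candidates else None
```

Collecting the three results and taking the earliest keeps the wrapper from preferring a later match in one region over an earlier one in another.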


Re: Case and accent insensitive string search

by Andy Heninger

On 5/10/07, Doug Doole <doole@...> wrote:
 [snip]
> However I'm stuck with the broken code for now. You wouldn't happen to know
> a good workaround for the bugs where leading/trailing characters cannot be
> matched (see bugs 5024, 5420)? (It seems that adding a space at the beginning
> and end of the text to be searched solves the problem, but that's a big
> pain.)
>

I'm continuing to look into alternate C/C++ search implementations
that should eliminate the problems with the existing one.  But there
are some things going on that I don't understand having to do with
expansions, matching the German ß with ss, for example.   The issues
here do affect the current implementation and are not restricted to
the ends of the string.  I'll let you know what comes of it.

  -- Andy
