Re: indexing and searching unicode diacritics

Re: indexing and searching unicode diacritics

by StMarcus

Hi there,
nearly two months ago I posted some questions about full-text search
and diacritical marks. These characters are part of the Unicode standard,
but I cannot search for them with eXist 1.2.4: I always get errors,
apparently because the characters are not "standard".
I tried searching for them with the Java UI and also with xpl....
Will this be possible with the new full-text search feature in 1.4? Or
is there any other way to make those characters searchable?
Thanks for any response.

Regards, Marcus

_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: indexing and searching unicode diacritics

by Michael Sokolov-2

I was able to get this working (partially) by implementing a custom
"analyzer" for the Lucene index that strips diacritics; it is based on the
Lucene ISOLatin1AccentFilter class. This isn't part of the eXist
distribution, however: you'd have to install the code and build a jar
yourself (or incorporate it into a custom eXist build).

Even with this, there are some issues: you can no longer search
*with* diacritics (you have to strip them out of search terms), and eXist's
highlighting won't find matches correctly. Another issue is that the filter
mentioned above only works with Latin-1 terms (as the name indicates), and
not with diacritics found in other code blocks (Eastern European ones,
basically). If, in spite of all the disclaimers, you're still interested, I
can post what I have as a patch so you can grab it.

-Mike
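
To illustrate the idea of stripping diacritics from both indexed terms and
search terms, here is a minimal sketch. It is not the patch mentioned above
and does not use Lucene's ISOLatin1AccentFilter; it swaps in the JDK's
java.text.Normalizer instead, so it is not limited to Latin-1. The class
name DiacriticFolding and its methods are made up for this example.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper, not part of eXist or Lucene.
final class DiacriticFolding {
    // After NFD decomposition, diacritics are separate combining marks
    // (Unicode general category M).
    private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{M}+");

    /** Decompose the text to NFD, then drop all combining marks. */
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    /** Compare an indexed term and a search term, ignoring diacritics and case. */
    static boolean matches(String indexedTerm, String searchTerm) {
        return fold(indexedTerm).equalsIgnoreCase(fold(searchTerm));
    }

    public static void main(String[] args) {
        // \u00e9 is e with acute; \u0159 is r with caron; \u00e1 is a with acute.
        System.out.println(fold("caf\u00e9"));                     // cafe
        System.out.println(matches("Dvo\u0159\u00e1k", "dvorak")); // true
    }
}
```

Because the same fold() is applied on both sides, a search term typed with
diacritics still matches, but a search can no longer tell "cafe" from "café",
which is the trade-off described above. Letters that have no canonical
decomposition, such as "ø" or "ł", pass through unchanged.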

Re: indexing and searching unicode diacritics

by Michael Sokolov-2

Y'know I realized after re-reading your post that I didn't really understand
whether you are trying to search for characters with diacritics,
specifically, or to ignore them (which was my use case).

So which is it?

-Mike

Re: indexing and searching unicode diacritics

by StMarcus

Hi Mike,

thanks for your answer.
Our texts include diacritics, and we must be able to search for them and for terms that include them!
My problem is that the old eXist-db (1.2.4) does not seem to be able to handle those characters in XQuery. When I enter search terms in UTF-8 that include diacritics, I always get errors, even in the Java client!

So if Lucene can handle those characters, both indexing and searching, that would be great! But for two months I have not been getting any answers to that question :(

So thanks for your reply.
Kind regards, Marcus


Re: indexing and searching unicode diacritics

by Wolfgang Meier-2

Hi Marcus,

Did you try searching for diacritics with Lucene in eXist 1.4? But you
said you are getting XML parsing and/or XQuery errors? If so, it doesn't
seem to be an index issue alone.

Well, it would be good if you could post an actual example, something
we can try to store and query. Otherwise it will be difficult to give
an answer.

Wolfgang

Re: indexing and searching unicode diacritics

by StMarcus

Hi Wolfgang,

I'm still using eXist 1.2.4 and haven't tried it under 1.4.
I will post an actual example tonight!
I don't know if my mail program will accept UTF-8, so I may have to
attach some files.

Until then, thanks for the response,
Regards, Marcus
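
On the worry about whether a mail program will carry UTF-8: one possible
workaround, sketched here as an assumption rather than something proposed in
this thread, is to rewrite the non-ASCII characters of a sample as XML
numeric character references before pasting it. Such references are valid in
XML documents and in XQuery string literals. The class name AsciiSafeXml and
the method escapeNonAscii are made up for this example.

```java
// Hypothetical helper, not part of eXist.
final class AsciiSafeXml {
    /**
     * Replace every non-ASCII code point with a numeric character reference
     * such as &#xE9;. ASCII characters, including markup, are left untouched,
     * so the input is assumed to be well-formed XML text already.
     */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is e with acute, written as an escape to keep this file ASCII.
        System.out.println(escapeNonAscii("<w>caf\u00e9</w>")); // <w>caf&#xE9;</w>
    }
}
```

The escaped form also shows exactly which code points a sample contains, for
instance whether an accented letter is one precomposed character or a base
letter followed by a combining mark.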



