lucene farsi problem

lucene farsi problem

by esra

hi,

I am using Lucene's "IndexSearcher" to search an XML document by keyword; the keywords contain Farsi text.
When searching I use ranges like

آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی

When I search the "د-ژ" range, the results are wrong: they are the results of the "س-ظ" range.

For example, when I search the "د-ژ" range, one of the results is "ساب ووفر". This result is also shown in the result list of the "س-ظ" range, which is the correct range.

Since IndexSearcher uses the "compareTo" method, and this method compares by Unicode value, I looked up the code points of the characters:

د=U+62F
ژ = U+698
and the first letter of "ساب ووفر " is  س = U+633

Do you have any idea how to solve this problem? There are analyzers for different languages.
Would one be useful here, and if so, do you know where to find a Farsi analyzer?

I would be glad if you could help.

thanks ,

Esra

Re: lucene farsi problem

by Grant Ingersoll-6

What Analyzer are you using?  You might try looking in Luke to see  
what is in your index, etc.  It also isn't clear to me what your  
documents look like.

As for a Farsi analyzer, I would Google "Farsi analyzer Lucene" and  
see if you can find anything.  Otherwise, you will have to write your  
own (and donate it????)

-Grant

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


Re: lucene farsi problem

by esra

Hi,
Thanks for your reply.
I am using StandardAnalyzer now, and my XML document looks like this:

<keyword><![CDATA[ساب ووفر]]></keyword>
      <description><![CDATA[یک ووفر که در محفظه ای جدا از سایر درایور ها قرار دارد تا صدایی با باس فوق العاده پایین تولید کند. ]]></description>
   
I googled for a Farsi analyzer and found nothing. I am also not sure whether it would solve my problem.

Thanks,

Esra


Re: lucene farsi problem

by Grant Ingersoll-6

I am not sure how StandardAnalyzer will perform on Farsi.  The thing
to do now would be to get Luke and have a look at the actual document
that matches and see what its tokens look like.  You might also try
using the explain() method to see why that document matches.

Also, are you sure you are loading the file w/ the proper encodings,  
etc?

-Grant



RE: lucene farsi problem

by Steven A Rowe

Hi Esra,

Caveat: I don't speak, read, write, or dream in Farsi - I just know that it mostly shares its orthography with Arabic, and that they are both written and read right-to-left.

How are you constructing the queries?  Using QueryParser?  If so, then I suspect the problem is that you intend the range you supply to be read entirely right-to-left, but Lucene instead reads it left-to-right.  Have you tried using e.g. "د-ژ" instead of "د-ژ"?  (That is, placing the lower valued term on the left instead of the right.)

AFAICT, RangeFilter (called from ConstantScoreRangeQuery, which is called from QueryParser) does not test whether lowerTerm is in fact lower than upperTerm.  If it turns out that the problem is simply one of order, it might make sense to modify RangeFilter so that it flips them when lowerTerm > upperTerm.
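
The flip suggested above can be sketched in plain Java. This is only an illustration under assumptions: `RangeBounds` and `ordered` are hypothetical names, not Lucene classes, and the bounds are assumed to be compared with `String.compareTo`, as discussed in this thread.

```java
/** Hypothetical helper (not part of Lucene): puts two range bounds in order. */
final class RangeBounds {

    /** Returns {lower, upper} such that lower.compareTo(upper) <= 0. */
    static String[] ordered(String a, String b) {
        // String.compareTo compares UTF-16 code units, so for these letters
        // the result follows their code point values.
        return a.compareTo(b) <= 0 ? new String[] {a, b} : new String[] {b, a};
    }

    public static void main(String[] args) {
        // U+0698 is passed first and U+062F second, so the bounds get swapped.
        String[] bounds = ordered("\u0698", "\u062F");
        System.out.printf("lower=U+%04X upper=U+%04X%n",
                (int) bounds[0].charAt(0), (int) bounds[1].charAt(0));
        // prints lower=U+062F upper=U+0698
    }
}
```

Writing the letters as escapes sidesteps the right-to-left editing problem, because the order of the two arguments is then unambiguous in the source.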

Steve



RE: lucene farsi problem

by Steven A Rowe

On 04/30/2008 at 12:50 PM, Steven A Rowe wrote:

> Caveat: I don't speak, read, write, or dream in Farsi - I
> just know that it mostly shares its orthography with Arabic,
> and that they are both written and read right-to-left.
>
> How are you constructing the queries?  Using QueryParser?  If
> so, then I suspect the problem is that you intend the range
> you supply to be read entirely right-to-left, but Lucene
> instead reads it left-to-right.  Have you tried using e.g.
> "د-ژ" instead of "د-ژ"?  (That is, placing the lower valued
> term on the left instead of the right.)

Sigh - can't edit RTL text - the example should be (hoping it doesn't get reversed again):

"ژ-د" instead of "د-ژ" (reversing the order of the lower and upper terms)




RE: lucene farsi problem

by esra

Hi Steve,

Thanks for your reply. I know Farsi is written and read right-to-left.
I am using the RangeQuery class. Its rewrite(IndexReader reader) method uses compareTo to decide whether a word is in the range, and that comparison is made on Unicode values.

When searching the "د-ژ" range, the lowerTerm is "د" and the upperTerm is "ژ".
For the result "ساب ووفر", the comparison takes the first letter, س, and is done on that letter.

 د=U+62F
 ژ = U+698
 and the first letter of "ساب ووفر " is  س = U+633

Esra


Re: lucene farsi problem

by esra

Hi,

The document's encoding is "UTF-8".

I tried the explain() method, and the result for the "د-ژ" range search is:

  fieldWeight(keywordIndex:ساب ووفر in 0), product of:
  1.0 = tf(termFreq(keywordIndex:ساب ووفر)=1)
  0.30685282 = idf(docFreq=1)
  1.0 = fieldNorm(field=keywordIndex, doc=0)

Here the keywordIndex value is "ساب ووفر".

I also installed Luke ("luke.jnlp"), but I don't know what to check with it.

Thanks,

Esra



RE: lucene farsi problem

by Steven A Rowe

Hi Esra,

Going back to the original problem statement, I see something that looks illogical to me - please correct me if I'm wrong:

On Apr 30, 2008, at 3:21 AM, esra wrote:

> i am using lucene's "IndexSearcher" to search the given xml by
> keyword which contains farsi information.
> while searching i use ranges like
>
> آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
>
> when i do search for  "د-ژ"  range the results are wrong , they
> are the results of  " س-ظ "range.
>
> for example when i do search for "د-ژ"  one of the the results is
> "ساب ووفر", this result also shown on the " س-ظ " range's result
> list which is the corret range.
>
> As IndexSearcher use "compareTo" method and this method uses
> unicodes for comparing, i found the unicodes of the characters.
>
> د=U+62F
> ژ = U+698
> and the first letter of "ساب ووفر " is  س = U+633

It appears to me that *both* the "د-ژ" range [ U+062F - U+0698 ] and the "س-ظ" range [ U+0633 - U+0638 ] contain the first letter of "ساب ووفر", which is "س" = U+0633.  

You stated that U+0633 should be contained in the [ U+0633 - U+0638 ] range - I agree - but why do you think U+0633 should not be contained in the [ U+062F - U+0698 ] range?

In other words, it looks to me like your problem is not a problem at all.
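
The containment argument above can be checked with a few lines of plain Java. This is a minimal sketch, assuming whole terms are compared with `String.compareTo` and both bounds are inclusive; `CodePointRange` is a hypothetical helper, not Lucene's range query code.

```java
/** Hypothetical helper: inclusive range test using String.compareTo. */
final class CodePointRange {

    /** True if lower <= term <= upper in String.compareTo order. */
    static boolean contains(String lower, String upper, String term) {
        return lower.compareTo(term) <= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // The keyword from the thread; its first letter is U+0633.
        String term = "\u0633\u0627\u0628 \u0648\u0648\u0641\u0631";

        // Range U+062F to U+0698, as stated in the original post.
        System.out.println(contains("\u062F", "\u0698", term)); // prints true

        // Range U+0633 to U+0638.
        System.out.println(contains("\u0633", "\u0638", term)); // prints true
    }
}
```

With the code points given in the original post, the keyword falls inside both ranges, which matches the results reported there.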

Steve

Re: lucene farsi problem

by Grant Ingersoll-6


On May 1, 2008, at 4:36 AM, esra wrote:

> i also  installed the "luke.jnlp"  but i don't know what to check by  
> Luke.
>


http://wiki.apache.org/lucene-java/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71

Luke can be used to view your index.  Not saying it is your problem  
here, but often times when I get back results that "seem" incorrect,  
the first thing I do is go look at my index using Luke, and compare  
the "incorrect" document with what is in the query to see where the  
(mis)match is occurring.   Usually, this analysis shows that my  
document/query is not what I thought it was.

Luke can browse documents and parse queries, amongst other useful  
things.






RE: lucene farsi problem

by esra

Hi Steven,

Sorry, I made a mistake. The code points are:

> د=U+62F
> ژ = U+632
> and the first letter of "ساب ووفر " is  س = U+633

You can also check them here: http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html

Esra


RE: lucene farsi problem

by Steven A Rowe

Hi Esra,

I still think you're wrong :).

On 05/02/2008 at 9:31 AM, esra wrote:
> > ژ = U+632

According to the website you linked to, the above character, which has three dots over it, is named "zhe", and its Unicode code point is U+698.  (I had to increase the font size to see the three dots.)

I think you are confusing "ژ"/"zhe"/U+698 with the letter "ز"/"ze"/U+632, which has just one dot over it.

Unless you were mistaken in all of your emails when you included the character "ژ"/"zhe" instead of "ز"/"ze", then what I said in my previous email still stands: there is no problem here.
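
Because the two letters differ only by their dots and are easy to confuse on screen, one way to settle which character a string really contains is to print its code points. The following is a small sketch using only the JDK; `CodePointDump` and `describe` are hypothetical names chosen for illustration.

```java
/** Hypothetical helper: shows the code point and Unicode name of a character. */
final class CodePointDump {

    /** Formats one code point as "U+XXXX NAME". */
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // The four letters discussed in the thread, given numerically so that
        // no editor or mail client can swap or reorder them.
        int[] letters = {0x062F, 0x0632, 0x0633, 0x0698};
        for (int cp : letters) {
            System.out.println(describe(cp));
        }

        // The same check applied to text, for example a pasted query term.
        "\u062F-\u0698".codePoints().forEach(cp -> System.out.println(describe(cp)));
    }
}
```

In the Unicode character database, U+0632 is named ARABIC LETTER ZAIN and U+0698 is named ARABIC LETTER JEH, so the printed names make the one-dot and three-dot letters easy to tell apart.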

Steve



RE: lucene farsi problem

by esra

Hi Steven ,

yes the correct one is "ژ "/"ze"/U+632.

my problem is when i search the "د-ژ" range. The result is "ساب ووفر"; this word's first letter is "س", its unicode is U+633, and it is not in the [ U+062F - U+0632 ] range.

am i wrong?

Esra


RE: lucene farsi problem

by Steven A Rowe

Hi Esra,

You are *still* incorrectly referring to the glyph with three dots over it:

On 05/02/2008 at 12:18 PM, esra wrote:
> yes the correct one is "ژ "/"ze"/U+632.

"ژ" is *not* "ze"/U+632 - it is "zhe"/U+698.

Have you increased the font size?  Can you see the difference between these two?:

"ژ"/"zhe"/U+698
"ز"/"ze"/U+632

> my problem is when i search the "د-ژ" range. The result
> is "ساب ووفر"; this word's first letter is "س", its unicode is
> U+633, and it is not in the [ U+062F - U+0632 ] range.

Like I keep saying, in the above description you're using the glyph "ژ"/"zhe"/U+698, while at the same time incorrectly referring to it as "ze"/U+632.

I don't mean to continually bang on about this - if you're *sure* that when you search, you're using the character represented by the glyph with one dot (and not three), i.e. "ز"/"ze"/U+632, then the problem lies elsewhere.

Steve


RE: lucene farsi problem

by esra

Hi Steven ,

yes you are right, sorry i am a bit confused.

i checked again and the correct one is  "zhe"/U+698.

It seems the word is in the range but my customer says it shouldn't be.

I think the problem occurs because "zhe" is a Persian letter outside the Arabic alphabet. In the Farsi alphabet this letter comes before "س", not after it, but its Unicode code point is larger than that of "س", and the searcher compares by code point.

Esra
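
The mismatch described above can be shown in a few lines of Java. This is a sketch under assumptions: the ALPHABET string is the commonly used 32-letter Farsi dictionary order written by hand, and the class and method names are hypothetical.

```java
// Illustration only: compares code point order with a hand-written alphabet order.
class FarsiOrderDemo {

    // Assumed Farsi dictionary order; note U+0698 (zhe) sits between U+0632 and U+0633.
    static final String ALPHABET =
          "\u0627\u0628\u067E\u062A\u062B\u062C\u0686\u062D\u062E\u062F\u0630\u0631\u0632\u0698"
        + "\u0633\u0634\u0635\u0636\u0637\u0638\u0639\u063A\u0641\u0642\u06A9\u06AF\u0644\u0645"
        + "\u0646\u0648\u0647\u06CC";

    /** Order by raw code unit value, as String.compareTo does. */
    static int byCodePoint(char a, char b) {
        return Character.compare(a, b);
    }

    /** Order by position in the assumed Farsi alphabet. */
    static int byAlphabet(char a, char b) {
        return Integer.compare(ALPHABET.indexOf(a), ALPHABET.indexOf(b));
    }

    public static void main(String[] args) {
        char zhe = '\u0698';
        char seen = '\u0633';
        System.out.println(byCodePoint(zhe, seen) > 0); // true: zhe sorts after seen by code point
        System.out.println(byAlphabet(zhe, seen) < 0);  // true: zhe sorts before seen alphabetically
    }
}
```

Because the two orderings disagree for this pair, a range whose upper bound is "zhe" covers every letter from U+0633 through U+0697 when bounds are compared by code point.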


RE: lucene farsi problem

by Steven A Rowe

Hi Esra,

I have created an issue for this - see <https://issues.apache.org/jira/browse/LUCENE-1279>.

I'll try to take a crack at a patch this weekend.

Steve


RE: lucene farsi problem

by esra

Hi Steven,

thanks for your help....

Esra


RE: lucene farsi problem

by Steven A Rowe

Hi Esra,

I have attached a patch to LUCENE-1279 containing a new class: CollatingRangeQuery.

The patch also contains a test class: TestCollatingRangeQuery.  One of the test methods checks for the Farsi range you were having trouble with.

It should be mentioned that according to Collator.getAvailableLocales(), neither Java 1.4.2 nor Java 1.5.0 contains Farsi collation information.  However, in the test class I use the Arabic Locale, which seems to properly collate the non-Arabic Farsi letter U+0698, and hopefully behaves well with other Farsi letters.  If you find that this is not the case, you can look into writing collation rules using RuleBasedCollator - you should be able to directly specify the proper letter orderings for Farsi; CollatingRangeQuery would also have to supply a constructor that takes in a Collator instead of a Locale.

Please give the class a try and post back about how it works.

Thanks,
Steve
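
For anyone who wants to try the RuleBasedCollator route mentioned above, here is a rough sketch. It does not use CollatingRangeQuery, whose signature is not shown in this thread; it only shows a collator-based range test. The rule string is an assumption: a hand-written tailoring listing the 32 Farsi letters in dictionary order, and all class and method names are hypothetical.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Illustration only: a collator-based range check, not the code from the patch.
class CollatedRangeSketch {

    // Assumed tailoring: each "<" introduces the next letter in Farsi dictionary order.
    static final String FARSI_RULES =
          "< \u0627 < \u0628 < \u067E < \u062A < \u062B < \u062C < \u0686 < \u062D"
        + "< \u062E < \u062F < \u0630 < \u0631 < \u0632 < \u0698 < \u0633 < \u0634"
        + "< \u0635 < \u0636 < \u0637 < \u0638 < \u0639 < \u063A < \u0641 < \u0642"
        + "< \u06A9 < \u06AF < \u0644 < \u0645 < \u0646 < \u0648 < \u0647 < \u06CC";

    /** Builds a collator from the rules above. */
    static Collator farsiCollator() {
        try {
            return new RuleBasedCollator(FARSI_RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    /** Inclusive range test that compares with the collator instead of code points. */
    static boolean inRange(Collator collator, String term, String lower, String upper) {
        return collator.compare(term, lower) >= 0 && collator.compare(term, upper) <= 0;
    }

    public static void main(String[] args) {
        Collator collator = farsiCollator();
        String term = "\u0633\u0627\u0628 \u0648\u0648\u0641\u0631"; // "ساب ووفر"

        System.out.println(inRange(collator, term, "\u062F", "\u0698")); // false
        System.out.println(inRange(collator, term, "\u0633", "\u0638")); // true
    }
}
```

With this ordering the term starting with U+0633 no longer falls inside the range that ends at U+0698, which is the behaviour the customer expected. Whether a stock Arabic Locale collator gives the same result depends on the JDK's collation data, so the explicit rules are the more predictable option.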

On 05/03/2008 at 8:33 AM, esra wrote:

>
> Hi Steven,
>
> thanks for your help....
>
> Esra
>
>
> Steven A Rowe wrote:
> >
> > Hi Esra,
> >
> > I have created an issue for this - see
> > <https://issues.apache.org/jira/browse/LUCENE-1279>.
> >
> > I'll try to take a crack at a patch this weekend.
> >
> > Steve
> >
> > On 05/02/2008 at 12:55 PM, esra wrote:
> > >
> > > Hi Steven ,
> > >
> > > yes you are right, sorry i am a bit confused.
> > >
> > > i checked again and the correct one is  "zhe"/U+698.
> > >
> > > It seems the word is in the range but my customer says it
> > > shouldn't be.
> > >
> > > I think problem occurs because  "zhe" is a Persian letter
> > > outside the Arabic
> > > alphabet. In farsi alphabet this letter is not after the "س"
> > > letter but it's
> > > unicode is bigger than "س" letter's and the searcher works
> > > with unicodes.
> > >
> > > Esra
> > >
> > >
> > > Steven A Rowe wrote:
> > > >
> > > > Hi Esra,
> > > >
> > > > You are *still* incorrectly referring to the glyph with three dots
> > > > over it:
> > > >
> > > > On 05/02/2008 at 12:18 PM, esra wrote:
> > > > > yes the correct one is "ژ "/"ze"/U+632.
> > > >
> > > > "ژ" is *not* "ze"/U+632 - it is "zhe"/U+698.
> > > >
> > > > Have you increased the font size?  Can you see the difference between
> > > > these two?:
> > > >
> > > > "ژ"/"zhe"/U+698
> > > > "ز"/"ze"/U+632
> > > >
> > > > > my problem is when i do search for  "د-ژ" range. The result is  "ساب
> > > > > ووفر" and this word's first letter is "س" and it's unicode is
> > > > > "U+633" and it is not in the in the [ U+062F - U+0632 ] range.
> > > >
> > > > Like I keep saying, in the above description, you're using the glyph
> > > > "ژ"/"zhe"/U+698, while calling at the same time incorrectly referring
> > > > to it as "ze"/U+632.
> > > >
> > > > I don't mean to continually bang on about this - if you're *sure*
> > > > that when you search, you're using the character represented by the
> > > > glyph with one dot (and not three), i.e. "ز"/"ze"/U+632, then the
> > > > problem lies elsewhere.
> > > >
> > > > Steve
> > > >
> > > > On 05/02/2008 at 12:18 PM, esra wrote:
> > > > > yes the correct one is "ژ "/"ze"/U+632.
> > > > >
> > > > > my problem is when i do search for  "  د-ژ" range. The result
> > > > > is  ""ساب ووفر
> > > > > " and this word's first letter is "س " and it's unicode is
> > > > > "U+633"  and  it
> > > > > is not in the in the [ U+062F - U+0632 ] range.
> > > > >
> > > > > am i wrong?
> > > > >
> > > > > Esra
> > > > >
> > > > > Steven A Rowe wrote:
> > > > > >
> > > > > > Hi Esra,
> > > > > >
> > > > > > I still think you're wrong :).
> > > > > >
> > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > > > ژ = U+632
> > > > > >
> > > > > > According to the website you linked to, the above character, which
> > > > > > > has three dots over it, is named "zhe", and its Unicode code point is
> > > > > > U+698. (I had to increase the font size to see the three dots.)
> > > > > >
> > > > > > I think you are confusing "ژ"/"zhe"/U+698 with the letter
> > > > > > "ز"/"ze"/U+632, which has just one dot over it.
> > > > > >
> > > > > > > Unless you were mistaken in all of your emails when you included the
> > > > > > character "ژ"/"zhe" instead of "ز"/"ze", then what I said in my
> > > > > > previous email still stands: there is no problem here.
> > > > > >
> > > > > > Steve
> > > > > >
> > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > >
> > > > > > > Hi Steven,
> > > > > > >
> > > > > > > sorry i made a mistake. unicodes are like this:
> > > > > > >
> > > > > > > > د=U+62F
> > > > > > > > ژ = U+632
> > > > > > > > and the first letter of "ساب ووفر " is  س = U+633
> > > > > > >
> > > > > > > you can also check them here
> > > > > > > >
> > > http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html
> > > > > > >
> > > > > > > Esra
> > > > > > >
> > > > > > >
> > > > > > > Steven A Rowe wrote:
> > > > > > > >
> > > > > > > > Hi Esra,
> > > > > > > >
> > > > > > > > > Going back to the original problem statement, I see something that
> > > > > > > > looks illogical to me - please correct me if I'm wrong:
> > > > > > > >
> > > > > > > > On Apr 30, 2008, at 3:21 AM, esra wrote:
> > > > > > > > > > i am using lucene's "IndexSearcher" to search the given xml by
> > > > > > > > > > keyword which contains farsi information. while searching i use
> > > > > > > > > ranges like
> > > > > > > > >
> > > > > > > > > آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
> > > > > > > > >
> > > > > > > > > > when i do search for  "د-ژ"  range the results are wrong , they
> > > > > > > > > > are the results of  " س-ظ "range.
> > > > > > > > > >
> > > > > > > > > > for example when i do search for "د-ژ"  one of the results
> > > > > > > > > > is "ساب ووفر", this result also shown on the " س-ظ " range's result
> > > > > > > > > > list which is the correct range.
> > > > > > > > > >
> > > > > > > > > > As IndexSearcher uses the "compareTo" method and this method uses
> > > > > > > > > > unicodes for comparing, i found the unicodes of the characters.
> > > > > > > > >
> > > > > > > > > د=U+62F
> > > > > > > > > ژ = U+698
> > > > > > > > > and the first letter of "ساب ووفر " is  س = U+633
> > > > > > > >
> > > > > > > > > It appears to me that *both* the "د-ژ" range [ U+062F - U+0698 ] and
> > > > > > > > > the "س-ظ" range [ U+0633 - U+0638 ] contain the first letter of "ساب
> > > > > > > > > ووفر", which is "س" = U+0633.
> > > > > > > > >
> > > > > > > > > You stated that U+0633 should be contained in the [ U+0633 - U+0638 ]
> > > > > > > > > range - I agree - but why do you think U+0633 should not be
> > > > > > > > > contained in the [ U+062F - U+0698 ] range?
> > > > > > > > >
> > > > > > > > > In other words, it looks to me like your problem is not a problem at
> > > > > > > > > all.
> > > > > > > >
> > > > > > > > Steve
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > -- View this message in context:
> > > > > > >
> > > > > http://www.nabble.com/lucene-farsi-problem-tp16977096p17019498
> > > > > .html Sent
> > > > > > from the Lucene - Java Users mailing list archive at Nabble.com.
> > > > > >
> > > > > >
> > > > > >
> > >
> ---------------------------------------------------------------------
> > > > > To
> > > > > > unsubscribe, e-mail: java-user-unsubscribe@... For
> > > > > > additional commands, e-mail: java-user-help@...
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -- View this message in context:
> > > >
> http://www.nabble.com/lucene-farsi-problem-tp16977096p17022861.html
> > > >  Sent from the Lucene - Java Users mailing list archive at
> > > Nabble.com.
> > > >
> > > >
> > > >
> > >
> ---------------------------------------------------------------------
> > > >  To unsubscribe, e-mail: java-user-unsubscribe@...
> > > >  For additional commands, e-mail:
> java-user-help@...
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > > -- View this message in context:
> > >
> http://www.nabble.com/lucene-farsi-problem-tp16977096p17023557
 .html Sent

> > from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> >
> > --------------------------------------------------------------------- To
> > unsubscribe, e-mail: java-user-unsubscribe@... For
> > additional commands, e-mail: java-user-help@...
> >
> >
>
>
>
>
>
 

 


RE: lucene farsi problem

by esra

Hi Steven,

i tried the class and it works fine with the locale parameter "ar".

Actually we are using "fa" for Farsi and "ar" for Arabic.
I have added a small check on the locale parameter in my code and now i can see the correct results.
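
A minimal sketch of what such a locale check might look like, assuming the application passes a language code string and that Farsi should borrow the Arabic collation rules. The class and method names here are hypothetical; they are not part of Lucene or of the LUCENE-1279 patch, and the actual check used in the application is not shown in this thread.

```java
import java.text.Collator;
import java.util.Locale;

class CollationLocales {
    // Hypothetical helper: choose the locale whose collation rules are used
    // for a given application language code. "fa" is redirected to "ar"
    // because the JDK versions discussed in this thread ship no Farsi rules.
    static Locale collationLocaleFor(String languageCode) {
        if ("fa".equals(languageCode)) {
            return Locale.forLanguageTag("ar");
        }
        return Locale.forLanguageTag(languageCode);
    }

    // Build a Collator for the (possibly redirected) locale.
    static Collator collatorFor(String languageCode) {
        return Collator.getInstance(collationLocaleFor(languageCode));
    }

    public static void main(String[] args) {
        System.out.println("fa -> " + collationLocaleFor("fa"));
        System.out.println("ar -> " + collationLocaleFor("ar"));
    }
}
```

Keeping the redirect in one place means it can be dropped later if the runtime gains Farsi collation data.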

Thank you very much for your help.

Esra.


Steven A Rowe wrote:
Hi Esra,

I have attached a patch to LUCENE-1279 containing a new class: CollatingRangeQuery.

The patch also contains a test class: TestCollatingRangeQuery.  One of the test methods checks for the Farsi range you were having trouble with.

It should be mentioned that according to Collator.getAvailableLocales(), neither Java 1.4.2 nor Java 1.5.0 contains Farsi collation information.  However, in the test class I use the Arabic Locale, which seems to properly collate the non-Arabic Farsi letter U+0698, and hopefully behaves well with other Farsi letters.  If you find that this is not the case, you can look into writing collation rules using RuleBasedCollator - you should be able to directly specify the proper letter orderings for Farsi; CollatingRangeQuery would also have to supply a constructor that takes in a Collator instead of a Locale.
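
To illustrate the RuleBasedCollator fallback mentioned above, here is a rough sketch that spells out the Persian letter order by hand and contrasts it with plain String.compareTo. It is not the CollatingRangeQuery from the patch: the class, the helper methods, and the letter table are assumptions made for this example, and the table lists only the 32 base Persian letters (no alef with madda, no Arabic variants of kaf and yeh).

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

class FarsiRangeSketch {
    // Persian letters in dictionary order, written as Unicode escapes.
    // Zhe (U+0698) is placed between ze (U+0632) and sin (U+0633).
    static final String[] PERSIAN_ORDER = {
        "\u0627", "\u0628", "\u067E", "\u062A", "\u062B", "\u062C", "\u0686", "\u062D",
        "\u062E", "\u062F", "\u0630", "\u0631", "\u0632", "\u0698", "\u0633", "\u0634",
        "\u0635", "\u0636", "\u0637", "\u0638", "\u0639", "\u063A", "\u0641", "\u0642",
        "\u06A9", "\u06AF", "\u0644", "\u0645", "\u0646", "\u0648", "\u0647", "\u06CC"
    };

    // Build a collator from rules of the form "< a < b < c".
    static RuleBasedCollator persianCollator() {
        StringBuilder rules = new StringBuilder();
        for (String letter : PERSIAN_ORDER) {
            rules.append("< ").append(letter).append(' ');
        }
        try {
            return new RuleBasedCollator(rules.toString());
        } catch (ParseException e) {
            throw new IllegalStateException("bad collation rules", e);
        }
    }

    // Inclusive range test using String.compareTo (UTF-16 code unit order).
    static boolean inCodePointRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Inclusive range test using the collator's ordering instead.
    static boolean inCollatedRange(RuleBasedCollator c, String term, String lower, String upper) {
        return c.compare(term, lower) >= 0 && c.compare(term, upper) <= 0;
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = persianCollator();
        String word = "\u0633\u0627\u0628"; // first word of the example result, starts with sin
        String dal = "\u062F";
        String zhe = "\u0698";
        System.out.println("compareTo range: " + inCodePointRange(word, dal, zhe));
        System.out.println("collated range:  " + inCollatedRange(collator, word, dal, zhe));
    }
}
```

With compareTo the word falls inside the dal-to-zhe range because U+0633 lies between U+062F and U+0698; with the hand-written rules it does not, since sin is ordered after zhe.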

Please give the class a try and post back about how it works.

Thanks,
Steve

On 05/03/2008 at 8:33 AM, esra wrote:
>
> Hi Steven,
>
> thanks for your help....
>
> Esra
>
>
> Steven A Rowe wrote:
> >
> > Hi Esra,
> >
> > I have created an issue for this - see
> > <https://issues.apache.org/jira/browse/LUCENE-1279>.
> >
> > I'll try to take a crack at a patch this weekend.
> >
> > Steve
> >
> > On 05/02/2008 at 12:55 PM, esra wrote:
> > >
> > > Hi Steven ,
> > >
> > > yes you are right, sorry i am a bit confused.
> > >
> > > i checked again and the correct one is  "zhe"/U+698.
> > >
> > > It seems the word is in the range but my customer says it
> > > shouldn't be.
> > >
> > > I think the problem occurs because "zhe" is a Persian letter outside the
> > > Arabic alphabet. In the Farsi alphabet this letter does not come after "س",
> > > but its Unicode code point is greater than that of "س", and the
> > > searcher compares by code point.
> > >
> > > Esra
> > >
> > >
> > > Steven A Rowe wrote:
> > > >
> > > > Hi Esra,
> > > >
> > > > You are *still* incorrectly referring to the glyph with three dots
> > > > over it:
> > > >
> > > > On 05/02/2008 at 12:18 PM, esra wrote:
> > > > > yes the correct one is "ژ "/"ze"/U+632.
> > > >
> > > > "ژ" is *not* "ze"/U+632 - it is "zhe"/U+698.
> > > >
> > > > Have you increased the font size?  Can you see the difference between
> > > > these two?:
> > > >
> > > > "ژ"/"zhe"/U+698
> > > > "ز"/"ze"/U+632
> > > >
> > > > > my problem is when i do search for  "د-ژ" range. The result is  "ساب
> > > > > ووفر" and this word's first letter is "س" and it's unicode is
> > > > > "U+633" and it is not in the in the [ U+062F - U+0632 ] range.
> > > >
> > > > Like I keep saying, in the above description, you're using the glyph
> > > > > "ژ"/"zhe"/U+698, while at the same time incorrectly referring
> > > > to it as "ze"/U+632.
> > > >
> > > > I don't mean to continually bang on about this - if you're *sure*
> > > > that when you search, you're using the character represented by the
> > > > glyph with one dot (and not three), i.e. "ز"/"ze"/U+632, then the
> > > > problem lies elsewhere.
> > > >
> > > > Steve
> > > >
> > > > On 05/02/2008 at 12:18 PM, esra wrote:
> > > > > yes the correct one is "ژ "/"ze"/U+632.
> > > > >
> > > > > > my problem is when i do search for  "د-ژ" range. The result
> > > > > > is "ساب ووفر" and this word's first letter is "س" and its unicode is
> > > > > > "U+633" and it is not in the [ U+062F - U+0632 ] range.
> > > > >
> > > > > am i wrong?
> > > > >
> > > > > Esra
> > > > >
> > > > > Steven A Rowe wrote:
> > > > > >
> > > > > > Hi Esra,
> > > > > >
> > > > > > I still think you're wrong :).
> > > > > >
> > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > > > ژ = U+632
> > > > > >
> > > > > > According to the website you linked to, the above character, which
> > > > > > > has three dots over it, is named "zhe", and its Unicode code point is
> > > > > > U+698. (I had to increase the font size to see the three dots.)
> > > > > >
> > > > > > I think you are confusing "ژ"/"zhe"/U+698 with the letter
> > > > > > "ز"/"ze"/U+632, which has just one dot over it.
> > > > > >
> > > > > > > Unless you were mistaken in all of your emails when you included the
> > > > > > character "ژ"/"zhe" instead of "ز"/"ze", then what I said in my
> > > > > > previous email still stands: there is no problem here.
> > > > > >
> > > > > > Steve
> > > > > >
> > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > >
> > > > > > > Hi Steven,
> > > > > > >
> > > > > > > sorry i made a mistake. unicodes are like this:
> > > > > > >
> > > > > > > > د=U+62F
> > > > > > > > ژ = U+632
> > > > > > > > and the first letter of "ساب ووفر " is  س = U+633
> > > > > > >
> > > > > > > you can also check them here
> > > > > > > >
> > > http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html
> > > > > > >
> > > > > > > Esra
> > > > > > >
> > > > > > >
> > > > > > > Steven A Rowe wrote:
> > > > > > > >
> > > > > > > > Hi Esra,
> > > > > > > >
> > > > > > > > > Going back to the original problem statement, I see something that
> > > > > > > > looks illogical to me - please correct me if I'm wrong:
> > > > > > > >
> > > > > > > > On Apr 30, 2008, at 3:21 AM, esra wrote:
> > > > > > > > > > i am using lucene's "IndexSearcher" to search the given xml by
> > > > > > > > > > keyword which contains farsi information. while searching i use
> > > > > > > > > ranges like
> > > > > > > > >
> > > > > > > > > آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
> > > > > > > > >
> > > > > > > > > > when i do search for  "د-ژ"  range the results are wrong , they
> > > > > > > > > > are the results of  " س-ظ "range.
> > > > > > > > > >
> > > > > > > > > > for example when i do search for "د-ژ"  one of the results
> > > > > > > > > > is "ساب ووفر", this result also shown on the " س-ظ " range's result
> > > > > > > > > > list which is the correct range.
> > > > > > > > > >
> > > > > > > > > > As IndexSearcher uses the "compareTo" method and this method uses
> > > > > > > > > > unicodes for comparing, i found the unicodes of the characters.
> > > > > > > > >
> > > > > > > > > د=U+62F
> > > > > > > > > ژ = U+698
> > > > > > > > > and the first letter of "ساب ووفر " is  س = U+633
> > > > > > > >
> > > > > > > > > It appears to me that *both* the "د-ژ" range [ U+062F - U+0698 ] and
> > > > > > > > > the "س-ظ" range [ U+0633 - U+0638 ] contain the first letter of "ساب
> > > > > > > > > ووفر", which is "س" = U+0633.
> > > > > > > > >
> > > > > > > > > You stated that U+0633 should be contained in the [ U+0633 - U+0638 ]
> > > > > > > > > range - I agree - but why do you think U+0633 should not be
> > > > > > > > > contained in the [ U+062F - U+0698 ] range?
> > > > > > > > >
> > > > > > > > > In other words, it looks to me like your problem is not a problem at
> > > > > > > > > all.
> > > > > > > >
> > > > > > > > Steve
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > -- View this message in context:
> > > > > > >
> > > > > http://www.nabble.com/lucene-farsi-problem-tp16977096p17019498
> > > > > .html Sent
> > > > > > from the Lucene - Java Users mailing list archive at Nabble.com.
> > > > > >
> > > > > >
> > > > > >
> > >
> ---------------------------------------------------------------------
> > > > > To
> > > > > > unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
> > > > > > additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -- View this message in context:
> > > >
> http://www.nabble.com/lucene-farsi-problem-tp16977096p17022861.html
> > > >  Sent from the Lucene - Java Users mailing list archive at
> > > Nabble.com.
> > > >
> > > >
> > > >
> > >
> ---------------------------------------------------------------------
> > > >  To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >  For additional commands, e-mail:
> java-user-help@lucene.apache.org
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > > -- View this message in context:
> > >
> http://www.nabble.com/lucene-farsi-problem-tp16977096p17023557
 .html Sent
> > from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> >
> > --------------------------------------------------------------------- To
> > unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For
> > additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
>
>
>
 

 

Re: lucene farsi problem

by Vizzini

Sorry for cross posting, but why the word 'Farsi' instead of 'Persian'?  No one says Lucene français, or Español, or Deutsch - so why Farsi?

Please read the following article, I found it quite enlightening.
http://www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm

PV