Re: Preparing the ground for a real multilang index

by polx

I believe the proper way is for the server to compute a list of
accepted languages in order of preference, drawing on the web-platform
language (e.g. the user setting) and the values in the Accept-Language
HTTP header (which come from the browser or platform).
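
A minimal sketch of that list-building step, in Python. The function name and the choice to put the user setting ahead of the header values are my assumptions, and region subtags (de-DE) are simply folded into the base language:

```python
def accepted_languages(user_setting, accept_language_header):
    """Return language codes in order of preference.

    Assumption: the platform user setting wins over anything in the
    Accept-Language header; header entries are ordered by q-value.
    """
    weighted = []
    for position, item in enumerate((accept_language_header or "").split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag or tag == "*":
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unparsable weight: drop the entry
        if q > 0:
            # Fold "de-DE" into "de"; keep header position as a tie-breaker.
            weighted.append((-q, position, tag.split("-")[0]))

    ordered = [user_setting.lower()] if user_setting else []
    for _, _, lang in sorted(weighted):
        if lang not in ordered:
            ordered.append(lang)
    return ordered


print(accepted_languages("fr", "de-DE,de;q=0.9,en;q=0.8,fr;q=0.5"))
```

The result is a de-duplicated list that the query expansion can iterate over.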

Then you expand your query for "surfing waves" (say) to:
- phrase query: "surfing waves" exactly (^2.0)
- two terms, no stemming: surfing waves (^1.5)
- iterate through the languages and query for stemmed variants:
   - English: surf wav (^1.0)
   - German: surfing wave (^0.9)
   - ...
- then maybe even try the phonetic analyzer (probably matched in a
  separate field)
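
The expansion above could be sketched roughly as follows, in Python. The field names (text_exact, text_en, text_de, ...) are hypothetical, and I am assuming each per-language field has an analyzer that applies the stemmer at query time, so the raw terms are sent and the stemmed forms (surf wav) come from the server. The phonetic step is left out, and query-syntax characters in the input are not escaped:

```python
def expand_query(text, languages, field="text"):
    """Build one boosted Lucene-style query string from user input.

    Assumptions: "<field>_exact" is an unstemmed field and
    "<field>_<lang>" fields stem with that language's analyzer.
    """
    terms = text.split()
    phrase = " ".join(terms)
    clauses = [
        f'{field}_exact:"{phrase}"^2.0',   # exact phrase, highest boost
        f"{field}_exact:({phrase})^1.5",   # both terms, no stemming
    ]
    boost = 1.0
    for lang in languages:                 # already in preference order
        clauses.append(f"{field}_{lang}:({phrase})^{boost:.1f}")
        boost = max(boost - 0.1, 0.1)      # less preferred, lower boost
    return " OR ".join(clauses)


print(expand_query("surfing waves", ["en", "de"]))
```

The languages argument is the preference-ordered list computed by the server, so the first language gets ^1.0, the next ^0.9, and so on.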

I think this is a common pattern on the web where the users, browsers,  
and servers are all somewhat multilingual.

paul

On 2 Jul 2009, at 22:15, Otis Gospodnetic wrote:

>
> Michael,
>
> I think you really ought to know the language of the query (from a
> pulldown, from the browser, from user settings, somewhere) and pass  
> that to the backend.... unless your queries are sufficiently long  
> that their language can be identified.
>
> Here is a handy tool for playing with language identification:
>
>  http://www.sematext.com/demo/lid/
>
> You'll see how hard it is to guess the language of very short texts. :)
> You really want to avoid that huge OR.  Often it makes no sense to
> OR in a multilingual context.  Think about the word "die" (English and
> German, as you know) and what happens when you include that in an  
> OR.  And does it make sense to include a "very language specific  
> word", say "wunderbar", in an OR that goes across multiple/all  
> languages?  Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar
>
>
> Otis--
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Michael Lackhoff <michael@...>
>> To: solr-user@...
>> Sent: Thursday, July 2, 2009 2:58:41 PM
>> Subject: Preparing the ground for a real multilang index
>>
>> As pointed out in the recent thread about stemmers and other language
>> specifics I should handle them all in their own right. But how?
>>
>> The first problem is how to know the language. Sometimes I have a
>> language identifier within the record, sometimes I have more than  
>> one,
>> sometimes I have none. How should I handle the non-obvious cases?
>>
>> Given I somehow know record1 is English and record2 is German. Then I
>> need all my (relevant) fields for every language, e.g. I will have
>> TITLE_ENG and TITLE_GER and both will have their respective  
>> stemmer. But
>> what about exotic languages? Use a catch-all "language" without a
>> stemmer?
>>
>> Now a user searches for TITLE:term and I don't know beforehand the
>> language of "term". Do I have to expand the query to something like
>> "TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ..." or is  
>> there
>> some sort of copyfield for analyzed fields? Then I could just copy  
>> all
>> the TITLE_* fields to TITLE and not bother with the language of
>> the query.
>>
>> Are there any solutions that prevent an index with thousands of  
>> fields
>> and dozens of ORed query terms?
>>
>> I know I will have to implement some better multilanguage support but
>> would also like to keep it as simple as possible.
>>
>> -Michael
>

