>
> Michael,
>
> I think you really ought to know the language of the query (from a
> pulldown, from the browser, from user settings, somewhere) and pass
> that to the backend, unless your queries are sufficiently long
> that their language can be identified.
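
A minimal sketch of passing a known query language through to field selection. TITLE_ENG and TITLE_GER are the field names from Michael's message; the language codes, the TITLE_OTHER catch-all field, and the function name are assumptions made only for illustration. The term is assumed to be a single word that is already safe for the query parser.

```python
from typing import Optional

# Per-language title fields. TITLE_ENG and TITLE_GER come from the thread;
# the "en"/"de" codes and the TITLE_OTHER catch-all are hypothetical.
TITLE_FIELDS = {"en": "TITLE_ENG", "de": "TITLE_GER"}
FALLBACK_FIELD = "TITLE_OTHER"


def title_query(term: str, lang: Optional[str]) -> str:
    """Build a single-field query when the query language is known.

    Falls back to the catch-all field when the language is missing
    or has no dedicated field.
    """
    field = TITLE_FIELDS.get((lang or "").lower(), FALLBACK_FIELD)
    return f"{field}:{term}"


# The same word goes to a different field depending on the language.
print(title_query("die", "en"))
print(title_query("die", "de"))
print(title_query("die", None))
```

With the language known, "die" is searched in only one analyzed field, so no OR across languages is needed.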
>
> Here is a handy tool for playing with language identification:
>
> http://www.sematext.com/demo/lid/
>
> You'll see how hard it is to guess the language of very short texts. :)
> You really want to avoid that huge OR. Often it makes no sense to
> OR in a multilingual context. Think about the word "die" (English and
> German, as you know) and what happens when you include that in an
> OR. And does it make sense to include a "very language-specific
> word", say "wunderbar", in an OR that goes across multiple/all
> languages? Funny, they have it listed at
> http://www.merriam-webster.com/dictionary/wunderbar
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Michael Lackhoff <michael@...>
>> To: solr-user@...
>> Sent: Thursday, July 2, 2009 2:58:41 PM
>> Subject: Preparing the ground for a real multilang index
>>
>> As pointed out in the recent thread about stemmers and other language
>> specifics I should handle them all in their own right. But how?
>>
>> The first problem is how to know the language. Sometimes I have a
>> language identifier within the record, sometimes I have more than
>> one, sometimes I have none. How should I handle the non-obvious cases?
>>
>> Suppose I somehow know record1 is English and record2 is German. Then I
>> need all my (relevant) fields for every language, e.g. I will have
>> TITLE_ENG and TITLE_GER and both will have their respective
>> stemmer. But what about exotic languages? Use a catch-all "language"
>> without a stemmer?
>>
>> Now a user searches for TITLE:term and I don't know beforehand the
>> language of "term". Do I have to expand the query to something like
>> "TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ..." or is
>> there some sort of copyfield for analyzed fields? Then I could just
>> copy all the TITLE_* fields to TITLE and not bother with the language
>> of the query.
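
For reference, the expansion described above can be sketched as follows. The field list is taken from the example in the message (TITLE_XY stands in for any further language); the function name is hypothetical, and the term is assumed to be a single word needing no escaping.

```python
from typing import Iterable


def expanded_title_query(term: str, fields: Iterable[str]) -> str:
    """OR the same term across every per-language title field.

    This is the "huge OR" variant: the query grows by one clause
    for every language field in the index.
    """
    return " OR ".join(f"{field}:{term}" for field in fields)


# Field names as used in the message; TITLE_XY is a placeholder there too.
fields = ["TITLE_ENG", "TITLE_GER", "TITLE_XY"]
print(expanded_title_query("term", fields))
```

Each added language adds one clause, which is how the dozens of ORed query terms mentioned below come about.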
>>
>> Are there any solutions that prevent an index with thousands of
>> fields and dozens of ORed query terms?
>>
>> I know I will have to implement some better multilanguage support but
>> would also like to keep it as simple as possible.
>>
>> -Michael
>