Preparing the ground for a real multilang index

by Michael Lackhoff-2

As pointed out in the recent thread about stemmers and other language
specifics I should handle them all in their own right. But how?

The first problem is how to know the language. Sometimes I have a
language identifier within the record, sometimes I have more than one,
sometimes I have none. How should I handle the non-obvious cases?

Given I somehow know record1 is English and record2 is German. Then I
need all my (relevant) fields for every language, e.g. I will have
TITLE_ENG and TITLE_GER and both will have their respective stemmer. But
what about exotic languages? Use a catch-all "language" without a stemmer?

Now a user searches for TITLE:term and I don't know beforehand the
language of "term". Do I have to expand the query to something like
"TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ..." or is there
some sort of copyfield for analyzed fields? Then I could just copy all
the TITLE_* fields to TITLE and not bother with the language of the query.
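
For illustration, the expansion described above could be generated on the client side before the query is sent. This is only a minimal sketch: the function name and the list of language suffixes are assumptions, and the term is wrapped in quotes so that any query-syntax characters inside it stay within one clause.

```python
def expand_field_query(field, term, languages):
    """Build 'FIELD_LANG:"term" OR ...' for one logical field.

    `languages` is a hypothetical list of suffixes such as ["ENG", "GER"].
    """
    # Escape backslashes and double quotes so the term stays one quoted clause.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    clauses = ['{}_{}:"{}"'.format(field, lang, escaped) for lang in languages]
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(expand_field_query("TITLE", "term", ["ENG", "GER", "XY"]))
```

The number of clauses grows with the number of language fields, which is exactly the concern raised in the next question.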

Are there any solutions that prevent an index with thousands of fields
and dozens of ORed query terms?

I know I will have to implement some better multilanguage support but
would also like to keep it as simple as possible.

-Michael

Re: Preparing the ground for a real multilang index

by Otis Gospodnetic


Michael,

I think you really ought to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend... unless your queries are sufficiently long that their language can be identified.

Here is a handy tool for playing with language identification:

  http://www.sematext.com/demo/lid/

You'll see how hard it is to guess the language of very short texts. :)
You really want to avoid that huge OR.  Often it makes no sense to OR in a multilingual context.  Think about the word "die" (English and German, as you know) and what happens when you include that in an OR.  And does it make sense to include a "very language specific word", say "wunderbar", in an OR that goes across multiple/all languages?  Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar


Otis--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----

> From: Michael Lackhoff <michael@...>
> To: solr-user@...
> Sent: Thursday, July 2, 2009 2:58:41 PM
> Subject: Preparing the ground for a real multilang index


Re: Preparing the ground for a real multilang index

by Walter Underwood, Netflix

Not to mention Americans who call themselves "wunder". Or brand names, like
LaserJet, which are the same in all languages. Queries are far too short for
effective language id.

You can get language preferences from the HTTP request headers, then allow
people to override them. I think the header is Accept-Language, but it has
been a long time since I did that.
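
As a rough illustration of reading that header, the sketch below parses an Accept-Language value into language tags ordered by their q-values. The function name is an assumption, and the parsing is deliberately simplified (it ignores the wildcard "*" and any parameter other than q).

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language value, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        tag = tag.strip().lower()
        quality = 1.0  # q defaults to 1 when it is absent
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q-value as "not acceptable"
        if tag and tag != "*" and quality > 0:
            # Sort by descending quality, then by original position.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


if __name__ == "__main__":
    print(parse_accept_language("de-DE,de;q=0.9,en;q=0.8"))
```

A user-facing override (a pulldown or a saved setting) would then simply take precedence over whatever this returns.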

I recommend using ISO language codes, en, de, es, fr, and so on, instead of
making up your own, like eng and ger. Don't confuse them with ISO country
codes: gb, us, etc. Korean (ko vs. kr) and Japanese (ja vs. jp) are easy to
mix up with the country codes.

wunder

On 7/2/09 1:15 PM, "Otis Gospodnetic" <otis_gospodnetic@...> wrote:

>
> Michael,
>
> I think you really ought to know the language of the query (from a pulldown,
> from the browser, from user settings, somewhere) and pass that to the
> backend.... unless your queries are sufficiently long that their language can
> be identified.


Re: Preparing the ground for a real multilang index

by polx

I believe the proper way is for the server to compute a list of
accepted languages in order of preference.
This list combines the web-platform language (e.g. the user setting) and
the values in the Accept-Language HTTP header (which come from the
browser or platform).
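
A minimal sketch of building such a list, under a few assumptions that are not stated in the thread: the user setting is ranked ahead of the header values, regional tags such as en-US are reduced to their primary language, and "en" is appended as a last resort. The function name is hypothetical.

```python
def preferred_languages(user_setting, header_languages, fallback=("en",)):
    """Merge language sources into one de-duplicated, ordered list."""
    ordered = []
    for tag in [user_setting, *header_languages, *fallback]:
        if not tag:
            continue  # no user setting, or an empty header entry
        primary = tag.lower().split("-")[0]  # "en-US" -> "en"
        if primary not in ordered:
            ordered.append(primary)
    return ordered


if __name__ == "__main__":
    print(preferred_languages("de", ["en-US", "de-DE", "fr"]))
```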

Then you expand your query for surfing waves (say) to:
- phrase query: surfing waves exactly (^2.0)
- two terms, no stemming: surfing waves (^1.5)
- iterate through the languages and query for stemmed variants:
   - English: surf wav (^1.0)
   - German: surfing wave (^0.9)
   - ...
- then maybe even try the phonetic analyzer (probably matched in a
  separate field)
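
The expansion above (without the phonetic step) can be sketched roughly as follows. Everything named here is an assumption made for illustration: the text_<lang> field names, the helper functions, and above all the toy suffix strippers, which merely stand in for real per-language stemmers and are tuned only to reproduce the example.

```python
def make_suffix_stemmer(suffixes):
    """Toy stand-in for a real stemmer: strip the first matching suffix."""
    def stem(word):
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word
    return stem


def build_expanded_query(text, stemmers):
    """stemmers: ordered list of (language, stem_function, boost)."""
    words = text.split()
    clauses = []
    if len(words) > 1:
        # Exact phrase gets the highest boost.
        clauses.append('"{}"^2.0'.format(" ".join(words)))
    # Unstemmed terms against the default field.
    clauses.append("({})^1.5".format(" ".join(words)))
    # One clause per language, in order of preference.
    for lang, stem, boost in stemmers:
        stemmed = " ".join(stem(word) for word in words)
        clauses.append("text_{}:({})^{}".format(lang, stemmed, boost))
    return " OR ".join(clauses)


english = make_suffix_stemmer(("ing", "es", "s"))
german = make_suffix_stemmer(("ung", "en", "s"))

if __name__ == "__main__":
    print(build_expanded_query(
        "surfing waves", [("en", english, 1.0), ("de", german, 0.9)]))
```

The order and boosts of the per-language clauses would come from the list of accepted languages computed on the server.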

I think this is a common pattern on the web where the users, browsers,  
and servers are all somewhat multilingual.

paul


Re: Preparing the ground for a real multilang index

by Michael Lackhoff-2

On 03.07.2009 00:49 Paul Libbrecht wrote:

[I'll try to address the other responses as well]

> I believe the proper way is for the server to compute a list of  
> accepted languages in order of preferences.
> The web-platform language (e.g. the user-setting), and the values in  
> the Accept-Language http header (which are from the browser or  
> platform).

All this is not going to help much because the main application is a
scientific search portal for books and articles with many users
searching cross-language. The most typical use case is a German user
searching in several languages. So the query itself might even be
multilingual, e.g. TITLE:cancer OR TITLE:krebs. There is no way here to
rely on Accept headers or a language select field (it would be left on
"any" in most cases). Other popular use cases are citations (in whatever
language) cut and pasted into the search field.

> Then you expand your query for surfing waves (say) to:
> - phrase query: surfing waves exactly (^2.0)
> - two terms, no stemming: surfing waves (^1.5)
> - iterate through the languages and query for stemmed variants:
>    - english: surf wav ^1.0
>    - german surfing wave ^0.9
>    - ....
> - then maybe even try the phonetic analyzer (matched in a separate  
> field probably)

This is an even more sophisticated variant of the multiple "OR" I came
up with. Oh well...

> I think this is a common pattern on the web where the users, browsers,  
> and servers are all somewhat multilingual.

Indeed, and often users are not even aware of it. Especially in a
scientific context they use their native tongue and English almost
interchangeably, and they expect the search engine to cope with it.

I think the best approach would be to process the data according to its
language but not make any assumptions about the query language. I am
totally lost as to how to get a clever schema.xml out of all this.

Thanks everyone for listening. I am still open to good suggestions
for dealing with this problem!

-Michael

Re: Preparing the ground for a real multilang index

by polx


On 3 July 2009, at 07:43, Michael Lackhoff wrote:

> On 03.07.2009 00:49 Paul Libbrecht wrote:
>
> [I'll try to address the other responses as well]
>
>> I believe the proper way is for the server to compute a list of
>> accepted languages in order of preferences.
>> The web-platform language (e.g. the user-setting), and the values in
>> the Accept-Language http header (which are from the browser or
>> platform).
>
> All this is not going to help much because the main application is a
> scientific search portal for books and articles with many users
> searching cross-language. The most typical use case is a German user
> searching multilingual. So we might even get the search multilingual,
> e.g. TITLE:cancer OR TITLE:krebs. No way here to watch out for
> Accept-headers or a language select field (would be left on "any" in
> most cases). Other popular use cases are citations (in whatever
> language) cut and pasted into the search field.
The algorithm I described does take all this into account: the ambiguity
of the query's language.
You have no other way to offer any form of stemming in each language
(e.g. removing -ing and removing -ung) than to actually do this.
Is it because you use Solr directly that languages can't be passed
around?
You need a server part to get the headers, indeed.
Oh, and yes, you have to double everything I described in order to
prefer matches in the title, by the way.
We've implemented something that might be close to what you're looking
for: i2geo search, which comes much closer to the cross-lingual problem
by request entity designation. It's under APL.

Try searching for, say, Viereck in the search box. See a short
description at:
   http://i2geo.net/xwiki/bin/view/About/GeoSkills
>
> I think the best would be to process the data according to its  
> language
> but don't make any assumptions about the query language and I am  
> totally
> lost how to get a clever schema.xml out of all this.

Just OR them properly.
Storing different languages in different fields (title-de, title-en)
is the right way to get the schema.xml properly configured with an
analyzer, I think.

paul


Re: Preparing the ground for a real multilang index

by Jan Høydahl

When using stemming, you have to know the query language.
For your project, perhaps you should look into switching to a
lemmatizer instead. I believe Lucid can provide integration with a
commercial lemmatizer. This way you can expand the document field
itself and do not need to know the query language. You may then want
to do a copyfield from all your text_<lang> -> text for a convenient
one-field-to-rule-them-all search.

--
Jan Høydahl
Gründer & senior architect
Cominvent AS, Stabekk, Norway
www.cominvent.com
+20 100930908



Re: Preparing the ground for a real multilang index

by bimargulies

There is an alternative to knowing the language at query:
multiply-process for stems or lemmas of all the possible languages.
This may well be a cure much worse than the disease.
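
To make the trade-off concrete, here is a toy sketch of such multiple processing: every word is kept in its surface form and additionally run through every language's stemmer, so one field ends up holding all variants. The helper names are hypothetical, and the suffix strippers are crude stand-ins for real stemmers or lemmatizers.

```python
def make_suffix_stemmer(suffixes):
    """Toy stand-in for a real stemmer: strip the first matching suffix."""
    def stem(word):
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word
    return stem


def multi_stem_tokens(text, stemmers):
    """Return, per input word, the surface form plus each distinct stem."""
    expanded = []
    for word in text.lower().split():
        variants = [word]
        for stem in stemmers:
            candidate = stem(word)
            if candidate not in variants:
                variants.append(candidate)
        expanded.append(variants)
    return expanded


english = make_suffix_stemmer(("ing", "es", "s"))
german = make_suffix_stemmer(("ung", "en", "s"))

if __name__ == "__main__":
    print(multi_stem_tokens("Surfing waves", [english, german]))
```

Each additional language can add another variant per word, and stems produced by the "wrong" language's rules can match unrelated words, which is where the cure starts to look worse than the disease.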

Yes, LI can sell you our lemma-production capability.

--benson margulies
basis technology





Re: Preparing the ground for a real multilang index

by Michael Lackhoff-2

On 08.07.2009 00:50 Jan Høydahl wrote:

> itself and do not need to know the query language. You may then want  
> to do a copyfield from all your text_<lang> -> text for convenient one-
> field-to-rule-them-all search.

Would that really help? As I understand it, copyfield takes the raw, not
yet analyzed field value. I cannot yet see the advantage of this
"text" field over the current situation with no text_<lang> fields at all.
The copied-to text field has to be language-agnostic with no stemming at
all, so it would miss many hits. Or is there a way to combine many
differently stemmed variants into one field to be able to search against
all of them at once? That would be great indeed!

-Michael

Re: Preparing the ground for a real multilang index

by polx

Can't the copy field use a different analyzer?
Both for query and indexing?
Otherwise you need to craft your own analyzer which reads the language
from the field name... there are several classes ready for this.
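
The idea of an analyzer that reads the language from the field name can be sketched in plain Python as below. This does not use the Lucene or Solr analyzer API; the field naming scheme (title_en, title_de), the function name, and the tiny stopword lists are all assumptions made for illustration.

```python
# Hypothetical per-language stopword lists, keyed by field-name suffix.
STOPWORDS = {
    "en": {"the", "of", "and"},
    "de": {"der", "die", "das", "und"},
}


def analyze(field_name, text):
    """Pick the language from the field-name suffix, then filter tokens."""
    _, _, lang = field_name.rpartition("_")
    # Unknown suffixes fall back to language-agnostic processing.
    stopwords = STOPWORDS.get(lang, set())
    return [token for token in text.lower().split() if token not in stopwords]


if __name__ == "__main__":
    print(analyze("title_en", "The die is cast"))
    print(analyze("title_de", "Die Welle"))
```

The word "die" survives in the English field but is dropped as a stopword in the German one, which shows why the same text needs different processing per language.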

paul


Re: Preparing the ground for a real multilang index

by Jan Høydahl

Michael, you're of course right, copyfield would copy from the source.
The lack of built-in language awareness in Solr is unfortunate :(
I have not tried Lucid's BasisTech lemmatizer implementation, but check
with them whether they can support multiple languages in the same field.

--
Jan Høydahl