Preparing the ground for a real multilang index

As pointed out in the recent thread about stemmers and other language specifics, I should handle each language in its own right. But how?

The first problem is how to know the language. Sometimes I have a language identifier within the record, sometimes more than one, sometimes none. How should I handle the non-obvious cases?

Suppose I somehow know that record1 is English and record2 is German. Then I need all my (relevant) fields for every language, e.g. TITLE_ENG and TITLE_GER, each with its respective stemmer. But what about exotic languages? Use a catch-all "language" without a stemmer?

Now a user searches for TITLE:term and I don't know beforehand the language of "term". Do I have to expand the query to something like "TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ...", or is there some sort of copyfield for analyzed fields? Then I could just copy all the TITLE_* fields to TITLE and not bother with the language of the query.

Are there any solutions that avoid an index with thousands of fields and dozens of ORed query terms?

I know I will have to implement some better multilanguage support, but I would also like to keep it as simple as possible.

-Michael
|
|
Re: Preparing the ground for a real multilang index

Michael,

I think you really ought to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend, unless your queries are sufficiently long that their language can be identified.

Here is a handy tool for playing with language identification: http://www.sematext.com/demo/lid/

You'll see how hard it is to guess the language of very short texts. :)

You really want to avoid that huge OR. Often it makes no sense to OR in a multilingual context. Think about the word "die" (English and German, as you know) and what happens when you include that in an OR. And does it make sense to include a very language-specific word, say "wunderbar", in an OR that goes across multiple/all languages? Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Michael Lackhoff <michael@...>
> To: solr-user@...
> Sent: Thursday, July 2, 2009 2:58:41 PM
> Subject: Preparing the ground for a real multilang index
>
> [...]
|
|
Re: Preparing the ground for a real multilang index

Not to mention Americans who call themselves "wunder". Or brand names, like LaserJet, which are the same in all languages.

Queries are far too short for effective language ID. You can get language preferences from the HTTP request headers, then allow people to override them. I think the header is Accept-Language, but it has been a long time since I did that.

I recommend using ISO language codes (en, de, es, fr, and so on) instead of making up your own, like eng and ger. Don't confuse them with ISO country codes: us, gb, etc. Korean and Japanese are easy to mix up with the country codes (language codes ko and ja versus country codes kr and jp).

wunder

On 7/2/09 1:15 PM, "Otis Gospodnetic" <otis_gospodnetic@...> wrote:
> [...]
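To make the Accept-Language suggestion above concrete, here is a minimal sketch in Python of turning the header into an ordered list of ISO language codes, with an optional user override placed first. The function names and the "en" default are assumptions made for illustration, not part of Solr or of any framework; a real server would normally read the header through its web framework.

```python
def parse_accept_language(header, default="en"):
    """Return primary language codes ordered by descending q-value."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        tag = tag.strip().lower()
        params = params.strip()
        quality = 1.0  # a missing q parameter means q=1
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not acceptable"
        if tag == "*" or quality <= 0:
            continue
        primary = tag.split("-")[0]  # "de-AT" -> "de"
        weighted.append((-quality, position, primary))
    ordered = []
    for _, _, lang in sorted(weighted):
        if lang not in ordered:
            ordered.append(lang)
    return ordered or [default]


def resolve_languages(header, user_override=None):
    """Put an explicit user choice (hypothetical setting) ahead of the header."""
    langs = parse_accept_language(header)
    if user_override:
        langs = [user_override] + [l for l in langs if l != user_override]
    return langs


print(resolve_languages("de-DE,de;q=0.9,en;q=0.8,fr;q=0.5"))
print(resolve_languages("en-US,en;q=0.9,de;q=0.7", user_override="de"))
```

The resulting list only expresses a preference order; it says nothing about the language of the query text itself, which is the limitation discussed in the rest of this thread.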
|
|
Re: Preparing the ground for a real multilang index

I believe the proper way is for the server to compute a list of accepted languages in order of preference: the web-platform language (e.g. the user setting), and the values in the Accept-Language HTTP header (which come from the browser or platform).

Then you expand your query for surfing waves (say) to:
- phrase query: surfing waves exactly (^2.0)
- two terms, no stemming: surfing waves (^1.5)
- iterate through the languages and query for stemmed variants:
  - english: surf wav ^1.0
  - german: surfing wave ^0.9
  - ....
- then maybe even try the phonetic analyzer (matched in a separate field probably)

I think this is a common pattern on the web, where the users, browsers, and servers are all somewhat multilingual.

paul

On 02-Jul-09 at 22:15, Otis Gospodnetic wrote:
> [...]
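The boosted expansion described above could be assembled on the client side roughly as follows. This is only a sketch: the field names ("title" for the unstemmed field, "title_en", "title_de" for the per-language fields) are hypothetical, the boosts are the ones from the list above with a fixed 0.1 step per language, and it assumes each per-language field has its own stemming analyzer configured, so that the stemming happens on the server when the query is analyzed.

```python
import re

# Characters with special meaning in the Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_term(term):
    """Backslash-escape query-syntax characters in user input."""
    return _SPECIAL.sub(r"\\\1", term)


def expand_query(text, languages, unstemmed_field="title",
                 field_pattern="title_{lang}"):
    """Build one boosted OR query across an unstemmed and per-language fields."""
    terms = [escape_term(t) for t in text.split()]
    if not terms:
        return ""
    joined = " ".join(terms)
    clauses = []
    if len(terms) > 1:
        # Exact phrase match gets the highest boost.
        clauses.append(f'{unstemmed_field}:"{joined}"^2.0')
    # The same terms, unstemmed.
    clauses.append(f"{unstemmed_field}:({joined})^1.5")
    boost = 1.0
    for lang in languages:  # most preferred language first
        field = field_pattern.format(lang=lang)
        clauses.append(f"{field}:({joined})^{boost:.1f}")
        boost = max(boost - 0.1, 0.1)
    return " OR ".join(clauses)


print(expand_query("surfing waves", ["en", "de"]))
```

The number of clauses grows with the number of languages, so this is still the "huge OR" discussed earlier in the thread, only with the ranking made explicit.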
|
|
Re: Preparing the ground for a real multilang index

On 03.07.2009 00:49 Paul Libbrecht wrote:

[I'll try to address the other responses as well]

> I believe the proper way is for the server to compute a list of
> accepted languages in order of preferences.
> The web-platform language (e.g. the user-setting), and the values in
> the Accept-Language http header (which are from the browser or
> platform).

None of this is going to help much, because the main application is a scientific search portal for books and articles with many users searching cross-language. The most typical use case is a German user searching multilingually. So we might even get a multilingual query, e.g. TITLE:cancer OR TITLE:krebs. There is no way here to rely on Accept headers or a language select field (it would be left on "any" in most cases). Another popular use case is citations (in whatever language) cut and pasted into the search field.

> Then you expand your query for surfing waves (say) to:
> - phrase query: surfing waves exactly (^2.0)
> - two terms, no stemming: surfing waves (^1.5)
> - iterate through the languages and query for stemmed variants:
>   - english: surf wav ^1.0
>   - german surfing wave ^0.9
>   - ....
> - then maybe even try the phonetic analyzer (matched in a separate
>   field probably)

This is an even more sophisticated variant of the multiple "OR" I came up with. Oh well...

> I think this is a common pattern on the web where the users, browsers,
> and servers are all somewhat multilingual.

Indeed, and often users are not even aware of it. Especially in a scientific context they use their native tongue and English almost interchangeably, and they expect the search engine to cope with it.

I think the best approach would be to process the data according to its language but not make any assumptions about the query language, and I am totally lost as to how to get a clever schema.xml out of all this.

Thanks everyone for listening; I am still open to good suggestions for dealing with this problem!

-Michael
|
|
Re: Preparing the ground for a real multilang index

On 03-Jul-09 at 07:43, Michael Lackhoff wrote:

> All this is not going to help much because the main application is a
> scientific search portal for books and articles with many users
> searching cross-language. The most typical use case is a German user
> searching multilingual. So we might even get the search multilingual,
> e.g. TITLE:cancer OR TITLE:krebs. No way here to watch out for
> Accept-headers or a language select field (would be left on "any" in
> most cases). Other popular use cases are citations (in whatever
> language) cut and pasted into the search field.

[...] of the query's language. You have no other way to offer any form of stemming in each language (e.g. removing -ing and removing -ung) than to actually do this.

Is it because you use Solr directly that languages can't be passed around? You need a server part to get the headers, indeed.

Oh, and yes, you have to double everything I described in order to prefer matches in the title, by the way.

We've implemented something that might be close to what you're searching for: the i2geo search, which gets much closer to the cross-lingual problem by requesting entity designation. It's under APL. Try searching for, say, Viereck in the search box. See a short description at: http://i2geo.net/xwiki/bin/view/About/GeoSkills

> I think the best would be to process the data according to its
> language but don't make any assumptions about the query language and
> I am totally lost how to get a clever schema.xml out of all this.

Just OR them properly. Storing different languages in different fields (title-de, title-en) is the right way to get the schema.xml properly configured with an analyzer, I think.

paul
|
|
Re: Preparing the ground for a real multilang index

When using stemming, you have to know the query language. For your project, perhaps you should look into switching to a lemmatizer instead. I believe Lucid can provide integration with a commercial lemmatizer. This way you can expand the document field itself and do not need to know the query language. You may then want to do a copyfield from all your text_<lang> -> text for a convenient one-field-to-rule-them-all search.

--
Jan Høydahl
Founder & senior architect
Cominvent AS, Stabekk, Norway
www.cominvent.com
+20 100930908

On 3 July 2009, at 08:43, Michael Lackhoff wrote:
> [...]
|
|
Re: Preparing the ground for a real multilang index

There is an alternative to knowing the language at query time: multiply-process for stems or lemmas of all the possible languages. This may well be a cure much worse than the disease.

Yes, LI can sell you our lemma-production capability.

--benson margulies
basis technology

On Tue, Jul 7, 2009 at 6:50 PM, Jan Høydahl<jh@...> wrote:
> [...]
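A toy Python sketch of the "process for all possible languages" idea, to show both why it works and why it can hurt. The suffix rules below are deliberately crude stand-ins invented for this example (they are not real English or German stemmers and not anything Solr ships); the point is only the mechanics of putting every language's variant of a token into one field, at index time and again at query time.

```python
def strip_suffixes(token, suffixes):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Toy per-language rules, invented for illustration only.
TOY_STEMMERS = {
    "en": lambda t: strip_suffixes(t, ("ing", "es", "s")),
    "de": lambda t: strip_suffixes(t, ("ung", "en", "e")),
}


def all_language_variants(token):
    """The token itself plus its stem under every known language."""
    token = token.lower()
    variants = {token}
    for stem in TOY_STEMMERS.values():
        variants.add(stem(token))
    return variants


def index_terms(text):
    """All variants of all tokens, as they would be stored in one field."""
    terms = set()
    for token in text.split():
        terms |= all_language_variants(token)
    return terms


def matches(query, document):
    """True if every query token shares at least one variant with the document."""
    doc_terms = index_terms(document)
    return all(all_language_variants(tok) & doc_terms for tok in query.split())


print(sorted(index_terms("Surfing waves")))
print(matches("surf", "Surfing waves"))
print(matches("wave", "Surfing waves"))
```

Note that in this toy setup the query "wave" only matches "waves" because the German rule strips the final "e" from an English word. With many languages, every token is run through rules that were never meant for it, which is one way the cure can become worse than the disease.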
|
|
Re: Preparing the ground for a real multilang index

On 08.07.2009 00:50 Jan Høydahl wrote:

> itself and do not need to know the query language. You may then want
> to do a copyfield from all your text_<lang> -> text for convenient
> one-field-to-rule-them-all search.

Would that really help? As I understand it, copyfield takes the raw, not yet analyzed field value. I cannot yet see the advantage of this "text" field over the current situation with no text_<lang> fields at all. The copied-to text field has to be language-agnostic with no stemming at all, so it would miss many hits. Or is there a way to combine many differently stemmed variants into one field, to be able to search against all of them at once? That would be great indeed!

-Michael
|
|
Re: Preparing the ground for a real multilang index

Can't the copy field use a different analyzer, both for query and indexing? Otherwise you need to craft your own analyzer which reads the language from the field name... there are several classes ready for this.

paul

On 08-Jul-09 at 02:36, Michael Lackhoff wrote:
> [...]
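The "analyzer which reads the language from the field name" idea can be illustrated outside of Solr with a small dispatch table. This is a sketch in Python, not a Lucene Analyzer: the field naming scheme (text_en, text_de), the toy suffix rules, and the fallback to a lowercase-only analysis for unknown or missing language suffixes are all assumptions made for the example.

```python
def lowercase_only(text):
    """Language-agnostic fallback: lowercase and split on whitespace."""
    return text.lower().split()


def toy_english(text):
    # Toy rule for illustration: strip "-ing" from longer tokens.
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t
            for t in lowercase_only(text)]


def toy_german(text):
    # Toy rule for illustration: strip "-ung" from longer tokens.
    return [t[:-3] if t.endswith("ung") and len(t) > 5 else t
            for t in lowercase_only(text)]


ANALYZERS = {"en": toy_english, "de": toy_german}


def language_of(field_name):
    """Return the suffix after the last underscore, or "" if there is none."""
    _, sep, suffix = field_name.rpartition("_")
    return suffix if sep else ""


def analyze(field_name, text):
    """Pick the analysis chain from the field name, falling back to no stemming."""
    analyzer = ANALYZERS.get(language_of(field_name), lowercase_only)
    return analyzer(text)


print(analyze("text_en", "Surfing waves"))
print(analyze("text_de", "Die Forschung"))
print(analyze("text_xy", "Surfing waves"))
```

The fallback branch corresponds to the catch-all "language" without a stemmer that was asked about at the start of the thread: exotic or unidentified languages still get indexed, just without language-specific normalization.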
|
|
Re: Preparing the ground for a real multilang index

Michael, you're of course right, copyfield would copy from the source. The lack of built-in language awareness in Solr is unfortunate :(

I have not tried Lucid's BasisTech lemmatizer implementation, but check with them whether they can support multiple languages in the same field.

--
Jan Høydahl

On 8 July 2009, at 16:32, Paul Libbrecht wrote:
> [...]