Re: Preparing the ground for a real multilang index
On 03.07.2009 00:49 Paul Libbrecht wrote:
[I'll try to address the other responses as well]
> I believe the proper way is for the server to compute a list of
> accepted languages in order of preferences.
> The web-platform language (e.g. the user-setting), and the values in
> the Accept-Language http header (which are from the browser or
> platform).
All this is not going to help much, because the main application is a
scientific search portal for books and articles, with many users
searching across languages. The most typical use case is a German user
searching in several languages at once. We might even get a query that
is itself multilingual, e.g. TITLE:cancer OR TITLE:krebs. There is no
way here to rely on Accept-Language headers or a language select field
(it would be left on "any" in most cases). Another popular use case is
citations (in whatever language) cut and pasted into the search field.
> Then you expand your query for surfing waves (say) to:
> - phrase query: surfing waves exactly (^2.0)
> - two terms, no stemming: surfing waves (^1.5)
> - iterate through the languages and query for stemmed variants:
> - english: surf wav ^1.0
> - german surfing wave ^0.9
> - ....
> - then maybe even try the phonetic analyzer (matched in a separate
> field probably)
This is an even more sophisticated variant of the multiple "OR" I came
up with. Oh well...
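
For illustration, the expansion Paul describes could be sketched roughly
as follows. This is only a sketch under assumptions: the field names
text_en and text_de, the helper expand_query, and the toy lookup tables
standing in for real stemmers are all hypothetical, and the boosts are
simply the ones from the quoted example. It emits a Lucene-style query
string and does not escape special characters in the user input.

```python
def expand_query(text, stemmers):
    """Build one OR query from exact, unstemmed and per-language stemmed variants.

    stemmers is a list of (field, stem_function, boost) tuples.
    All field names and stem functions here are hypothetical.
    """
    terms = text.split()
    clauses = []
    if len(terms) > 1:
        # Exact phrase match gets the highest boost.
        clauses.append('"%s"^2.0' % " ".join(terms))
    # The same terms, unstemmed, against the default field.
    clauses.append("(%s)^1.5" % " ".join(terms))
    # One clause per language, using that language's stemmed variants.
    for field, stem, boost in stemmers:
        stemmed = " ".join(stem(t) for t in terms)
        clauses.append("%s:(%s)^%.1f" % (field, stemmed, boost))
    return " OR ".join(clauses)


# Toy stand-ins for real stemmers, hard-coded to reproduce the quoted example.
EN = {"surfing": "surf", "waves": "wav"}
DE = {"waves": "wave"}

STEMMERS = [
    ("text_en", lambda t: EN.get(t, t), 1.0),
    ("text_de", lambda t: DE.get(t, t), 0.9),
]

query = expand_query("surfing waves", STEMMERS)
print(query)
```

A phonetic variant would just be one more entry in STEMMERS, pointing at
a separate field.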
> I think this is a common pattern on the web where the users, browsers,
> and servers are all somewhat multilingual.
Indeed, and often users are not even aware of it. Especially in a
scientific context, they use their native tongue and English almost
interchangeably -- and they expect the search engine to cope with it.
I think the best approach would be to process the data according to its
language but not make any assumptions about the query language. I am
totally lost, though, on how to get a clever schema.xml out of all this.
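
One possible reading of "process the data by language, assume nothing
about the query" is sketched below. It is not a worked-out schema: the
field naming scheme title_<lang>, the language list, and both helper
functions are assumptions made for illustration. Each title_* field
would presumably need its own analyzer declared in schema.xml, which
this sketch does not cover.

```python
# Assumed set of languages with dedicated analysis; anything else goes to "other".
LANGUAGES = ["en", "de"]


def route_document(doc):
    """Index time: put the title into a field named after the document's language."""
    lang = doc["lang"] if doc["lang"] in LANGUAGES else "other"
    return {"id": doc["id"], "title_" + lang: doc["title"]}


def build_query(user_query):
    """Query time: search every per-language field, whatever language the user typed."""
    fields = ["title_" + lang for lang in LANGUAGES + ["other"]]
    return " OR ".join("%s:(%s)" % (field, user_query) for field in fields)


indexed = route_document({"id": "1", "lang": "de", "title": "Krebs"})
query = build_query("cancer krebs")
print(indexed)
print(query)
```

The query is sent to all fields unchanged, so each field's analyzer gets
to stem it in its own language and no guess about the user's language is
needed.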
Thanks everyone for listening and I am still open for good suggestions
to deal with this problem!
-Michael