Multifield query parser and phrase query behaviour from 1.3 to 1.4


by Jérôme Etévé

Hi All,
 I'm using a multifield query parser to generate weighted queries
across different fields.

For instance, the query "perl developer" gives me:
+(title:perl^10.0 keywords:perl company:perl^3.0)
+(title:developer^10.0 keywords:developer company:developer^3.0)

In both Solr 1.3 and Solr 1.4 (from 12 Oct 2009), a query like
"d'affaire" gives me:
title:"d affaire"^10.0 keywords:"d affaire" company:"d affaire"^3.0

nb: "d" is not a stopword

That's the first thing I don't get: since "d'affaire" is parsed as two
separate tokens, 'd' and 'affaire', why do these phrase queries appear?

When I use the analysis interface of solr, "d'affaire" gives (for
query or indexing, since the analyzer is the same):
term position     1     2
term text         d     affaire
term type         word  word
source start,end  0,1   2,9

You can't see it in this email, but 'd' and 'affaire' are both purple,
indicating a match with the query tokens.

I don't really get why these two tokens are subsequently put together
in a phrase query.

In Solr 1.3, this didn't seem to be a problem: title:"d affaire"
matches documents whose title contains "d'affaire", and all is fine.
That's the behaviour we should expect, since the title field uses
exactly the same analyzer at index and query time.

Since moving to Solr 1.4, title:"d affaire" no longer returns any results.

Is there any behaviour change that could be responsible for this, and
what's the correct way to fix this?

Thanks for your help.

Jerome.

--
Jerome Eteve.
http://www.eteve.net
jerome@...

Re: Multifield query parser and phrase query behaviour from 1.3 to 1.4

by Yonik Seeley-2

On Tue, Oct 27, 2009 at 8:44 AM, Jérôme Etévé <jerome.eteve@...> wrote:
> I don't really get why these two tokens are subsequently put together
> in a phrase query.

That's the way the Lucene query parser has always worked... phrase
queries are made if multiple tokens are produced from one field query.
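
To make that rule concrete, here is a minimal sketch in Python. It is not the Lucene query parser: analyze() is a toy stand-in for a real analyzer, and field_query() and multi_field_query() are hypothetical helper names. It only illustrates the rule that one token yields a term query and several tokens yield a phrase query.

```python
import re

def analyze(text):
    # Toy analyzer (an assumption, not a Solr analyzer): lowercase the text
    # and split it on anything that is not a word character.
    return re.findall(r"\w+", text.lower())

def field_query(field, text, boost=None):
    # One token gives a term query; several tokens give a phrase query.
    tokens = analyze(text)
    if not tokens:
        return ""
    if len(tokens) == 1:
        clause = field + ":" + tokens[0]
    else:
        clause = field + ':"' + " ".join(tokens) + '"'
    if boost is not None:
        clause += "^" + str(boost)
    return clause

def multi_field_query(fields, text):
    # Build the same clause for every (field, boost) pair.
    return " ".join(field_query(name, text, boost) for name, boost in fields)

FIELDS = [("title", 10.0), ("keywords", None), ("company", 3.0)]
print(multi_field_query(FIELDS, "d'affaire"))
```

With the field list above, "d'affaire" analyzes to two tokens in each field, so every clause comes out as a phrase query.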

> In solr 1.3, it didn't seem to be a problem though. title:"d affaire"
> matches document where title contains "d'affaire" and all is fine.

This should not have changed between 1.3 and 1.4...
What's the fieldType, and its definition, for your title field?

-Yonik
http://www.lucidimagination.com

Re: Multifield query parser and phrase query behaviour from 1.3 to 1.4

by Jérôme Etévé

Hum,
 That's probably because of our own customized types/tokenizers/filters.

I tried reindexing and querying our data using the default solr type
'textgen' and it works fine.

I need to investigate which features of the new Lucene 2.9 API are not
implemented in our own tokenizers etc.

Thanks.

Jerome.


Re: Multifield query parser and phrase query behaviour from 1.3 to 1.4

by Jérôme Etévé

Actually, here is the difference between the textgen analysis pipeline and ours:

For the phrase "ingenieur d'affaire senior" ,
Our pipeline gives right after our tokenizer:

term position  1          2  3        4
term text      ingenieur  d  affaire  senior

'd' and 'affaire' are separated as different tokens straight away. Our
filters have no later effect for this phrase.

* The textgen pipeline uses a whitespace tokenizer, so it gives first:
term position     1          2          3
term text         ingenieur  d'affaire  senior
term type         word       word       word
source start,end  0,9        10,19      20,26

* Then a word delimiter filter splits the token "d'affaire" (and
generates the concatenation):
term position     1          2      3        4
term text         ingenieur  d      affaire  senior
term type         word       word   word     word
source start,end  0,9        10,11  12,19    20,26
concatenated token: daffaire (term type word, source start,end 10,19)
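
For readers who want to reproduce the offsets above, here is a rough Python sketch of the two steps. It is only an approximation: whitespace_tokenize() and word_delimiter() are hypothetical names, the splitting rule is simplified to "split on anything that is not a letter or digit", and token positions are deliberately left out. It is not Solr's WordDelimiterFilter.

```python
import re

def whitespace_tokenize(text):
    # Return (token, start, end) for every run of non-whitespace characters.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def word_delimiter(tokens, catenate=True):
    # Split each token on non-alphanumeric characters, keeping offsets
    # relative to the original text. When a token was split into several
    # parts and catenate is set, also emit the parts glued together,
    # spanning from the first part's start to the last part's end.
    out = []
    for text, start, _end in tokens:
        parts = [(m.group(), start + m.start(), start + m.end())
                 for m in re.finditer(r"[^\W_]+", text)]
        out.extend(parts)
        if catenate and len(parts) > 1:
            glued = "".join(p[0] for p in parts)
            out.append((glued, parts[0][1], parts[-1][2]))
    return out

for token in word_delimiter(whitespace_tokenize("ingenieur d'affaire senior")):
    print(token)
```

Under these assumptions the sketch yields the same token texts and start,end offsets as the analysis output above, including the extra "daffaire" token spanning 10,19.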


Could you see a reason why title:"d affaire" works with textgen but
not with our type?

Thanks!

Jerome.



Re: Multifield query parser and phrase query behaviour from 1.3 to 1.4

by Jérôme Etévé

Mea maxima culpa,

I had foolishly set the option omitTermFreqAndPositions="true" in an
attempt to save space. Without term positions in the index, phrase
queries have nothing to match against.
It works when this is set to "false".

However, even when it's set to "true", the highlighting of a field
continues to work even if the search doesn't.
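
A toy illustration of why this option matters for phrase queries, written in Python rather than against Lucene. build_index() and phrase_search() are hypothetical names, and the inverted index is a plain dictionary, so this is only a sketch of the idea: a phrase match has to check that the terms sit at consecutive positions, and with positions omitted there is nothing to check.

```python
from collections import defaultdict

def build_index(docs, omit_positions):
    # term -> {doc_id: [positions]}; the position lists stay empty when
    # positions are omitted, which is the space saving the option aims at.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Toy analysis (an assumption): apostrophes and spaces separate tokens.
        for pos, term in enumerate(text.lower().replace("'", " ").split()):
            postings = index[term].setdefault(doc_id, [])
            if not omit_positions:
                postings.append(pos)
    return index

def phrase_search(index, phrase):
    # Return the ids of documents containing the terms at consecutive positions.
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        if any(all(start + i in index[t][doc_id] for i, t in enumerate(terms))
               for start in starts):
            hits.append(doc_id)
    return hits

DOCS = {1: "ingenieur d'affaire senior", 2: "affaire d ingenieur"}
print(phrase_search(build_index(DOCS, omit_positions=False), "d affaire"))
print(phrase_search(build_index(DOCS, omit_positions=True), "d affaire"))
```

The first search can confirm adjacency in document 1; the second finds both terms in both documents but has no positions to compare, so it returns nothing, which mirrors the symptom described in this thread.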
Does the highlighter use a different strategy to match the query terms
in the fields?

Cheers!

Jerome.


Re: Multifield query parser and phrase query behaviour from 1.3 to 1.4

by hossman


: However, even when it's set to "true", the highlighting of a field
: continues to work even if the search doesn't.
: Does the highlighter use a different strategy to match the query terms
: in the fields?

If the field has term vectors, the highlighter uses them; otherwise it
re-analyzes the stored field.
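
A rough sketch of the second path (re-analyzing the stored text), in Python. This is not Solr's highlighter: highlight() is a hypothetical function and the "analysis" is just a regular expression. The point is that this path works from the stored text and never consults the positions in the index, so it can still mark terms when a phrase search cannot match.

```python
import re

def highlight(stored_text, query_terms, pre="<em>", post="</em>"):
    # Re-tokenize the stored text and wrap every token that matches a
    # query term, using the token's character offsets in the text.
    wanted = {t.lower() for t in query_terms}
    out = []
    last = 0
    for m in re.finditer(r"\w+", stored_text):
        if m.group().lower() in wanted:
            out.append(stored_text[last:m.start()])
            out.append(pre + m.group() + post)
            last = m.end()
    out.append(stored_text[last:])
    return "".join(out)

print(highlight("Ingenieur d'affaire senior", ["d", "affaire"]))
```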


-Hoss