leading and trailing wildcard query


leading and trailing wildcard query

by A. Steven Anderson

I've scoured the archives and JIRA, but the answer to my question is just
not clear to me.

With all the new Solr 1.4 features, is there any way to do a leading and
trailing wildcard query on an *untokenized* field?

e.g. q=myfield:*abc* would return a doc with myfield=xxxabcxxx

Yes, I know how expensive such a query would be, but we have the user
requirement, nonetheless.

If not, any suggestions on how to implement a custom solution using Solr?
Using an external data structure?

--
A. Steven Anderson

Re: leading and trailing wildcard query

by A. Steven Anderson

No thoughts on this? Really!?

I would hate to admit to my Oracle DBE that Solr can't be customized to do a
common query that a relational database can do. :-(


--
A. Steven Anderson

Re: leading and trailing wildcard query

by Otis Gospodnetic

The guilt trick is not the best thing to try on public mailing lists. :)

The first thing that popped to my mind is to use 2 fields, where the second one contains the desrever string of the first one.
The second idea is to use n-grams (if it's OK to tokenize), more specifically edge n-grams.
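
A minimal sketch of the two-field idea, assuming a hypothetical second field called myfield_rev that holds the reversed value. A leading-wildcard pattern such as *abc then becomes an ordinary prefix query cba* on the reversed field. The function names are made up for illustration and are not part of Solr. A pattern with wildcards on both ends, like *abc*, gains nothing from this rewrite.

```python
def reverse_value(value: str) -> str:
    """Value to store in the (hypothetical) reversed companion field."""
    return value[::-1]


def rewrite_leading_wildcard(field: str, pattern: str) -> str:
    """Rewrite '*abc' on `field` into a prefix query on `field`_rev.

    Patterns that do not consist of exactly one leading '*' are returned
    unchanged, because reversing them would not remove the leading wildcard.
    """
    if pattern.startswith("*") and "*" not in pattern[1:]:
        return f"{field}_rev:{pattern[1:][::-1]}*"
    return f"{field}:{pattern}"


print(reverse_value("xxxabc"))                       # cbaxxx
print(rewrite_leading_wildcard("myfield", "*abc"))   # myfield_rev:cba*
print(rewrite_leading_wildcard("myfield", "*abc*"))  # myfield:*abc*
```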

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR





Re: leading and trailing wildcard query

by A. Steven Anderson

>
> The guilt trick is not the best thing to try on public mailing lists. :)
>

Point taken, although that was not my intention.  I guess I have been spoiled by
quick replies and was starting to think it was a stupid question.

Plus, I'm literally gonna get trash talk from my Oracle DBE if I can't make
this work. ;-)

We've basically relegated Oracle to handling ingest; we index Solr from it,
and Solr provides all search features.  I'd hate to have to fall back on
Oracle to service this one special query.


> The first thing that popped to my mind is to use 2 fields, where the second
> one contains the desrever string of the first one.
>

Please elaborate. What do you mean by *desrever* string?


> The second idea is to use n-grams (if it's OK to tokenize), more
> specifically edge n-grams.
>

Well, that's the problem.  The field may contain non-Latin characters and may
have neither whitespace nor punctuation.


--
A. Steven Anderson

RE: leading and trailing wildcard query

by Bernadette Houghton

I've just set up something similar (many thanks to Avesh!):

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="5" maxGramSize="25" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldType>

<fieldType name="doubleedgytext" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.NGramFilterFactory" minGramSize="5" maxGramSize="25" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldType>
.
.
   <field name="beginswith" type="edgytext" indexed="true" stored="false" multiValued="true"/>
   <field name="contains" type="doubleedgytext" indexed="true" stored="false" multiValued="true"/>
.
.
   <!-- Copy for BEGINSWITH search -->
   <copyField source="content" dest="beginswith"/>
   <copyField source="*_t" dest="beginswith"/>
   <copyField source="*_mt" dest="beginswith"/>
   
   <!-- Copy for CONTAINS search -->
   <copyField source="content" dest="contains"/>
   <copyField source="*_t" dest="contains"/>
   <copyField source="*_mt" dest="contains"/>

bern



Re: leading and trailing wildcard query

by Walter Underwood-2

Doesn't it work to call SolrQueryParser.setAllowLeadingWildcard?

It can be really slow, what an RDBMS person would call a full table scan.

There is an open bug to make that settable in a config file, but this is a
pretty tiny change to the source.

    http://issues.apache.org/jira/browse/SOLR-218

wunder



Re: leading and trailing wildcard query

by A. Steven Anderson

Thanks for the solution, but could you elaborate on how it would find
something like *abc* in a field that contains xxxxabcxxxx?

Steve


Re: leading and trailing wildcard query

by Erick Erickson

Because that is the semantics of Solr/Lucene wildcard syntax. * stands for
"any number of any character". Basically, it enumerates all the terms in the
field for all the documents, assembles a list of all of them that contain the
substring "abc", and uses that as one of the clauses of your search...
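
A rough sketch of that enumeration, using a plain dictionary as a stand-in for the term dictionary of one field. The data and the function name are hypothetical, and this is not how Lucene is implemented internally; it only shows why every term has to be visited.

```python
from typing import Dict, List, Set, Tuple


def expand_contains(term_index: Dict[str, Set[int]],
                    substring: str) -> Tuple[List[str], List[int]]:
    """Visit every indexed term, keep those containing `substring`,
    and collect the ids of the documents that hold them."""
    matching_terms = sorted(t for t in term_index if substring in t)
    doc_ids: Set[int] = set()
    for term in matching_terms:
        doc_ids |= term_index[term]
    return matching_terms, sorted(doc_ids)


# term -> ids of documents whose untokenized field has that value (made up)
term_index = {"xxxabcxxx": {1}, "abcdef": {2, 3}, "zzz": {4}}

terms, docs = expand_contains(term_index, "abc")
print(terms)  # ['abcdef', 'xxxabcxxx']
print(docs)   # [1, 2, 3]
```

The cost grows with the number of distinct terms in the field, which is the "full table scan" behaviour mentioned elsewhere in this thread.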

Best
Erick


Re: leading and trailing wildcard query

by A. Steven Anderson

> Doesn't it work to call SolrQueryParser.setAllowLeadingWildcard?


Good question.  Anyone?


> It can be really slow, what an RDBMS person would call a full table scan.


Understood.


> There is an open bug to make that settable in a config file, but this is a
> pretty tiny change to the source.
>   http://issues.apache.org/jira/browse/SOLR-218
>

Unfortunately, we can only use official releases (not even snapshots) since
it's a government-related project.

--
A. Steven Anderson

RE: leading and trailing wildcard query

by Bernadette Houghton

Hi Steve, a query such as *abc* would need the NGramFilterFactory, hence the doubleedgytext type; the document would then be retrievable by a query such as contains:abc. Note that you can set the minimum and maximum size of the strings that get indexed.
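
To make the difference between the two field types concrete, here is a small sketch of which strings end up in the index. It imitates a keyword tokenizer followed by lowercasing and (edge) n-gram generation in plain Python; the function names are illustrative, and a minimum gram size of 3 is assumed so that a three-character query can match.

```python
from typing import Set


def edge_ngrams(value: str, min_gram: int, max_gram: int) -> Set[str]:
    """Prefixes of the lowercased value: supports 'begins with' lookups."""
    value = value.lower()
    longest = min(max_gram, len(value))
    return {value[:n] for n in range(min_gram, longest + 1)}


def ngrams(value: str, min_gram: int, max_gram: int) -> Set[str]:
    """All substrings of the lowercased value within the size bounds:
    supports 'contains' lookups."""
    value = value.lower()
    longest = min(max_gram, len(value))
    return {value[i:i + n]
            for n in range(min_gram, longest + 1)
            for i in range(len(value) - n + 1)}


value = "xxxxABCxxxx"
print("abc" in ngrams(value, 3, 25))         # True
print("abc" in edge_ngrams(value, 3, 25))    # False
print("xxxxa" in edge_ngrams(value, 3, 25))  # True
```

The query side only lowercases the whole input, so contains:abc is a plain term lookup for "abc" among the indexed grams, with no wildcard involved.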

bern


Re: leading and trailing wildcard query

by Walter Underwood-2

Ah. With that restriction, it is impossible.

If it is OK to pay Lucid to make a one-line change, you might be able  
to do it. Otherwise, get ready to spend a lot of money for a search  
engine.

wunder

On Nov 5, 2009, at 3:18 PM, A. Steven Anderson wrote:

> Unfortunately, we can only use official releases (not even  
> snapshots) since
> it's a government-related project.
>
> --
> A. Steven Anderson


Re: leading and trailing wildcard query

by A. Steven Anderson

> Hi Steve, a query such as *abc* would need the NGramFilterFactory, hence the
> doubleedgytext, and would be retrievable by a query such as contains:abc.
> Note that you can set the max and minimum size of strings that get indexed.
>

Excellent!  Just to clarify though, NGramFilterFactory is a Solr 1.4 feature
only, correct?

--
A. Steven Anderson

Re: leading and trailing wildcard query

by Walter Underwood-2

Note that N-grams are limited to specific string lengths. I presume  
that you need to search for arbitrary strings, not just three-letter  
ones.
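
The length limit can be illustrated with a short sketch. It assumes the query string is looked up as a single lowercased term, as with the keyword-tokenizer query analyzer quoted in this thread; the helper names are hypothetical. With the minGramSize="5" setting from that configuration, a three-character query such as abc has no indexed gram to match, and a query longer than maxGramSize fails in the same way.

```python
from typing import Set


def indexed_grams(value: str, min_gram: int, max_gram: int) -> Set[str]:
    """Substrings of the lowercased value with min_gram <= length <= max_gram."""
    value = value.lower()
    longest = min(max_gram, len(value))
    return {value[i:i + n]
            for n in range(min_gram, longest + 1)
            for i in range(len(value) - n + 1)}


def is_found(value: str, query: str, min_gram: int, max_gram: int) -> bool:
    """True if the whole lowercased query equals one of the indexed grams."""
    return query.lower() in indexed_grams(value, min_gram, max_gram)


print(is_found("xxxxabcxxxx", "abc", 5, 25))    # False: shorter than min
print(is_found("xxxxabcxxxx", "xabcx", 5, 25))  # True
print(is_found("xxxxabcxxxx", "abc", 3, 25))    # True once min is 3
print(is_found("a" * 30, "a" * 26, 5, 25))      # False: longer than max
```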

wunder



Re: leading and trailing wildcard query

by A. Steven Anderson

> Ah. With that restriction, it is impossible.
> If it is OK to pay Lucid to make a one-line change, you might be able to do
> it. Otherwise, get ready to spend a lot of money for a search engine.
>

Well, now that Lucid is getting In-Q-Tel $$$, they will soon learn that
official releases are all that matter, and 12-18 month release cycles are
not acceptable. ;-)

--
A. Steven Anderson

Re: leading and trailing wildcard query

by A. Steven Anderson

> Note that N-grams are limited to specific string lengths. I presume that
> you need to search for arbitrary strings, not just three-letter ones.
>

Understood, but that is a limitation that we can live with.

Thanks!
--
A. Steven Anderson

RE: leading and trailing wildcard query

by Bernadette Houghton

Not sure what version it was supported from, but we're on 1.3.
bern

-----Original Message-----
From: A. Steven Anderson [mailto:a.steven.anderson@...]
Sent: Friday, 6 November 2009 10:25 AM
To: solr-user@...
Subject: Re: leading and trailing wildcard query

> Hi Steve, a query such as *abc* would need the NGramFilterFactory, hence the
> doubleedgytext, and would be retrievable by a query such as contains:abc.
> Note that you can set the max and minimum size of strings that get indexed.
>

Excellent!  Just to clarify though, NGramFilterFactory is a Solr 1.4 feature
only, correct?

--
A. Steven Anderson

Re: leading and trailing wildcard query

by A. Steven Anderson

> Not sure what version it was supported from, but we're on 1.3.


Really!? Great answer!

Thanks!
--
A. Steven Anderson

Re: leading and trailing wildcard query

by Andrzej Bialecki

A. Steven Anderson wrote:

> No thoughts on this? Really!?
>
> I would hate to admit to my Oracle DBE that Solr can't be customized to do a
> common query that a relational database can do. :-(
>
>
> On Wed, Nov 4, 2009 at 6:01 PM, A. Steven Anderson <
> a.steven.anderson@...> wrote:
>
>> I've scoured the archives and JIRA, but the answer to my question is just
>> not clear to me.
>>
>> With all the new Solr 1.4 features, is there any way to do a leading and
>> trailing wildcard query on an *untokenized* field?
>>
>> e.g. q=myfield:*abc* would return a doc with myfield=xxxabcxxx
>>
>> Yes, I know how expensive such a query would be, but we have the user
>> requirement, nonetheless.
>>
>> If not, any suggestions on how to implement a custom solution using Solr?
>> Using an external data structure?

You can use ReversedWildcardFilterFactory, which creates an additional
reversed token for each token (in your case, a single additional token :) )
_and_ also triggers setAllowLeadingWildcard in the QueryParser. It won't
help much with performance though, due to the trailing wildcard in
your original query. Please see the discussion in SOLR-1321 (this will
be available in 1.4, but it should be easy to patch 1.3 to use it).

If you really need to support such queries efficiently, you should
implement full permuterm indexing, i.e. a token filter that rotates
tokens and adds all rotations (with a special marker to mark the
beginning of the word), and a query plugin that detects such query terms
and rotates the query term appropriately.
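
A minimal sketch of the permuterm idea in plain Python, not a Solr token filter or query plugin. It assumes "$" as the marker at the term boundary (any character that cannot occur in the data would do) and handles only patterns of the form X*Y, *X, X* and *X*. All names are illustrative.

```python
from typing import Dict, Iterable, List, Set

MARKER = "$"  # assumed boundary marker; must not occur in indexed terms


def rotations(term: str) -> List[str]:
    """All rotations of term + MARKER, e.g. 'ab' -> 'ab$', 'b$a', '$ab'."""
    s = term + MARKER
    return [s[i:] + s[:i] for i in range(len(s))]


def build_index(terms: Iterable[str]) -> Dict[str, Set[str]]:
    """Map every rotation back to the original term(s)."""
    index: Dict[str, Set[str]] = {}
    for term in terms:
        for rot in rotations(term):
            index.setdefault(rot, set()).add(term)
    return index


def rotate_query(pattern: str) -> str:
    """Rotate a wildcard pattern so that the wildcard ends up last,
    returning the prefix to look up among the rotations."""
    if len(pattern) > 2 and pattern.startswith("*") and pattern.endswith("*"):
        inner = pattern[1:-1]
        if "*" in inner:
            raise ValueError("sketch handles X*Y, *X, X* and *X* only")
        return inner                    # *X*  -> prefix X
    if pattern.count("*") != 1:
        raise ValueError("sketch handles X*Y, *X, X* and *X* only")
    head, tail = pattern.split("*")
    return tail + MARKER + head         # X*Y  -> prefix Y$X


def search(index: Dict[str, Set[str]], pattern: str) -> List[str]:
    prefix = rotate_query(pattern)
    hits: Set[str] = set()
    # A real index would seek into sorted rotations; a scan keeps this short.
    for rot, terms in index.items():
        if rot.startswith(prefix):
            hits |= terms
    return sorted(hits)


index = build_index(["xxxabcxxx", "abcdef", "zzz"])
print(search(index, "*abc*"))  # ['abcdef', 'xxxabcxxx']
print(search(index, "*def"))   # ['abcdef']
```

The trade-off is index size: a term of n characters contributes n + 1 rotations, in exchange for turning every supported wildcard pattern into a prefix lookup.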


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: leading and trailing wildcard query

by Otis Gospodnetic

> Please elaborate. What do you mean by *desrever* string?

Try reading in reverse ;).

Otis


Re: leading and trailing wildcard query

by Chantal Ackermann

Just for the record - this works like a charm:

.../select?q=*potter*&qt=dismax

<response>

<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">93</int>

<lst name="params">
<str name="q">*potter*</str>
<str name="qt">dismax</str>
</lst>
</lst>

<result name="response" numFound="572" start="0" maxScore="5.3375173">
...
<str name="title">L'année où on a découvert «Harry Potter» au cinéma</str>
...

<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      all_text_de^0.5 all_text_en^0.5 all_text_es^0.5
      all_text_fr^0.5 all_text_it^0.5 all_text_nl^0.5 all_text_nolang^0.5
      channel_name_tokens^1.0 role_tokens^1.0 participant_tokens^1.0
    </str>
    <str name="pf">
      title_de^2 title_en^2 title_es^2 title_fr^2
      title_it^2 title_nl^2 title_nolang^2 channel_name_tokens^2 role_tokens^2
      participant_tokens^2
    </str>
    <str name="fl">*,score</str>
    <str name="mm">2&lt;-1 5&lt;80%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>

And the funny thing: ReversedWildcardFilterFactory is still commented
out (I had forgotten that I never reactivated it). And NGram was never part
of my schema.

Happy user of 1.4RC - I'm sure our milestones won't beat the SOLR 1.4
release date.

Cheers,
Chantal