expand synonyms without tokenizing stream?

by doncl

I'm pretty new to solr; my apologies if this is a naive question, and my
apologies for the verbosity:
I'd like to take keywords in my documents, and expand them as synonyms; for
example, if the document gets annotated with a keyword of 'sf', I'd like
that to expand to 'San Francisco'.  (San Francisco,San Fran,SF is a line in
my synonyms.txt file).

But I also want to be able to display facets with counts for these keywords;
I'd like them to be suitable for display.

So, if I define the keywords field as 'text', I use the following pipeline
(from my schema.xml):

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


Faceting on this field, I get the following values back (when I query
specifically for the single document in question):

      <lst name="Keywords">
        <int name="fran">1</int>
        <int name="francisco">1</int>
        <int name="san">1</int>
        <int name="sf">1</int>
      </lst>

I've also done a copyField to a 'KeywordsString' field, which is
defined as "string", i.e.

<fieldType name="string" class="solr.StrField" sortMissingLast="true"
omitNorms="true"/>

Faceting on *that* field (when querying for just this 1 document,
which has a keyword of 'sf'), results in:

      <lst name="KeywordsString">
        <int name="sf">1</int>
      </lst>

I guess what I'd like to see is the ability to stamp keywords like
'sf', 'san fran', 'san francisco', and 'mlb' (with a synonyms.txt file
entry of mlb => Major League Baseball), and see all the documents that
are inscribed with any of those synonym variants come back as:

      <lst name="KeywordsString">
        <int name="San Francisco">1</int>
        <int name="Major League Baseball">1</int>
      </lst>


But I don't know how to define a processing pipeline that expands
synonyms without tokenizing them, which breaks 'San Francisco' into
'san' and 'francisco' and presents those as separate facets.

Thanks for any help,

Don

Re: expand synonyms without tokenizing stream?

by hossman


: I'd like to take keywords in my documents, and expand them as synonyms; for
: example, if the document gets annotated with a keyword of 'sf', I'd like
: that to expand to 'San Francisco'.  (San Francisco,San Fran,SF is a line in
: my synonyms.txt file).
:
: But I also want to be able to display facets with counts for these keywords;
: I'd like them to be suitable for display.
        ...

: I've also done a copyfield to a 'KeywordsString' field, which is
: defined as "string". i.e.
:
: <fieldType name="string" class="solr.StrField" sortMissingLast="true"
: omitNorms="true"/>

It sounds like you are on the right track ... the key is to search on a
field with synonyms expanded (at index time) and facet on a field with
synonyms collapsed (at index time).

try changing the fieldtype you facet on to be a TextField with the
KeywordTokenizer, and then use the SynonymFilter on it ... that should
work (but i haven't tried it)

if you format your synonyms file properly (commas instead of arrows), you
can use the exact same file for both fieldtypes, even though one will
expand, and the other will collapse.
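
A rough, untested sketch of what that facet fieldtype might look like in
schema.xml. The names "keyword_synonym" and "KeywordsFacet" are made up for
illustration, and "Keywords" is assumed to be the name of the source field.
The tokenizerFactory attribute on the filter is an assumption about your
Solr version; the idea is to have multi-word entries in synonyms.txt (such
as 'San Fran') read as whole phrases so they can match the single token the
KeywordTokenizer produces. It assumes synonyms.txt holds comma-separated
lines with the display form first, e.g. "San Francisco,San Fran,SF" and
"Major League Baseball,MLB".

```xml
<!-- Sketch only: "keyword_synonym" and "KeywordsFacet" are hypothetical
     names; "Keywords" is assumed to be the source field. -->
<fieldType name="keyword_synonym" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Emit the whole field value as one token, e.g. "san fran" -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- expand="false" maps every entry on a comma-separated line to the
         first entry on that line. tokenizerFactory (assumed to be
         available) keeps multi-word synonym entries as single phrases
         instead of splitting them on whitespace. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"
            tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Facet on this field; keep searching on the existing "text" field -->
<field name="KeywordsFacet" type="keyword_synonym" indexed="true"
       stored="false" multiValued="true"/>
<copyField source="Keywords" dest="KeywordsFacet"/>
```

If this behaves as intended, a document tagged 'sf' or 'san fran' would
show up under a single "San Francisco" facet value, because the first entry
on each synonyms.txt line is the form that gets indexed. Since the rewrite
happens at index time, documents need to be reindexed after changing the
fieldtype or the synonyms file.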


-Hoss