I'm pretty new to Solr; my apologies if this is a naive question, and for
the verbosity:
I'd like to take keywords in my documents, and expand them as synonyms; for
example, if the document gets annotated with a keyword of 'sf', I'd like
that to expand to 'San Francisco'. (San Francisco,San Fran,SF is a line in
my synonyms.txt file).
But I also want to be able to display facets with counts for these keywords;
I'd like them to be suitable for display.
So, if I define the keywords field as 'text', I use the following pipeline
(from my schema.xml):
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Faceting on this field, I get the following counts (when I query
specifically for the single document in question):
<lst name="Keywords">
<int name="fran">1</int>
<int name="francisco">1</int>
<int name="san">1</int>
<int name="sf">1</int>
</lst>
I've also set up a copyField to a 'KeywordsString' field, which is
defined as type "string", i.e.:
<fieldType name="string" class="solr.StrField" sortMissingLast="true"
omitNorms="true"/>
Faceting on *that* field (when querying for just this 1 document,
which has a keyword of 'sf'), results in:
<lst name="KeywordsString">
<int name="sf">1</int>
</lst>
I guess what I'd like is the ability to tag documents with keywords like
'sf', 'san fran', 'san francisco', and 'mlb' (with a synonyms.txt file
entry of mlb => Major League Baseball), and have all the documents that
are tagged with any of those synonym variants come back as:
<lst name="KeywordsString">
<int name="San Francisco">1</int>
<int name="Major League Baseball">1</int>
</lst>
But I don't know how to define a processing pipeline that expands
synonyms without also tokenizing them, i.e. without breaking
'San Francisco' into 'san' and 'francisco' and presenting those as
separate facets.
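To make the desired behavior concrete, here is a rough sketch in plain
Python (not Solr) of the mapping I'm after. The names CANONICAL,
canonical_keyword and facet_counts are made up for illustration, and the
lookup table is just my guess at how the two synonyms.txt entries above
would collapse to a single display form:

```python
from collections import Counter

# Hypothetical lookup built from the two synonyms.txt entries mentioned above.
# Keys are lowercased keyword variants, values are the form to display.
CANONICAL = {
    "sf": "San Francisco",
    "san fran": "San Francisco",
    "san francisco": "San Francisco",
    "mlb": "Major League Baseball",
}


def canonical_keyword(keyword):
    """Map a whole keyword (never its individual words) to its display form."""
    cleaned = keyword.strip()
    return CANONICAL.get(cleaned.lower(), cleaned)


def facet_counts(docs):
    """Count documents per display keyword; a doc counts once per facet value."""
    counts = Counter()
    for keywords in docs:
        # A set, so 'sf' and 'san fran' on the same doc count only once.
        counts.update({canonical_keyword(k) for k in keywords})
    return counts


if __name__ == "__main__":
    # One document tagged with 'sf' and 'mlb', as in my example.
    print(dict(facet_counts([["sf", "mlb"]])))
```

The important part is that the lookup is done on the whole keyword value,
so 'San Francisco' is never split on whitespace before it is counted.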
Thanks for any help,
Don