|
View:
New views
4 Messages
—
Rating Filter:
Alert me
|
|
|
Indexing domain names?Hi,
How do I go about indexing domain names? I currently index the domain, but it only works if I put the exact full domain in. For example: site:www.youtube.com (this works) site:youtube.com (this doesn't work) I am using the StandardAnalyzer as most of the other fields being indexed are free form text. Currently the "site" field is stored and tokenized. As an additional improvement it would be even better if something like this worked: site:youtube.com/foo Cheers, Chris |
|
|
Re: Indexing domain names?> Hi,
> > How do I go about indexing domain names? I currently index > the domain, but > it only works if I put the exact full domain in. For > example: > > site:www.youtube.com (this works) > site:youtube.com (this doesn't work) > > I am using the StandardAnalyzer as most of the other fields > being indexed > are free form text. Currently the "site" field is stored > and tokenized. StandardTokenizer recognizes www.youtube.com and youtube.com as singe token. Therefore they do not match. You can use SimpleAnalyzer which uses LetterTokenizer. So www.youtube.com will be broken into three tokens: www youtube com youtube.com will be boreken into two tokens : youtube com By doing so site:youtube.com will bring you www.youtube.com But query site:youtube.com will also match a document like www.foo.com/youtube.com Note that LetterTokenizer uses Character.isLetter() method to break text. If your input has numbers like www.645cafe.com it will cause you problems. In your case it is better to extend CharTonizer and override protected boolean isTokenChar(char c) method according to your needs. > As an additional improvement it would be even better if > something like this > worked: > > site:youtube.com/foo To accomplish this, you can pre-process your queries to strip from first '/' char to the end. You need to convert youtube.com/foo/bla/bla to youtube.com. You can do it in a TokenFilter along with KeywordTokenizer with writing custom code. Hope this helps. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@... For additional commands, e-mail: java-user-help@... |
|
|
Re: Indexing domain names?<<<I am using the StandardAnalyzer as most of the other fields being indexed
are free form text. >>> If you try Ahmet's suggestion, PerFieldAnalyzerWrapper is your friend. The snippet above makes me wonder if you've seen this class...... Best Erick On Sun, Nov 8, 2009 at 5:54 AM, AHMET ARSLAN <iorixxx@...> wrote: > > Hi, > > > > How do I go about indexing domain names? I currently index > > the domain, but > > it only works if I put the exact full domain in. For > > example: > > > > site:www.youtube.com (this works) > > site:youtube.com (this doesn't work) > > > > I am using the StandardAnalyzer as most of the other fields > > being indexed > > are free form text. Currently the "site" field is stored > > and tokenized. > > StandardTokenizer recognizes www.youtube.com and youtube.com as singe > token. Therefore they do not match. You can use SimpleAnalyzer which uses > LetterTokenizer. So > > www.youtube.com will be broken into three tokens: www youtube com > youtube.com will be boreken into two tokens : youtube com > > By doing so site:youtube.com will bring you www.youtube.com > > But query site:youtube.com will also match a document like > www.foo.com/youtube.com > > Note that LetterTokenizer uses Character.isLetter() method to break text. > If your input has numbers like www.645cafe.com it will cause you problems. > > In your case it is better to extend CharTonizer and override protected > boolean isTokenChar(char c) method according to your needs. > > > As an additional improvement it would be even better if > > something like this > > worked: > > > > site:youtube.com/foo > > To accomplish this, you can pre-process your queries to strip from first > '/' char to the end. You need to convert youtube.com/foo/bla/bla to > youtube.com. > You can do it in a TokenFilter along with KeywordTokenizer with writing > custom code. > > Hope this helps. > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscribe@... > For additional commands, e-mail: java-user-help@... > > |
|
|
Re: Indexing domain names?Thanks for the tips guys, got it working now.
Cheers, Chris On Sun, Nov 8, 2009 at 6:43 AM, Erick Erickson <erickerickson@...>wrote: > <<<I am using the StandardAnalyzer as most of the other fields being > indexed > are free form text. >>> > > If you try Ahmet's suggestion, PerFieldAnalyzerWrapper is your friend. The > snippet > above makes me wonder if you've seen this class...... > > Best > Erick > > On Sun, Nov 8, 2009 at 5:54 AM, AHMET ARSLAN <iorixxx@...> wrote: > > > > Hi, > > > > > > How do I go about indexing domain names? I currently index > > > the domain, but > > > it only works if I put the exact full domain in. For > > > example: > > > > > > site:www.youtube.com (this works) > > > site:youtube.com (this doesn't work) > > > > > > I am using the StandardAnalyzer as most of the other fields > > > being indexed > > > are free form text. Currently the "site" field is stored > > > and tokenized. > > > > StandardTokenizer recognizes www.youtube.com and youtube.com as singe > > token. Therefore they do not match. You can use SimpleAnalyzer which uses > > LetterTokenizer. So > > > > www.youtube.com will be broken into three tokens: www youtube com > > youtube.com will be boreken into two tokens : youtube com > > > > By doing so site:youtube.com will bring you www.youtube.com > > > > But query site:youtube.com will also match a document like > > www.foo.com/youtube.com > > > > Note that LetterTokenizer uses Character.isLetter() method to break text. > > If your input has numbers like www.645cafe.com it will cause you > problems. > > > > In your case it is better to extend CharTonizer and override protected > > boolean isTokenChar(char c) method according to your needs. > > > > > As an additional improvement it would be even better if > > > something like this > > > worked: > > > > > > site:youtube.com/foo > > > > To accomplish this, you can pre-process your queries to strip from first > > '/' char to the end. You need to convert youtube.com/foo/bla/bla to > > youtube.com. > > You can do it in a TokenFilter along with KeywordTokenizer with writing > > custom code. > > > > Hope this helps. > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscribe@... > > For additional commands, e-mail: java-user-help@... > > > > > |
| Free embeddable forum powered by Nabble | Forum Help |