Indexing domain names?

Indexing domain names?

by Chris Were

Hi,

How do I go about indexing domain names? I currently index the domain, but
it only works if I put the exact full domain in. For example:

site:www.youtube.com (this works)
site:youtube.com (this doesn't work)

I am using the StandardAnalyzer as most of the other fields being indexed
are free form text. Currently the "site" field is stored and tokenized.

As an additional improvement it would be even better if something like this
worked:

site:youtube.com/foo

Cheers,
Chris

Re: Indexing domain names?

by Ahmet Arslan

> Hi,
>
> How do I go about indexing domain names? I currently index the domain, but
> it only works if I put the exact full domain in. For example:
>
> site:www.youtube.com (this works)
> site:youtube.com (this doesn't work)
>
> I am using the StandardAnalyzer as most of the other fields being indexed
> are free form text. Currently the "site" field is stored and tokenized.

StandardTokenizer recognizes www.youtube.com and youtube.com each as a single token. Therefore they do not match. You can use SimpleAnalyzer, whose tokenizer is based on LetterTokenizer. So

www.youtube.com will be broken into three tokens: www youtube com
youtube.com     will be broken into two tokens:   youtube com

With that, site:youtube.com will match www.youtube.com.

But the query site:youtube.com will also match a document like www.foo.com/youtube.com

Note that LetterTokenizer uses the Character.isLetter() method to break text, so digits act as separators. If your input has numbers, like www.645cafe.com, the digits are dropped and the host is indexed as www cafe com.

In your case it is better to extend CharTokenizer and override the protected boolean isTokenChar(char c) method according to your needs.
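To make the difference concrete, here is a minimal plain-Java sketch of the splitting rule. It does not use Lucene at all: DomainTokens, TokenCharRule and the two rule constants are hypothetical names made up for this illustration, and the lowercasing is an assumption that mimics what SimpleAnalyzer does. The only point is to show what changes when the isTokenChar decision goes from "letters only" to "letters and digits".

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for a CharTokenizer-style split; this is NOT Lucene code.
final class DomainTokens {

    /** Mirrors the isTokenChar(char) hook: decides which chars belong to a token. */
    interface TokenCharRule {
        boolean isTokenChar(char c);
    }

    // The rule LetterTokenizer applies: only letters are token characters.
    static final TokenCharRule LETTERS_ONLY = c -> Character.isLetter(c);

    // A possible custom rule (assumption): keep digits inside tokens as well.
    static final TokenCharRule LETTERS_AND_DIGITS = c -> Character.isLetterOrDigit(c);

    /** Splits text into lowercased runs of characters accepted by the rule. */
    static List<String> tokenize(String text, TokenCharRule rule) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (rule.isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-token char ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("www.645cafe.com", LETTERS_ONLY));       // [www, cafe, com]
        System.out.println(tokenize("www.645cafe.com", LETTERS_AND_DIGITS)); // [www, 645cafe, com]
    }
}
```

In a real CharTokenizer subclass the body of isTokenChar(char c) would play the role of the rule above. Which extra characters to accept (hyphens, for instance) depends on the host names you expect.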

> As an additional improvement it would be even better if something like this
> worked:
>
> site:youtube.com/foo

To accomplish this, you can pre-process your queries to strip everything from the first '/' character to the end, so that youtube.com/foo/bla/bla becomes youtube.com.
Alternatively, you can do the same at analysis time by writing a custom TokenFilter and using it together with KeywordTokenizer.
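The query pre-processing step could look roughly like the sketch below. SiteQueryPreprocessor and stripPath are hypothetical names, and the code assumes the value has no scheme prefix such as http://, since the first '/' would then sit inside the scheme.

```java
// Hypothetical helper for trimming a site: query value before searching.
final class SiteQueryPreprocessor {

    /**
     * Returns the part of the value before the first '/'.
     * Assumes no "http://" style prefix is present.
     */
    static String stripPath(String site) {
        int slash = site.indexOf('/');
        return slash < 0 ? site : site.substring(0, slash);
    }

    public static void main(String[] args) {
        System.out.println(stripPath("youtube.com/foo/bla/bla")); // youtube.com
        System.out.println(stripPath("youtube.com"));             // youtube.com
    }
}
```

The same trimming has to be applied to the indexed values as well, otherwise path segments in the index can still produce matches like the www.foo.com/youtube.com case mentioned earlier.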

Hope this helps.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@...
For additional commands, e-mail: java-user-help@...


Re: Indexing domain names?

by Erick Erickson

<<<I am using the StandardAnalyzer as most of the other fields being indexed
are free form text. >>>

If you try Ahmet's suggestion, PerFieldAnalyzerWrapper is your friend. The
snippet above makes me wonder if you've seen this class...
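For anyone who has not seen the idea before, the following plain-Java sketch models what a per-field wrapper does: one default analyzer for most fields, plus an override for the "site" field. It is only a toy model, not Lucene's PerFieldAnalyzerWrapper. PerFieldSketch, splitOn and example are hypothetical names, and the whitespace splitter merely stands in for an analyzer that keeps host names whole.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Toy model of per-field analyzer dispatch; NOT Lucene's PerFieldAnalyzerWrapper.
final class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers an analyzer that overrides the default for one field. */
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        byField.put(field, analyzer);
    }

    /** Uses the field-specific analyzer if one is registered, else the default. */
    List<String> analyze(String field, String text) {
        return byField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Lowercases the text and splits it on the given regex, dropping empty parts. */
    static List<String> splitOn(String regex, String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split(regex)) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    /** Example wiring: whitespace split by default, letter-based split for "site". */
    static PerFieldSketch example() {
        PerFieldSketch wrapper = new PerFieldSketch(text -> splitOn("\\s+", text));
        wrapper.addAnalyzer("site", text -> splitOn("[^\\p{L}]+", text));
        return wrapper;
    }

    public static void main(String[] args) {
        PerFieldSketch wrapper = example();
        System.out.println(wrapper.analyze("site", "www.youtube.com"));     // [www, youtube, com]
        System.out.println(wrapper.analyze("body", "see www.youtube.com")); // [see, www.youtube.com]
    }
}
```

The same wrapper has to be used both at index time and at query time, so that the "site" field is tokenized the same way on both sides.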

Best
Erick

On Sun, Nov 8, 2009 at 5:54 AM, AHMET ARSLAN <iorixxx@...> wrote:

> [...]

Re: Indexing domain names?

by Chris Were

Thanks for the tips guys, got it working now.

Cheers,
Chris

On Sun, Nov 8, 2009 at 6:43 AM, Erick Erickson <erickerickson@...> wrote:

> [...]