solr index question


by David Stuart-6

Hi,

I am beginning to use Nutch to crawl sites (great stuff, btw) and have combined it with Solr, pushing the Nutch index using the solrindex command. I have set it up as specified on the wiki, using a copyField from url to id in the schema. Whilst this works fine, it messes up my inputs from other sources in Solr (e.g. using the Solr data import handler), as they have both ids and urls.
My question is: why was the id field not pushed to Solr directly, rather than relying on this odd copyField, given that you already know the id is going to be the url? Are there any plans to change this, or was it a design decision made for other reasons? Could we look at implementing a Nutch XML schema defining what the basic Nutch fields map to in the Solr push? I have hacked a fix into SolrWriter.java, but was wondering whether it could be worked into a long-term supported option.

Regards,


David
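
The field mapping proposed above can be sketched as follows. This is only an illustration of the idea, not the change made to SolrWriter.java: the class `FieldMapper`, its method `toSolrDoc`, and the use of plain maps in place of real Nutch and Solr document types are all assumptions made for the example. One Nutch field may feed several Solr fields, so url can be written to both id and url without a copyField in the schema.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): renames fields before a document is pushed to Solr. */
final class FieldMapper {
    // Nutch field name -> Solr field names that should receive its value.
    private final Map<String, List<String>> targets;

    FieldMapper(Map<String, List<String>> targets) {
        this.targets = targets;
    }

    /** Builds the Solr-side document; fields without a mapping keep their own name. */
    Map<String, Object> toSolrDoc(Map<String, Object> nutchDoc) {
        Map<String, Object> solrDoc = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : nutchDoc.entrySet()) {
            List<String> names =
                targets.getOrDefault(field.getKey(), List.of(field.getKey()));
            for (String name : names) {
                solrDoc.put(name, field.getValue());
            }
        }
        return solrDoc;
    }
}
```

With a mapping of url to both id and url, a crawled page gets its id set explicitly at index time, while documents from other sources can keep supplying their own id and url values.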

Re: solr index question

by Andrzej Bialecki

This comes from the fact that Nutch doesn't really know the schema that
you are using in Solr, plus the fact that the functional equivalent of
"uniqueKey" in Nutch has always been named "url", which is hardcoded in
some places ... so, this is a deficiency in Nutch as well. Please note
that the reverse is true as well - SolrSearchBean hardcodes Solr's
uniqueKey to "id" instead of using a configurable name.

I agree that both these places should use configurable names. Can you
provide a patch?

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
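
A minimal sketch of what "configurable names" could look like on both sides, assuming the key field names are read from configuration with the current hardcoded values as defaults. The class `UniqueKeyConfig` and both property names are hypothetical, and `java.util.Properties` merely stands in for whatever configuration object the real code uses; this is not the code in Nutch or in any patch.

```java
import java.util.Properties;

/** Hypothetical holder for the unique-key field names on the Nutch and Solr sides. */
final class UniqueKeyConfig {
    // Property names invented for this example.
    static final String NUTCH_KEY_PROP = "example.nutch.uniquekey";
    static final String SOLR_KEY_PROP = "example.solr.uniquekey";

    /** Nutch-side key field; falls back to the traditional "url". */
    static String nutchKeyField(Properties conf) {
        return conf.getProperty(NUTCH_KEY_PROP, "url");
    }

    /** Solr-side key field; falls back to "id". */
    static String solrKeyField(Properties conf) {
        return conf.getProperty(SOLR_KEY_PROP, "id");
    }

    /** Builds a quoted lookup query on the configured Solr key field. */
    static String lookupQuery(Properties conf, String keyValue) {
        // Inside a quoted term, backslashes and double quotes must be escaped.
        String escaped = keyValue.replace("\\", "\\\\").replace("\"", "\\\"");
        return solrKeyField(conf) + ":\"" + escaped + "\"";
    }
}
```

Keeping the old names as defaults means an existing setup would behave as before, and only installations with a different schema need to set the properties.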


Re: solr index question

by David Stuart-6

Hi Andrzej,

I will have a go at putting something together. I just wanted to make sure I had the background and wasn't just fixing my own problem.

Regards,

Dave


On 13 October 2009 at 23:04 Andrzej Bialecki <ab@...> wrote:

> This comes from the fact that Nutch doesn't really know the schema that
> you are using in Solr, plus the fact that the functional equivalent of
> "uniqueKey" in Nutch has always been named "url", which is hardcoded in
> some places ... so, this is a deficiency in Nutch as well. Please note
> that the reverse is true as well - SolrSearchBean hardcodes Solr's
> uniqueKey to "id" instead of using a configurable name.
>
> I agree that both these places should use configurable names. Can you
> provide a patch?

Re: solr index question

by David Stuart-6

Hi Andrzej,

Patch supplied in the ticket.

Regards,

Dave

Re: solr index question

by David Stuart-6

Hi Andrzej,

I have submitted an updated patch that includes the SolrSearchBean modifications. Is anything else needed? If not, how do I get this into trunk?
https://issues.apache.org/jira/browse/NUTCH-760

Regards,


Dave