Plugins: when to perform web service requests, on fetch or on index?

by caezar

Hi All,

I'm writing several Nutch plugins that will send requests to some web services for the pages being indexed and store the retrieved data in the index. The question is: at which stage of crawling is it better, in terms of performance, to perform these web service requests: at fetching (in an HtmlParseFilter) or at indexing (in an IndexingFilter)?

Nutch version is 1.0, indexer is SolrIndexer.

Thanks.

Re: Plugins: when to perform web service requests, on fetch or on index?

by joel gump

How about writing a standalone app that analyzes the data after the crawl and index?

On Thu, Jun 18, 2009 at 6:57 PM, caezar<caezaris@...> wrote:

>
> Hi All,
>
> I'm writing several nutch plugins, which will perform a requests to some
> webservices for pages being indexed and store retrieved data in index. The
> question is: on what stage of crawling it is better to perform these
> webservice requests: on fetching or on indexing (in HtmlParseFilter or in
> IndexingFilter), in terms of performance, of course?
>
> Nutch version is 1.0, indexer is SolrIndexer.
>
> Thanks.
> --
> View this message in context: http://www.nabble.com/Plugins%3A-when-to-perform-web-service-requests%2C-on-fetch-or-on-index--tp24089858p24089858.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
>

Re: Plugins: when to perform web service requests, on fetch or on index?

by Stefan Dlugolinsky

Hello,

I don't know how v1.0 differs from v0.9, but in v0.9 I would make
those service requests at the indexing stage (extension point
IndexingFilter), where the data prepared by the previous stages
(parsers, etc.) is available, so you can use it in the requests.
But it depends on exactly what you want, in particular whether you
need parsed data in the requests. If not, you can make the web service
requests earlier, at the parsing stage (extension point Parser).

Here is something about core extension points:
http://wiki.apache.org/nutch/AboutPlugins

Steve

Re: Plugins: when to perform web service requests, on fetch or on index?

by caezar

Hi,

Thank you for the response. Parsed data is not used in the calls, only the page URL. So will performance be better if I perform these requests at the parsing stage?

Re: Plugins: when to perform web service requests, on fetch or on index?

by caezar

I don't think it will be faster: crawling is performed on a cluster.
joel gump wrote:
how about write standalone app. analyze data after crawl and index.

Re: Plugins: when to perform web service requests, on fetch or on index?

by Stefan Dlugolinsky

Hi,

Well, I would say that the indexing stage is better than parsing, because the parsing stage can have many parse filters that need to be executed, and they need system resources (there are several parallel threads running). But generally there might not be any difference in performance between the two stages, since there can also be several indexing filters, which need system resources too. I would try both variants, measure performance on a subset of documents, compare the results and choose the better one. To improve performance further, I would also try caching the web service responses locally; that can save something on repeated calls.
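A minimal sketch of the local caching idea, in Java since Nutch plugins are written in Java. This is not Nutch code: `CachedLookup`, `CacheDemo` and the `remote` function are hypothetical names used only for illustration, and the sketch assumes the service response depends only on the page URL, as stated earlier in the thread. It uses `java.util.function.Function`, so it needs Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Small LRU cache in front of a slow lookup keyed by page URL (illustrative only). */
class CachedLookup {
    private final Function<String, String> remote; // hypothetical stand-in for the web service call
    private final Map<String, String> cache;
    private int remoteCalls = 0;

    CachedLookup(Function<String, String> remote, final int maxEntries) {
        this.remote = remote;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries; // evict the least recently used entry
            }
        };
    }

    /**
     * Returns the cached value for the URL, calling the remote function only on a miss.
     * The lock is held during the remote call, which keeps the sketch simple but
     * serializes lookups; null responses are not cached.
     */
    synchronized String lookup(String url) {
        String value = cache.get(url);
        if (value == null) {
            value = remote.apply(url);
            remoteCalls++;
            cache.put(url, value);
        }
        return value;
    }

    /** Number of times the remote function was actually invoked. */
    synchronized int remoteCalls() {
        return remoteCalls;
    }
}

class CacheDemo {
    public static void main(String[] args) {
        // The lambda simulates a web service; a real plugin would do an HTTP call here.
        CachedLookup lookup = new CachedLookup(url -> "data-for-" + url, 1000);
        lookup.lookup("http://example.com/");
        lookup.lookup("http://example.com/"); // served from the cache
        System.out.println("remote calls: " + lookup.remoteCalls());
    }
}
```

An in-memory cache like this lives in a single JVM, so on a cluster each task would have its own copy; whether it saves much depends on how often the same URL is looked up within one task.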

Steve

2009/6/18 caezar <caezaris@...>

Hi,

Thank you for the response. Parsed data is not used in calls. Only page URL.
So performance will be better if perform this requests on parsing stage?


Re: Plugins: when to perform web service requests, on fetch or on index?

by joel gump

If only the URLs are used, how about saving the URLs to a database?
Then another app checks the database and calls the web services.
Personally, I don't like mixing anything into the crawl/index process.
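A rough sketch of this decoupled approach, in Java. Everything here is hypothetical: `PendingUrls` is an in-memory stand-in for the database table, `EnrichmentWorker` plays the role of the separate app, and the `service` function simulates the web service. Getting the results into the search index would be a further step that is not shown.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

/** In-memory stand-in for a database table of URLs waiting to be processed. */
class PendingUrls {
    private final Queue<String> pending = new ArrayDeque<>();

    /** Called from the crawl side: just record the URL, no web service call. */
    synchronized void add(String url) {
        pending.add(url);
    }

    /** Called from the worker side: next URL, or null when nothing is pending. */
    synchronized String poll() {
        return pending.poll();
    }
}

/** The separate app: drains pending URLs and calls the (simulated) web service. */
class EnrichmentWorker {
    static Map<String, String> drain(PendingUrls store, Function<String, String> service) {
        Map<String, String> results = new LinkedHashMap<>();
        String url;
        while ((url = store.poll()) != null) {
            results.put(url, service.apply(url));
        }
        return results;
    }
}

class WorkerDemo {
    public static void main(String[] args) {
        PendingUrls store = new PendingUrls();
        store.add("http://example.com/a");
        store.add("http://example.com/b");
        // The lambda stands in for a real web service request.
        Map<String, String> results = EnrichmentWorker.drain(store, url -> "data-for-" + url);
        System.out.println(results);
    }
}
```

The point of the design is that the crawl only pays for a cheap write per URL, while the slow service calls run on their own schedule.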

On Thu, Jun 18, 2009 at 10:28 PM, caezar<caezaris@...> wrote:

> I don't think it will be faster: crawling is performed on cluster.

Re: Plugins: when to perform web service requests, on fetch or on index?

by caezar

The main idea is not just to store some useless data in the index. Searches are performed on this data, combined with keyword searches, so I need this data in the index.

Re: Plugins: when to perform web service requests, on fetch or on index?

by Kirby Bohling-2

On Thu, Jun 18, 2009 at 1:42 PM, caezar<caezaris@...> wrote:
>
> The main idea not to just store some useless data in index. There are
> searches performed on this data, combined with keywords searches, so I need
> this data in index.

Given what you've said here, I'd look at the "index-more" plugin.  I
followed it and the pages linked below when I added a category and
keywords to pages (I added synonyms of domain-specific terms to help
find additional data, without forcing the user to search and
re-search).  The "explain" link on the search pages was very helpful
for seeing how that worked into the scoring.

I'm fairly sure that this is what you need according to what you've said:
http://wiki.apache.org/nutch/HowToMakeCustomSearch

This link was useful:
http://wiki.apache.org/nutch/WritingPluginExample-0.9

This is somewhat helpful:
http://wiki.apache.org/nutch/FAQ?highlight=(scoring)#head-347f304e874bee7ff37f8b1a69f9983103cc3150

Hope this is useful,
    Kirby




Re: Plugins: when to perform web service requests, on fetch or on index?

by caezar

I know how to implement this. What I'm asking is: when will it work faster, at the indexing stage or at the parsing stage?