Plugins: when to perform web service requests, on fetch or on index?
Hi All,
I'm writing several Nutch plugins that will perform requests to some web services for the pages being indexed and store the retrieved data in the index. The question is: in terms of performance, at what stage of crawling is it better to perform these web service requests, on fetching or on indexing (in HtmlParseFilter or in IndexingFilter)?

The Nutch version is 1.0 and the indexer is SolrIndexer.

Thanks.
|
|
Re: Plugins: when to perform web service requests, on fetch or on index?
How about writing a standalone app that analyzes the data after the crawl and index?

On Thu, Jun 18, 2009 at 6:57 PM, caezar<caezaris@...> wrote:
> Hi All,
>
> I'm writing several nutch plugins, which will perform a requests to some
> webservices for pages being indexed and store retrieved data in index. The
> question is: on what stage of crawling it is better to perform these
> webservice requests: on fetching or on indexing (in HtmlParseFilter or in
> IndexingFilter), in terms of performance, of course?
>
> Nutch version is 1.0, indexer is SolrIndexer.
>
> Thanks.
> --
> View this message in context: http://www.nabble.com/Plugins%3A-when-to-perform-web-service-requests%2C-on-fetch-or-on-index--tp24089858p24089858.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
|
|
Re: Plugins: when to perform web service requests, on fetch or on index?
Hello,
I don't know how v1.0 differs from v0.9, but in v0.9 I would make those service requests at the indexing stage (extension point IndexingFilter). At that point the data prepared by the previous stages (parsers, etc.) is available, so you can use it in the requests. It depends on exactly what you want, though: whether you need parsed data in the requests. If not, you can call the web services earlier, from the parsing stage (extension point Parse).

Here is something about the core extension points: http://wiki.apache.org/nutch/AboutPlugins

Steve

2009/6/18 caezar <caezaris@...>:
> Hi All,
>
> I'm writing several nutch plugins, which will perform a requests to some
> webservices for pages being indexed and store retrieved data in index. The
> question is: on what stage of crawling it is better to perform these
> webservice requests: on fetching or on indexing (in HtmlParseFilter or in
> IndexingFilter), in terms of performance, of course?
>
> Nutch version is 1.0, indexer is SolrIndexer.
>
> Thanks.
|
|
Re: Plugins: when to perform web service requests, on fetch or on index?
Hi,
Thank you for the response. Parsed data is not used in the calls, only the page URL. So will performance be better if I perform these requests at the parsing stage?
|
|
|
Re: Plugins: when to perform web service requests, on fetch or on index?
I don't think it will be faster: crawling is performed on a cluster.
|
|
|
Re: Plugins: when to perform web service requests, on fetch or on index?
Hi,
well, I would say that the indexing stage is better than the parsing stage, because at the parsing stage there can be many parsing filters that need to be executed, and they need system resources (there are several parallel threads running). In general, though, there might not be any difference in performance between the two stages, since there can also be several indexing filters that need system resources as well. I would try both variants, measure performance on a subset of documents, compare the results, and choose the better one. Apart from that, I would try to cache the web service responses locally; it can save something on repeated calls.

Steve

2009/6/18 caezar <caezaris@...>
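The local caching idea can be sketched as follows. This is only a minimal illustration in Python (a real Nutch plugin would be written in Java), and `CachedLookup` and `fake_service` are hypothetical names standing in for the plugin's own lookup code and the remote web service. It assumes the response depends only on the page URL, as stated earlier in the thread.

```python
from typing import Callable, Dict


class CachedLookup:
    """Wraps a per-URL web service call and remembers earlier answers."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch                # the real (slow) web service call
        self._cache: Dict[str, str] = {}   # url -> response already received
        self.misses = 0                    # how often the service was actually called

    def get(self, url: str) -> str:
        # Only call the service the first time a URL is seen.
        if url not in self._cache:
            self.misses += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]


def fake_service(url: str) -> str:
    # Hypothetical stand-in for the remote web service.
    return "category-for:" + url


if __name__ == "__main__":
    lookup = CachedLookup(fake_service)
    for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
        print(url, "->", lookup.get(url))
    print("service calls:", lookup.misses)
```

An in-memory dictionary like this only helps within one process. On a cluster each task would hold its own copy, so a shared store may be needed to get the same benefit across nodes.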
|
|
|
Re: Plugins: when to perform web service requests, on fetch or on index?
If only the URLs are used, how about saving the URLs to a database?
Then another app checks the database and calls the web services. Personally, I don't like mixing anything into the crawl/index process.

On Thu, Jun 18, 2009 at 10:28 PM, caezar<caezaris@...> wrote:
> I don't think it will be faster: crawling is performed on cluster.
>
> joel gump wrote:
>> how about write standalone app. analyze data after crawl and index.
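A rough sketch of this "save the URLs, process them elsewhere" approach, using Python's built-in `sqlite3` module. The table name `url_queue`, the function names, and the `call_service` callback are all assumptions made for illustration; nothing here is part of Nutch.

```python
import sqlite3
from typing import Callable


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database that holds URLs awaiting a web service call."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_queue ("
        " url TEXT PRIMARY KEY,"
        " result TEXT)"  # result stays NULL until the web service has been called
    )
    return conn


def enqueue(conn: sqlite3.Connection, url: str) -> None:
    """Called from the crawl side: record a URL, ignoring ones already queued."""
    conn.execute("INSERT OR IGNORE INTO url_queue (url) VALUES (?)", (url,))
    conn.commit()


def process_pending(conn: sqlite3.Connection, call_service: Callable[[str], str]) -> int:
    """Called from the separate app: fill in results for URLs that have none yet."""
    rows = conn.execute("SELECT url FROM url_queue WHERE result IS NULL").fetchall()
    for (url,) in rows:
        conn.execute(
            "UPDATE url_queue SET result = ? WHERE url = ?",
            (call_service(url), url),
        )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = open_queue()
    enqueue(conn, "http://example.com/a")
    enqueue(conn, "http://example.com/b")
    handled = process_pending(conn, lambda url: "data-for:" + url)
    print("processed", handled, "urls")
```

This keeps the web service calls out of the crawl, but the results still live outside the search index; getting them into the index would need a further step that this sketch does not cover.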
|
|
Re: Plugins: when to perform web service requests, on fetch or on index?
The main idea is not just to store some useless data in the index. Searches are performed on this data, combined with keyword searches, so I need this data in the index.
|
|
|
Re: Plugins: when to perform web service requests, on fetch or on index?
On Thu, Jun 18, 2009 at 1:42 PM, caezar<caezaris@...> wrote:
> The main idea not to just store some useless data in index. There are
> searches performed on this data, combined with keywords searches, so I need
> this data in index.

Given what you've said here, I'd look at the "index-more" plugin. I followed it and the pages below when I added a category and keywords to pages (I added synonyms of domain-specific terms to help find additional data, without forcing the user to search and re-search). I followed the index-more plugin to figure out how to add them. The "explain" link on the search pages was very helpful for seeing how that worked into the scoring.

I'm fairly sure that this is what you need, according to what you've said:
http://wiki.apache.org/nutch/HowToMakeCustomSearch

This link was useful:
http://wiki.apache.org/nutch/WritingPluginExample-0.9

This is somewhat helpful:
http://wiki.apache.org/nutch/FAQ?highlight=(scoring)#head-347f304e874bee7ff37f8b1a69f9983103cc3150

Hope this is useful,
Kirby

> joel gump wrote:
>> If only the urls is used, how about save urls to a database.
>> Then, another app check the database, call webservices.
>> Personally , i dont like mix something with crawl/index process.
|
|
Re: Plugins: when to perform web service requests, on fetch or on index?
I know how to implement this. I'm asking when it will work faster: at the indexing stage or at the parsing stage?
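Since nobody in the thread gives a definite answer, the "try both variants and measure on a subset" suggestion made earlier may be the practical way to decide. The Python sketch below only illustrates the shape of such a comparison. `call_in_parse_stage` and `call_in_index_stage` are hypothetical stand-ins with made-up costs; in practice each variant would be a real Nutch run over the same subset of documents, one with the call in an HtmlParseFilter and one with it in an IndexingFilter.

```python
import time
from typing import Callable, Dict, List


def time_variant(process: Callable[[str], object], urls: List[str]) -> float:
    """Return the wall-clock seconds needed to run `process` over all URLs."""
    start = time.perf_counter()
    for url in urls:
        process(url)
    return time.perf_counter() - start


def compare(variants: Dict[str, Callable[[str], object]], urls: List[str]) -> Dict[str, float]:
    """Time every variant on the same URL subset."""
    return {name: time_variant(fn, urls) for name, fn in variants.items()}


def fastest(timings: Dict[str, float]) -> str:
    """Name of the variant with the smallest elapsed time."""
    return min(timings, key=timings.get)


def call_in_parse_stage(url: str) -> str:
    # Hypothetical stand-in; the sleep is a pretend cost, not a measurement.
    time.sleep(0.002)
    return url


def call_in_index_stage(url: str) -> str:
    # Hypothetical stand-in; the sleep is a pretend cost, not a measurement.
    time.sleep(0.002)
    return url


if __name__ == "__main__":
    subset = ["http://example.com/page%d" % i for i in range(20)]
    timings = compare(
        {"parse stage": call_in_parse_stage, "index stage": call_in_index_stage},
        subset,
    )
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.4f} s")
    print("fastest:", fastest(timings))
```

Running each variant several times and comparing the spread, not a single run, would give a more trustworthy picture, especially on a shared cluster.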