<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<id>tag:old.nabble.com,2006:forum-373</id>
	<title>Nabble - Nutch - Dev</title>
	<updated>2009-11-26T03:39:10Z</updated>
	<link rel="self" type="application/atom+xml" href="http://old.nabble.com/Nutch---Dev-f373.xml" />
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch---Dev-f373.html" />
	<subtitle type="html"></subtitle>
	
<entry>
	<id>tag:old.nabble.com,2006:post-26528174</id>
	<title>Re: wrong wiki front page</title>
	<published>2009-11-26T03:39:10Z</published>
	<updated>2009-11-26T03:39:10Z</updated>
	<author>
		<name>Alban Mouton</name>
	</author>
	<content type="html">The issue and its solutions are described here: &lt;a href=&quot;http://wiki.apache.org/httpd/HelpOnConfiguration#Default_front_page&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/httpd/HelpOnConfiguration#Default_front_page&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;div class=&quot;gmail_quote&quot;&gt;2009/11/24 Alban Mouton &lt;span dir=&quot;ltr&quot;&gt;&amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26528174&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;alban83@...&lt;/a&gt;&amp;gt;&lt;/span&gt;&lt;br&gt;&lt;blockquote class=&quot;gmail_quote&quot; style=&quot;border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;&quot;&gt;

Hello everybody,&lt;br&gt;&lt;br&gt;I don&amp;#39;t know if this is a known issue, but it has been like this for at least a couple of days, so I figured I should tell someone. The root URL for the Nutch wiki, &lt;a href=&quot;http://wiki.apache.org/nutch/&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/&lt;/a&gt;, doesn&amp;#39;t redirect to &lt;a href=&quot;http://wiki.apache.org/nutch/FrontPage&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/FrontPage&lt;/a&gt;! It&amp;#39;s annoying because that&amp;#39;s the URL given by Google and the Nutch website. It might be a language detection problem, because I see this ugly and not very helpful page instead: &lt;a href=&quot;http://wiki.apache.org/nutch/PageD%27Accueil&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/PageD%27Accueil&lt;/a&gt; (&amp;quot;page d&amp;#39;accueil&amp;quot; means &amp;quot;home page&amp;quot; in French).&lt;br&gt;


&lt;br&gt;Not much of a contribution for my first message here, but I hope to do more soon.&lt;br&gt;&lt;font color=&quot;#888888&quot;&gt;&lt;br&gt;Alban Mouton&lt;br&gt;
&lt;/font&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;br&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/wrong-wiki-front-page-tp26499343p26528174.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26523835</id>
	<title>[jira] Resolved: (NUTCH-185) XMLParser is configurable xml parser plugin.</title>
	<published>2009-11-25T19:16:39Z</published>
	<updated>2009-11-25T19:16:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Chris A. Mattmann resolved NUTCH-185.
&lt;br&gt;-------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Resolution: Won't Fix
&lt;br&gt;&amp;nbsp; &amp;nbsp; Fix Version/s: 1.1
&lt;br&gt;&lt;br&gt;See comments related to NUTCH-767 in this issue's comments section. Once we address NUTCH-767, we get this functionality for free...
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; XMLParser is configurable xml parser plugin.
&lt;br&gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-185
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-185&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-185&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: New Feature
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: fetcher, indexer
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.7.2, 0.8, 0.8.1
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: OS Independent
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Rida Benjelloun
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Chris A. Mattmann
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: parse-xml.patch, parse-xml.zip, parse-xml.zip
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; XMLParser is a configurable plugin. It uses XPath and namespaces to map XML elements to Lucene fields.
&lt;br&gt;&amp;gt; Information:
&lt;br&gt;&amp;gt; 1- Copy &amp;quot;xmlparser-conf.xml&amp;quot; to the nutch/conf dir
&lt;br&gt;&amp;gt; 2- To index your custom XML files, modify &amp;quot;xmlparser-conf.xml&amp;quot;.
&lt;br&gt;&amp;gt; This parser uses namespaces and XPath to parse XML content.
&lt;br&gt;&amp;gt; The config file maps XML nodes (selected with XPath) to Lucene fields.
&lt;br&gt;&amp;gt; Example : &amp;lt;field name=&amp;quot;dctitle&amp;quot; xpath=&amp;quot;//dc:title&amp;quot; type=&amp;quot;Text&amp;quot; boost=&amp;quot;1.4&amp;quot; /&amp;gt; 
&lt;br&gt;&amp;gt; 3- The xmlIndexerProperties element encapsulates a set of fields associated with a namespace.
&lt;br&gt;&amp;gt; If the namespace is found in the XML document, the fields associated with that namespace will be indexed.
&lt;br&gt;&amp;gt; Example : 
&lt;br&gt;&amp;gt; &amp;lt;xmlIndexerProperties type=&amp;quot;filePerDocument&amp;quot; namespace=&amp;quot; &lt;a href=&quot;http://purl.org/dc/elements/1.1/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://purl.org/dc/elements/1.1/&lt;/a&gt;&amp;quot;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;field name=&amp;quot;dctitle&amp;quot; xpath=&amp;quot;//dc:title&amp;quot; type=&amp;quot;Text&amp;quot; boost=&amp;quot;1.4&amp;quot; /&amp;gt; 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;field name=&amp;quot;dccreator&amp;quot; xpath=&amp;quot;//dc:creator&amp;quot; type=&amp;quot;keyword&amp;quot; boost=&amp;quot;1.0&amp;quot; /&amp;gt; 
&lt;br&gt;&amp;gt; &amp;lt;/xmlIndexerProperties&amp;gt;
&lt;br&gt;&amp;gt; 4- It is possible to define a default namespace that will be applied when the parser 
&lt;br&gt;&amp;gt; doesn't find any namespace in the document, or when the namespace found in the XML document doesn't match any namespace defined in the xmlIndexerProperties elements.
&lt;br&gt;&amp;gt; Example :
&lt;br&gt;&amp;gt; &amp;lt;xmlIndexerProperties type=&amp;quot;filePerDocument&amp;quot; namespace=&amp;quot;default&amp;quot;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;field name=&amp;quot;xmlcontent&amp;quot; xpath=&amp;quot;//*&amp;quot; type=&amp;quot;Unstored&amp;quot; boost=&amp;quot;1.0&amp;quot; /&amp;gt; 
&lt;br&gt;&amp;gt; &amp;lt;/xmlIndexerProperties&amp;gt;
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Resolved%3A-%28NUTCH-185%29-XMLParser-is-configurable-xml-parser-plugin.-tp26523835p26523835.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26520691</id>
	<title>[jira] Updated: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20</title>
	<published>2009-11-25T13:34:39Z</published>
	<updated>2009-11-25T13:34:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Dennis Kubes updated NUTCH-768:
&lt;br&gt;-------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: NUTCH-768-1-20091125.patch
&lt;br&gt;&lt;br&gt;I thought I was going to be able to do this without code changes. &amp;nbsp;No such luck. &amp;nbsp;
&lt;br&gt;&lt;br&gt;There are many, many deprecations as a result of this upgrade. &amp;nbsp;Anything that used the old Mapper and Reducer interfaces seems to have deprecated methods in it. &amp;nbsp;The NutchBean class needed to implement the two RPC*Bean interfaces to handle changes in Hadoop RPC (that could have been a leftover from 1.0 changes but I don't think so). &amp;nbsp;Also there are numerous changes to build scripts and the nutch bin script to support different hadoop jars.
&lt;br&gt;&lt;br&gt;There are also many new files for the conf directory as Hadoop has split out files and has new configuration files for new capabilities.
&lt;br&gt;&lt;br&gt;After all changes I was able to run everything in local and pseudo-distributed mode as well as test out local and distributed searching. &amp;nbsp;Everything seems to work fine. &amp;nbsp;After we make this upgrade I would recommend going back and updating all of the tool interfaces for the most recent APIs.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Upgrade Nutch 1.0 to use Hadoop 0.20
&lt;br&gt;&amp;gt; ------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-768
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-768&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-768&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.1
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: All
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Dennis Kubes
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Dennis Kubes
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-768-1-20091125.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Upgrade Nutch 1.0 to use the Hadoop 0.20 release. &amp;nbsp;
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-768%29-Upgrade-Nutch-1.0-to-use-Hadoop-0.20-tp26461521p26520691.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26520460</id>
	<title>[jira] Commented: (NUTCH-772) Upgrade Nutch to use Lucene 2.9.1</title>
	<published>2009-11-25T13:18:39Z</published>
	<updated>2009-11-25T13:18:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782624#action_12782624&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782624#action_12782624&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;commented on NUTCH-772:
&lt;br&gt;-----------------------------------------
&lt;br&gt;&lt;br&gt;Fixed in rev. 884277.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Upgrade Nutch to use Lucene 2.9.1
&lt;br&gt;&amp;gt; ---------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-772
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-772&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-772&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.1
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: lucene.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Upgrade Nutch to the latest Lucene release.
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-772%29-Upgrade-Nutch-to-use-Lucene-2.9.1-tp26511889p26520460.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26520463</id>
	<title>[jira] Closed: (NUTCH-772) Upgrade Nutch to use Lucene 2.9.1</title>
	<published>2009-11-25T13:18:39Z</published>
	<updated>2009-11-25T13:18:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;closed NUTCH-772.
&lt;br&gt;-----------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Resolution: Fixed
&lt;br&gt;&amp;nbsp; &amp;nbsp; Fix Version/s: 1.1
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Upgrade Nutch to use Lucene 2.9.1
&lt;br&gt;&amp;gt; ---------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-772
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-772&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-772&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.1
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: lucene.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Upgrade Nutch to the latest Lucene release.
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-772%29-Upgrade-Nutch-to-use-Lucene-2.9.1-tp26511889p26520463.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26520209</id>
	<title>[jira] Closed: (NUTCH-760) Allow field mapping from nutch to solr index</title>
	<published>2009-11-25T13:00:39Z</published>
	<updated>2009-11-25T13:00:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;closed NUTCH-760.
&lt;br&gt;-----------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Resolution: Fixed
&lt;br&gt;&amp;nbsp; &amp;nbsp; Fix Version/s: 1.1
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Allow field mapping from nutch to solr index
&lt;br&gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-760
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: indexer
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: David Stuart
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I am using Nutch to crawl sites and have combined it
&lt;br&gt;&amp;gt; with Solr, pushing the Nutch index using the solrindex command. I have
&lt;br&gt;&amp;gt; set it up as specified on the wiki, using the copyField url to id in the
&lt;br&gt;&amp;gt; schema. Whilst this works fine, it stuffs up my inputs from other
&lt;br&gt;&amp;gt; sources in Solr (e.g. using the Solr data import handler), as they have
&lt;br&gt;&amp;gt; both ids and urls. I have a patch that implements a Nutch XML schema
&lt;br&gt;&amp;gt; defining what basic Nutch fields map to in your Solr push.
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-760%29-Allow-field-mapping-from-nutch-to-solr-index-tp25906464p26520209.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26520210</id>
	<title>[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index</title>
	<published>2009-11-25T13:00:39Z</published>
	<updated>2009-11-25T13:00:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782617#action_12782617&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782617#action_12782617&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;commented on NUTCH-760:
&lt;br&gt;-----------------------------------------
&lt;br&gt;&lt;br&gt;I reworked the patch to get rid of any left-overs of static Configuration, and changed the concept of &amp;quot;schema&amp;quot; (which was misleading) to &amp;quot;mapping&amp;quot; throughout the patch and class names.
&lt;br&gt;&lt;br&gt;This is now committed in rev. 884269 - thanks!
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Allow field mapping from nutch to solr index
&lt;br&gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-760
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: indexer
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: David Stuart
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I am using Nutch to crawl sites and have combined it
&lt;br&gt;&amp;gt; with Solr, pushing the Nutch index using the solrindex command. I have
&lt;br&gt;&amp;gt; set it up as specified on the wiki, using the copyField url to id in the
&lt;br&gt;&amp;gt; schema. Whilst this works fine, it stuffs up my inputs from other
&lt;br&gt;&amp;gt; sources in Solr (e.g. using the Solr data import handler), as they have
&lt;br&gt;&amp;gt; both ids and urls. I have a patch that implements a Nutch XML schema
&lt;br&gt;&amp;gt; defining what basic Nutch fields map to in your Solr push.
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-760%29-Allow-field-mapping-from-nutch-to-solr-index-tp25906464p26520210.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26517681</id>
	<title>[jira] Closed: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer</title>
	<published>2009-11-25T10:10:39Z</published>
	<updated>2009-11-25T10:10:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;closed NUTCH-761.
&lt;br&gt;-----------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Resolution: Fixed
&lt;br&gt;&amp;nbsp; &amp;nbsp; Fix Version/s: 1.1
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Avoid cloning CrawlDatum in CrawlDbReducer 
&lt;br&gt;&amp;gt; ------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-761
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-761&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-761&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: optiCrawlReducer.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; In the huge majority of cases the CrawlDbReducer gets unique CrawlData in its reduce phase and these will be the entries coming from the crawlDB and not present in the segments.
&lt;br&gt;&amp;gt; The attached patch optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. This has more impact as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase.
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-761%29-Avoid-cloningCrawlDatum-in-CrawlDbReducer-tp26180618p26517681.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26517682</id>
	<title>[jira] Commented: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer</title>
	<published>2009-11-25T10:10:39Z</published>
	<updated>2009-11-25T10:10:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782537#action_12782537&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782537#action_12782537&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;commented on NUTCH-761:
&lt;br&gt;-----------------------------------------
&lt;br&gt;&lt;br&gt;I applied the patch with some changes - reverted the logic in the name of the boolean var, and applied the same method to other cases of non-multiple values. Committed in rev. 884224 - thanks!
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Avoid cloning CrawlDatum in CrawlDbReducer 
&lt;br&gt;&amp;gt; ------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-761
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-761&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-761&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: optiCrawlReducer.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; In the huge majority of cases the CrawlDbReducer gets unique CrawlData in its reduce phase and these will be the entries coming from the crawlDB and not present in the segments.
&lt;br&gt;&amp;gt; The attached patch optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. This has more impact as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase.
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-761%29-Avoid-cloningCrawlDatum-in-CrawlDbReducer-tp26180618p26517682.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26517151</id>
	<title>[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB</title>
	<published>2009-11-25T09:38:39Z</published>
	<updated>2009-11-25T09:38:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782524#action_12782524&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782524#action_12782524&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;commented on NUTCH-762:
&lt;br&gt;-----------------------------------------
&lt;br&gt;&lt;br&gt;This class offers a strict superset of the current Generator functionality. Maintaining both tools would be cumbersome and error-prone. I propose to replace Generator with MultiGenerator (under the current name Generator).
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Alternative Generator which can generate several segments in one parse of the crawlDB
&lt;br&gt;&amp;gt; -------------------------------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-762
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-762&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-762&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: New Feature
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: generator
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.0.0
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-762-MultiGenerator.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB, then updating the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well, as we need to read the whole crawlDB as many times as we generate a segment.
&lt;br&gt;&amp;gt; The patch attached contains an implementation of a MultiGenerator &amp;nbsp;which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: 
&lt;br&gt;&amp;gt; * can filter the URLs by score
&lt;br&gt;&amp;gt; * normalisation is optional
&lt;br&gt;&amp;gt; * IP resolution is done ONLY on the entries which have been selected for &amp;nbsp;fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale
&lt;br&gt;&amp;gt; * can max the number of URLs per host or domain (but not by IP)
&lt;br&gt;&amp;gt; * can choose to partition by host, domain or IP
&lt;br&gt;&amp;gt; Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. 
&lt;br&gt;&amp;gt; We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers.
&lt;br&gt;&amp;gt; The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ...
&lt;br&gt;&amp;gt; with the following options :
&lt;br&gt;&amp;gt; MultiGenerator &amp;lt;crawldb&amp;gt; &amp;lt;segments_dir&amp;gt; [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
&lt;br&gt;&amp;gt; where most parameters are similar to the default Generator - apart from : 
&lt;br&gt;&amp;gt; -noNorm (explicit)
&lt;br&gt;&amp;gt; -topN : max number of URLs per segment
&lt;br&gt;&amp;gt; -maxNumSegments : the actual number of segments generated could be less than the max value selected, e.g. when not enough URLs are available for fetching and they fit in fewer segments
&lt;br&gt;&amp;gt; Please give it a try and let me know what you think of it
&lt;br&gt;&amp;gt; Julien Nioche
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp;
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-762%29-Alternative-Generator-which-can-generate-several-segments-in-one-parse-of-the-crawlDB-tp26180999p26517151.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26516805</id>
	<title>[jira] Commented: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice</title>
	<published>2009-11-25T09:22:39Z</published>
	<updated>2009-11-25T09:22:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782516#action_12782516&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782516#action_12782516&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;commented on NUTCH-753:
&lt;br&gt;-----------------------------------------
&lt;br&gt;&lt;br&gt;Fixed in rev. 884203 - thanks!
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Prevent new Fetcher to retrieve the robots twice
&lt;br&gt;&amp;gt; ------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-753
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-753&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-753&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: fetcher
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.0.0
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-753.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; The new Fetcher which is now used by default handles the robots file directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to false to prevent fetching the robots.txt twice (in Fetcher + in protocol), which avoids calling robots.isAllowed. However in practice the robots file is still fetched as there is a call to robots.getCrawlDelay() a bit further which is not covered by the if (Protocol.CHECK_ROBOTS).
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-753%29-Prevent-new-Fetcher-to-retrieve-the-robots-twice-tp25334618p26516805.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26516806</id>
	<title>[jira] Closed: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice</title>
	<published>2009-11-25T09:22:39Z</published>
	<updated>2009-11-25T09:22:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;closed NUTCH-753.
&lt;br&gt;-----------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Resolution: Fixed
&lt;br&gt;&amp;nbsp; &amp;nbsp; Fix Version/s: 1.1
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Prevent new Fetcher to retrieve the robots twice
&lt;br&gt;&amp;gt; ------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-753
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-753&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-753&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: fetcher
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.0.0
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-753.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; The new Fetcher which is now used by default handles the robots file directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to false to prevent fetching the robots.txt twice (in Fetcher + in protocol), which avoids calling robots.isAllowed. However in practice the robots file is still fetched as there is a call to robots.getCrawlDelay() a bit further which is not covered by the if (Protocol.CHECK_ROBOTS).
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-753%29-Prevent-new-Fetcher-to-retrieve-the-robots-twice-tp25334618p26516806.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26516618</id>
	<title>[jira] Closed: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java</title>
	<published>2009-11-25T09:12:39Z</published>
	<updated>2009-11-25T09:12:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;closed NUTCH-773.
&lt;br&gt;-----------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Resolution: Fixed
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; Assignee: Andrzej Bialecki 
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; some minor bugs in AbstractFetchSchedule.java
&lt;br&gt;&amp;gt; ---------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-773
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-773&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-773&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Bug
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: fetcher, generator
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.0.0
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Reinhard Schwab
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-773.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; fixes some minor trivial bugs in AbstractFetchSchedule.java
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-773%29-some-minor-bugs-in-AbstractFetchSchedule.java-tp26513364p26516618.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26516621</id>
	<title>[jira] Commented: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java</title>
	<published>2009-11-25T09:12:39Z</published>
	<updated>2009-11-25T09:12:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782509#action_12782509&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782509#action_12782509&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;commented on NUTCH-773:
&lt;br&gt;-----------------------------------------
&lt;br&gt;&lt;br&gt;That was a nasty bug - fixed in rev. 884198. Thanks!
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; some minor bugs in AbstractFetchSchedule.java
&lt;br&gt;&amp;gt; ---------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-773
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-773&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-773&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Bug
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: fetcher, generator
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.0.0
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Reinhard Schwab
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-773.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; fixes some minor trivial bugs in AbstractFetchSchedule.java
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-773%29-some-minor-bugs-in-AbstractFetchSchedule.java-tp26513364p26516621.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26515338</id>
	<title>Re: svn commit: r884075 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java</title>
	<published>2009-11-25T08:01:44Z</published>
	<updated>2009-11-25T08:01:44Z</updated>
	<author>
		<name>david.stuart@progressivealliance.co.uk</name>
	</author>
	<content type="html">&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;

&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;
  &lt;head&gt;
    &lt;meta content=&quot;text/html; charset=UTF-8&quot; http-equiv=&quot;Content-Type&quot; /&gt;
    &lt;title&gt;&lt;/title&gt;
  &lt;/head&gt;

  &lt;body&gt;
    Thanks&lt;br /&gt;
    On 25 November 2009 at 16:58 Andrzej Bialecki &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26515338&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt; wrote:&lt;br /&gt;
    &lt;br /&gt;
    &amp;gt; &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26515338&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt; wrote:&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160;While you are doing changes and commits in this area I have been &lt;br /&gt;
    &amp;gt; &amp;gt; waiting for this patch https://issues.apache.org/jira/browse/NUTCH-760 &lt;br /&gt;
    &amp;gt; &amp;gt; of mine to be incorporated for a while now. Is it possible to get it in?&lt;br /&gt;
    &amp;gt; &lt;br /&gt;
    &amp;gt; It&amp;#39;s on my agenda - I&amp;#39;ll apply the patch either today or tomorrow, time &lt;br /&gt;
    &amp;gt; permitting.&lt;br /&gt;
    &amp;gt; &lt;br /&gt;
    &amp;gt; &lt;br /&gt;
    &amp;gt; -- &lt;br /&gt;
    &amp;gt; Best regards,&lt;br /&gt;
    &amp;gt; Andrzej Bialecki&amp;#160; &amp;#160; &amp;#160;&amp;lt;&amp;gt;&amp;lt;&lt;br /&gt;
    &amp;gt;&amp;#160; &amp;#160;___. ___ ___ ___ _ _&amp;#160; &amp;#160;__________________________________&lt;br /&gt;
    &amp;gt; [__ || __|__/|__||\/|&amp;#160; Information Retrieval, Semantic Web&lt;br /&gt;
    &amp;gt; ___|||__||&amp;#160; \|&amp;#160; ||&amp;#160; |&amp;#160; Embedded Unix, System Integration&lt;br /&gt;
    &amp;gt; http://www.sigram.com&amp;#160; Contact: info at sigram dot com&lt;br /&gt;
    &amp;gt; &lt;br /&gt;
  &lt;/body&gt;
&lt;/html&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Re%3A-svn-commit%3A-r884075----lucene-nutch-trunk-src-java-org-apache-nutch-indexer-solr-SolrIndexer.java-tp26512753p26515338.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26515287</id>
	<title>[jira] Updated: (NUTCH-760) Allow field mapping from nutch to solr index</title>
	<published>2009-11-25T07:59:39Z</published>
	<updated>2009-11-25T07:59:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;updated NUTCH-760:
&lt;br&gt;------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Assignee: Andrzej Bialecki 
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Allow field mapping from nutch to solr index
&lt;br&gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-760
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: indexer
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: David Stuart
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I am using nutch to crawl sites and have combined it
&lt;br&gt;&amp;gt; with solr, pushing the nutch index using the solrindex command. I have
&lt;br&gt;&amp;gt; set it up as specified on the wiki using the copyField url to id in the
&lt;br&gt;&amp;gt; schema. Whilst this works fine, it stuffs up my inputs from other
&lt;br&gt;&amp;gt; sources in solr (e.g. using the solr data import handler) as they have
&lt;br&gt;&amp;gt; both ids and urls. I have a patch that implements a nutch xml schema
&lt;br&gt;&amp;gt; defining what basic nutch fields map to in your solr push.
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-760%29-Allow-field-mapping-from-nutch-to-solr-index-tp25906464p26515287.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26515266</id>
	<title>Re: svn commit: r884075 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java</title>
	<published>2009-11-25T07:58:21Z</published>
	<updated>2009-11-25T07:58:21Z</updated>
	<author>
		<name>Andrzej Bialecki</name>
	</author>
	<content type="html">&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26515266&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt; wrote:
&lt;br&gt;&amp;gt; &amp;nbsp; While you are doing changes and commits in this area I have been 
&lt;br&gt;&amp;gt; waiting for this patch &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760&lt;/a&gt;&amp;nbsp;
&lt;br&gt;&amp;gt; of mine to be incorporated for a while now. Is it possible to get it in?
&lt;br&gt;&lt;br&gt;It's on my agenda - I'll apply the patch either today or tomorrow, time 
&lt;br&gt;permitting.
&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;Best regards,
&lt;br&gt;Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;nbsp; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;[__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Re%3A-svn-commit%3A-r884075----lucene-nutch-trunk-src-java-org-apache-nutch-indexer-solr-SolrIndexer.java-tp26512753p26515266.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26514461</id>
	<title>Re: svn commit: r884075 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java</title>
	<published>2009-11-25T07:14:00Z</published>
	<updated>2009-11-25T07:14:00Z</updated>
	<author>
		<name>david.stuart@progressivealliance.co.uk</name>
	</author>
	<content type="html">&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;

&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;
  &lt;head&gt;
    &lt;meta content=&quot;text/html; charset=UTF-8&quot; http-equiv=&quot;Content-Type&quot; /&gt;
    &lt;title&gt;&lt;/title&gt;
  &lt;/head&gt;

  &lt;body&gt;
    While you are doing changes and commits in this area I have been waiting for this patch https://issues.apache.org/jira/browse/NUTCH-760 of mine to be incorporated for a while now. Is it possible to get it in?&lt;br /&gt;
    &lt;br /&gt;
    Regards,&lt;br /&gt;
    &lt;br /&gt;
    Dave&lt;br /&gt;
    &lt;br /&gt;
    On 25 November 2009 at 14:36 Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26514461&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt; wrote:&lt;br /&gt;
    &lt;br /&gt;
    &amp;gt; Oops.&amp;#160; Sorry about that.&lt;br /&gt;
    &amp;gt; &lt;br /&gt;
    &amp;gt; &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26514461&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt; wrote:&lt;br /&gt;
    &amp;gt; &amp;gt; Author: ab&lt;br /&gt;
    &amp;gt; &amp;gt; Date: Wed Nov 25 12:44:34 2009&lt;br /&gt;
    &amp;gt; &amp;gt; New Revision: 884075&lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; URL: http://svn.apache.org/viewvc?rev=884075&amp;amp;view=rev&lt;br /&gt;
    &amp;gt; &amp;gt; Log:&lt;br /&gt;
    &amp;gt; &amp;gt; Change access from private to public - this fixes Crawl.java breakage.&lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; Modified:&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160;lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java&lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; Modified: lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java&lt;br /&gt;
    &amp;gt; &amp;gt; URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java?rev=884075&amp;amp;r1=884074&amp;amp;r2=884075&amp;amp;view=diff&lt;br /&gt;
    &amp;gt; &amp;gt; ==============================================================================&lt;br /&gt;
    &amp;gt; &amp;gt; --- lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java (original)&lt;br /&gt;
    &amp;gt; &amp;gt; +++ lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java Wed Nov 25 12:44:34 2009&lt;br /&gt;
    &amp;gt; &amp;gt; @@ -50,7 +50,7 @@&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; super(conf);&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; }&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &lt;br /&gt;
    &amp;gt; &amp;gt; -&amp;#160; private void indexSolr(String solrUrl, Path crawlDb, Path linkDb,&lt;br /&gt;
    &amp;gt; &amp;gt; +&amp;#160; public void indexSolr(String solrUrl, Path crawlDb, Path linkDb,&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; List&amp;lt;Path&amp;gt; segments) throws IOException {&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; LOG.info(&amp;quot;SolrIndexer: starting&amp;quot;);&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
  &lt;/body&gt;
&lt;/html&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Re%3A-svn-commit%3A-r884075----lucene-nutch-trunk-src-java-org-apache-nutch-indexer-solr-SolrIndexer.java-tp26512753p26514461.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26513420</id>
	<title>[jira] Updated: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java</title>
	<published>2009-11-25T06:19:39Z</published>
	<updated>2009-11-25T06:19:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Reinhard Schwab updated NUTCH-773:
&lt;br&gt;----------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Patch Info: [Patch Available]
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; some minor bugs in AbstractFetchSchedule.java
&lt;br&gt;&amp;gt; ---------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-773
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-773&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-773&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Bug
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: fetcher, generator
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.0.0
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Reinhard Schwab
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-773.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; fixes some minor trivial bugs in AbstractFetchSchedule.java
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-773%29-some-minor-bugs-in-AbstractFetchSchedule.java-tp26513364p26513420.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26513421</id>
	<title>[jira] Updated: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java</title>
	<published>2009-11-25T06:19:39Z</published>
	<updated>2009-11-25T06:19:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Reinhard Schwab updated NUTCH-773:
&lt;br&gt;----------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: NUTCH-773.patch
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; some minor bugs in AbstractFetchSchedule.java
&lt;br&gt;&amp;gt; ---------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-773
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-773&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-773&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Bug
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: fetcher, generator
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.0.0
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Reinhard Schwab
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-773.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; fixes some minor trivial bugs in AbstractFetchSchedule.java
&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-773%29-some-minor-bugs-in-AbstractFetchSchedule.java-tp26513364p26513421.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26513364</id>
	<title>[jira] Created: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java</title>
	<published>2009-11-25T06:15:39Z</published>
	<updated>2009-11-25T06:15:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">some minor bugs in AbstractFetchSchedule.java
&lt;br&gt;---------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: NUTCH-773
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-773&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-773&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Nutch
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Bug
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Components: fetcher, generator
&lt;br&gt;&amp;nbsp; &amp;nbsp; Affects Versions: 1.0.0
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Reinhard Schwab
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Priority: Minor
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Fix For: 1.1
&lt;br&gt;&lt;br&gt;&lt;br&gt;fixes some minor trivial bugs in AbstractFetchSchedule.java
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-773%29-some-minor-bugs-in-AbstractFetchSchedule.java-tp26513364p26513364.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26512753</id>
	<title>Re: svn commit: r884075 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java</title>
	<published>2009-11-25T05:36:51Z</published>
	<updated>2009-11-25T05:36:51Z</updated>
	<author>
		<name>Dennis Kubes-2</name>
	</author>
	<content type="html">Oops. &amp;nbsp;Sorry about that.
&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26512753&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt; wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Author: ab
&lt;br&gt;&amp;gt; Date: Wed Nov 25 12:44:34 2009
&lt;br&gt;&amp;gt; New Revision: 884075
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; URL: &lt;a href=&quot;http://svn.apache.org/viewvc?rev=884075&amp;view=rev&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://svn.apache.org/viewvc?rev=884075&amp;view=rev&lt;/a&gt;&lt;br&gt;&amp;gt; Log:
&lt;br&gt;&amp;gt; Change access from private to public - this fixes Crawl.java breakage.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Modified:
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Modified: lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
&lt;br&gt;&amp;gt; URL: &lt;a href=&quot;http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java?rev=884075&amp;r1=884074&amp;r2=884075&amp;view=diff&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java?rev=884075&amp;r1=884074&amp;r2=884075&amp;view=diff&lt;/a&gt;&lt;br&gt;&amp;gt; ==============================================================================
&lt;br&gt;&amp;gt; --- lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java (original)
&lt;br&gt;&amp;gt; +++ lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java Wed Nov 25 12:44:34 2009
&lt;br&gt;&amp;gt; @@ -50,7 +50,7 @@
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp;super(conf);
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;}
&lt;br&gt;&amp;gt; &amp;nbsp;
&lt;br&gt;&amp;gt; - &amp;nbsp;private void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
&lt;br&gt;&amp;gt; + &amp;nbsp;public void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;List&amp;lt;Path&amp;gt; segments) throws IOException {
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp;LOG.info(&amp;quot;SolrIndexer: starting&amp;quot;);
&lt;br&gt;&amp;gt; &amp;nbsp;
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Re%3A-svn-commit%3A-r884075----lucene-nutch-trunk-src-java-org-apache-nutch-indexer-solr-SolrIndexer.java-tp26512753p26512753.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26512233</id>
	<title>[jira] Updated: (NUTCH-772) Upgrade Nutch to use Lucene 2.9.1</title>
	<published>2009-11-25T05:00:39Z</published>
	<updated>2009-11-25T05:00:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;updated NUTCH-772:
&lt;br&gt;------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: lucene.patch
&lt;br&gt;&lt;br&gt;Patch to commit shortly, if no objections.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Upgrade Nutch to use Lucene 2.9.1
&lt;br&gt;&amp;gt; ---------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-772
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-772&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-772&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.1
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: lucene.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Upgrade Nutch to the latest Lucene release.
&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-772%29-Upgrade-Nutch-to-use-Lucene-2.9.1-tp26511889p26512233.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26511889</id>
	<title>[jira] Created: (NUTCH-772) Upgrade Nutch to use Lucene 2.9.1</title>
	<published>2009-11-25T04:36:39Z</published>
	<updated>2009-11-25T04:36:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">Upgrade Nutch to use Lucene 2.9.1
&lt;br&gt;---------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: NUTCH-772
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-772&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-772&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Nutch
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Improvement
&lt;br&gt;&amp;nbsp; &amp;nbsp; Affects Versions: 1.1
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Andrzej Bialecki 
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Assignee: Andrzej Bialecki 
&lt;br&gt;&lt;br&gt;&lt;br&gt;Upgrade Nutch to the latest Lucene release.
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-772%29-Upgrade-Nutch-to-use-Lucene-2.9.1-tp26511889p26511889.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26504036</id>
	<title>Re: Plugin Developement Help</title>
	<published>2009-11-24T14:00:24Z</published>
	<updated>2009-11-24T14:00:24Z</updated>
	<author>
		<name>david.stuart@progressivealliance.co.uk</name>
	</author>
	<content type="html">&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;

&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;
  &lt;head&gt;
    &lt;meta content=&quot;text/html; charset=UTF-8&quot; http-equiv=&quot;Content-Type&quot; /&gt;
    &lt;title&gt;&lt;/title&gt;
  &lt;/head&gt;

  &lt;body&gt;
    Sorry keep pressing&lt;br /&gt;
    &lt;br /&gt;
    But I don&amp;#39;t quite understand how the metadata is passed from the parse to the index. In my&lt;br /&gt;
    public ParseResult filter...&lt;br /&gt;
    &lt;br /&gt;
    I do this:&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; Parse parse = parseResult.get(content.getUrl());&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; metadata = parse.getData().getParseMeta();&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; metadata.add(&amp;quot;filter_html_data&amp;quot;, docTrans);&lt;br /&gt;
    &lt;br /&gt;
    Then I return:&lt;br /&gt;
    return parseResult;&lt;br /&gt;
    &lt;br /&gt;
    Is the data passed by reference into parseResult? I ask because when I try to retrieve it in &lt;br /&gt;
    public NutchDocument filter...&lt;br /&gt;
    &lt;br /&gt;
    by doing&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; String html_filter_data = parse.getData().getMeta(&amp;quot;html_filter_data&amp;quot;);&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(html_filter_data);&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160; if (html_filter_data != null){&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(&amp;quot;________________________Adding filter data_______________________&amp;quot;);&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; &amp;#160; doc.add(&amp;quot;html_filter_data&amp;quot;, html_filter_data);&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160; }&lt;br /&gt;
    I never reach the add because the variable html_filter_data is empty.&lt;br /&gt;
    &lt;br /&gt;
    Any ideas?&lt;br /&gt;
    &lt;br /&gt;
    Thanks for your help.&lt;br /&gt;
    &lt;br /&gt;
    &lt;br /&gt;
    &lt;br /&gt;
    On 24 November 2009 at 12:27 &amp;quot;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26504036&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt;&amp;quot; &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26504036&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt;&amp;gt; wrote:&lt;br /&gt;
    &lt;br /&gt;
    &amp;gt; I thought I did, but I thought that before I ran bin/nutch index (or solrindex) it&lt;br /&gt;
    &amp;gt; would be stored somewhere. It doesn&amp;#39;t seem to be getting to the doc.add bit, which&lt;br /&gt;
    &amp;gt; makes me think the variable is empty.&lt;br /&gt;
    &amp;gt; {code}&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; public void addIndexBackendOptions(Configuration conf) {&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(&amp;quot;+_+_You called me _+_+&amp;quot;);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; LuceneWriter.addFieldOptions(&amp;quot;html_filter_data&amp;quot;, STORE.YES,&lt;br /&gt;
    &amp;gt; INDEX.UNTOKENIZED, conf);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; }&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; public NutchDocument filter(NutchDocument doc, Parse parse, Text url,&lt;br /&gt;
    &amp;gt; CrawlDatum datum, Inlinks inlinks) throws IndexingException {&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(&amp;quot;________________________FILTER_______________________&amp;quot;);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; String html_filter_data = parse.getData().getMeta(&amp;quot;html_filter_data&amp;quot;);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; if (html_filter_data != null){&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(&amp;quot;________________________Adding filter&lt;br /&gt;
    &amp;gt; data_______________________&amp;quot;);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; &amp;#160; doc.add(&amp;quot;html_filter_data&amp;quot;, html_filter_data);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; }&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; return doc;&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; }&lt;br /&gt;
    &amp;gt; {code}&lt;br /&gt;
    &amp;gt; On 24 November 2009 at 12:05 Andrzej Bialecki &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26504036&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt; wrote:&lt;br /&gt;
    &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26504036&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt; wrote:&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160;Hi All,&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; I think I have just about finished my plugin (nutch 1.0), which adds extra &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; metadata during parsing. The problem I am having is that it doesn&amp;#39;t seem to &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; be adding the data to the system (checked via luke or readseg). I looked at &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; the wiki, but it seems to be for 0.9 and the syntax looks different.&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; {code}&amp;#160; &amp;#160; &amp;#160; &amp;#160;&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160;public ParseResult filter(Content content, ParseResult parseResult, &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; HTMLMetaTags metaTags, DocumentFragment doc) {&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;Metadata metadata = new Metadata();&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;// parse the content&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;DocumentFragment root;&amp;#160; &amp;#160;&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;String docTrans;&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;try {&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;byte[] contentInOctets = content.getContent();&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;String input = new String(contentInOctets);&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;XSLTSimpleTransform DocTransform = new XSLTSimpleTransform();&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;docTrans = DocTransform.doTransform(input);&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;Parse parse = parseResult.get(content.getUrl());&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;metadata = parse.getData().getParseMeta();&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;metadata.add(&amp;quot;filter_html_data&amp;quot;, docTrans);&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;} catch (Exception e) {&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;e.printStackTrace(LogUtil.getWarnStream(LOG));&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;}&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160;return parseResult;&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160;}&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; {code}&lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; Did you declare that you are adding this field in the &lt;br /&gt;
    &amp;gt; &amp;gt; IndexingFilter.addIndexBackendOptions(..) ? See how other indexing &lt;br /&gt;
    &amp;gt; &amp;gt; plugins do this.&lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; -- &lt;br /&gt;
    &amp;gt; &amp;gt; Best regards,&lt;br /&gt;
    &amp;gt; &amp;gt; Andrzej Bialecki
  &lt;/body&gt;
&lt;/html&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Plugin-Developement-Help-tp26493932p26504036.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26503584</id>
	<title>[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20</title>
	<published>2009-11-24T13:27:39Z</published>
	<updated>2009-11-24T13:27:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782179#action_12782179&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782179#action_12782179&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;commented on NUTCH-768:
&lt;br&gt;-----------------------------------------
&lt;br&gt;&lt;br&gt;Are there any source code changes involved? If so, please upload a patch.
&lt;br&gt;&lt;br&gt;Did you check this in local, distributed or pseudo-distributed mode? In the past there have been errors related to local (or distributed) mode that wouldn't occur when running in other modes.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Upgrade Nutch 1.0 to use Hadoop 0.20
&lt;br&gt;&amp;gt; ------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-768
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-768&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-768&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.1
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: All
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Dennis Kubes
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Dennis Kubes
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Upgrade Nutch 1.0 to use the Hadoop 0.20 release. &amp;nbsp;
&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-768%29-Upgrade-Nutch-1.0-to-use-Hadoop-0.20-tp26461521p26503584.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26503569</id>
	<title>[jira] Commented: (NUTCH-771) Add WebGraph classes to the bin/nutch script</title>
	<published>2009-11-24T13:25:39Z</published>
	<updated>2009-11-24T13:25:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782177#action_12782177&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782177#action_12782177&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;commented on NUTCH-771:
&lt;br&gt;-----------------------------------------
&lt;br&gt;&lt;br&gt;+1 to adding these to the script. The names are cryptic, though ... this calls for clear documentation in the script itself, and in appropriate places on the wiki.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Add WebGraph classes to the bin/nutch script
&lt;br&gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-771
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-771&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-771&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.1
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: All, shell script
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Dennis Kubes
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Dennis Kubes
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Currently the webgraph jobs are called on the command line by calling main methods on their classes. &amp;nbsp;I propose to upgrade the bin/nutch shell script to allow calling these jobs as well. &amp;nbsp;This would include the webgraphdb, linkrank, scoreupdater, and nodedumper jobs.
&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-771%29-Add-WebGraph-classes-to-the-bin-nutch-script-tp26502990p26503569.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26503423</id>
	<title>[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20</title>
	<published>2009-11-24T13:17:39Z</published>
	<updated>2009-11-24T13:17:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782172#action_12782172&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782172#action_12782172&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Dennis Kubes commented on NUTCH-768:
&lt;br&gt;------------------------------------
&lt;br&gt;&lt;br&gt;I have tested the upgrade with Hadoop 0.20. &amp;nbsp;To upgrade this correctly we do need to upgrade Xerces both in the main lib jars and within the lib-xml plugin. &amp;nbsp;I have upgraded to the most recent version of Xerces 2.9.x. &amp;nbsp;Having run through multiple full crawl and index cycles both on the new and old indexing frameworks, including the webgraphdb, and the solr indexing process, I didn't find any errors within the process. &amp;nbsp;If no one has any objections I will commit these changes within the next 24 hours.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Upgrade Nutch 1.0 to use Hadoop 0.20
&lt;br&gt;&amp;gt; ------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-768
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-768&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-768&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.1
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: All
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Dennis Kubes
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Dennis Kubes
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Upgrade Nutch 1.0 to use the Hadoop 0.20 release. &amp;nbsp;
&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-768%29-Upgrade-Nutch-1.0-to-use-Hadoop-0.20-tp26461521p26503423.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26502990</id>
	<title>[jira] Created: (NUTCH-771) Add WebGraph classes to the bin/nutch script</title>
	<published>2009-11-24T12:47:39Z</published>
	<updated>2009-11-24T12:47:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">Add WebGraph classes to the bin/nutch script
&lt;br&gt;--------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: NUTCH-771
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-771&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-771&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Nutch
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Improvement
&lt;br&gt;&amp;nbsp; &amp;nbsp; Affects Versions: 1.1
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Environment: All, shell script
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Dennis Kubes
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Assignee: Dennis Kubes
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Fix For: 1.1
&lt;br&gt;&lt;br&gt;&lt;br&gt;Currently the webgraph jobs are called on the command line by calling main methods on their classes. &amp;nbsp;I propose to upgrade the bin/nutch shell script to allow calling these jobs as well. &amp;nbsp;This would include the webgraphdb, linkrank, scoreupdater, and nodedumper jobs.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-771%29-Add-WebGraph-classes-to-the-bin-nutch-script-tp26502990p26502990.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26499343</id>
	<title>wrong wiki front page</title>
	<published>2009-11-24T08:46:21Z</published>
	<updated>2009-11-24T08:46:21Z</updated>
	<author>
		<name>Alban Mouton</name>
	</author>
	<content type="html">Hello everybody,&lt;br&gt;&lt;br&gt;I don&amp;#39;t know if it is a known issue, but it&amp;#39;s been like that since at least a couple of days so I figured I should tell someone. The root url for the nutch wiki &lt;a href=&quot;http://wiki.apache.org/nutch/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/&lt;/a&gt; doesn&amp;#39;t redirect to &lt;a href=&quot;http://wiki.apache.org/nutch/FrontPage&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/FrontPage&lt;/a&gt; ! It&amp;#39;s annoying because that&amp;#39;s the url given by google and the nutch website. It might be a language detection problem because I see this ugly and not very helpful page : &lt;a href=&quot;http://wiki.apache.org/nutch/PageD%27Accueil&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/PageD%27Accueil&lt;/a&gt; (page d&amp;#39;accueil = home page in french).&lt;br&gt;
&lt;br&gt;Not much of a contribution for my first message here, but I hope to do more soon.&lt;br&gt;&lt;br&gt;Alban Mouton&lt;br&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/wrong-wiki-front-page-tp26499343p26499343.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26498536</id>
	<title>[Nutch Wiki] Update of &quot;FrontPage&quot; by DennisKubes</title>
	<published>2009-11-24T08:00:33Z</published>
	<updated>2009-11-24T08:00:33Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&lt;br&gt;The &amp;quot;FrontPage&amp;quot; page has been changed by DennisKubes.
&lt;br&gt;&lt;a href=&quot;http://wiki.apache.org/nutch/FrontPage?action=diff&amp;rev1=122&amp;rev2=123&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/FrontPage?action=diff&amp;rev1=122&amp;rev2=123&lt;/a&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp;* NonDefaultIntranetCrawlingOptions - Desirable options to add to your intranet crawling configuration.
&lt;br&gt;&amp;nbsp; &amp;nbsp;* RunningNutchAndSolr - How to configure Nutch to crawl, but post to Solr for search/index
&lt;br&gt;&amp;nbsp; &amp;nbsp;* NutchWithChineseAnalyzer - References to some Chinese articles explaining how to setup Nutch with 3rd party Chinese analyzers
&lt;br&gt;+ &amp;nbsp;* OptimizingCrawls - How to optimize your crawling/fetching speed with Nutch.
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; == Nutch Development ==
&lt;br&gt;&amp;nbsp; &amp;nbsp;* [[Becoming_A_Nutch_Developer|Becoming a Nutch Developer]] - Start developing and contributing to Nutch.
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22FrontPage%22-by-DennisKubes-tp26498536p26498536.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26498507</id>
	<title>[Nutch Wiki] Update of &quot;OptimizingCrawls&quot; by DennisKubes</title>
	<published>2009-11-24T07:59:31Z</published>
	<updated>2009-11-24T07:59:31Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&lt;br&gt;The &amp;quot;OptimizingCrawls&amp;quot; page has been changed by DennisKubes.
&lt;br&gt;The comment on this change is: Page about optimizing crawling speed.
&lt;br&gt;&lt;a href=&quot;http://wiki.apache.org/nutch/OptimizingCrawls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/OptimizingCrawls&lt;/a&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------------
&lt;br&gt;&lt;br&gt;New page:
&lt;br&gt;'''Here are the things that could potentially slow down fetching'''
&lt;br&gt;&lt;br&gt;1) DNS setup
&lt;br&gt;&lt;br&gt;2) The number of crawlers you have, too many, too few.
&lt;br&gt;&lt;br&gt;3) Bandwidth limitations
&lt;br&gt;&lt;br&gt;4) Number of threads per host (politeness)
&lt;br&gt;&lt;br&gt;5) Uneven distribution of urls to fetch and politeness.
&lt;br&gt;&lt;br&gt;6) High crawl-delays from robots.txt (usually along with an uneven distribution of urls).
&lt;br&gt;&lt;br&gt;7) Many slow websites (again usually with an uneven distribution).
&lt;br&gt;&lt;br&gt;8) Downloading lots of content (PDFs, very large html pages, again possibly an uneven distribution).
&lt;br&gt;&lt;br&gt;9) Others
&lt;br&gt;&lt;br&gt;'''Now how do we fix them'''
&lt;br&gt;&lt;br&gt;1) Have a DNS setup on each local crawling machine. &amp;nbsp;With multiple crawling machines and a single centralized DNS server, the lookups can act like a DoS attack on that server, slowing the entire system. &amp;nbsp;We always used a two-layer setup, hitting the local DNS cache first and then a large DNS cache like OpenDNS or Verizon.
&lt;br&gt;&lt;br&gt;2) This would be number of map tasks * fetcher.threads.fetch. &amp;nbsp;So 10 map tasks * 20 threads = 200 fetchers at once. &amp;nbsp;Too many and you overload your system; too few and, combined with other factors, the machine sits idle. &amp;nbsp;You will need to play around with this setting for your setup.
&lt;br&gt;&lt;br&gt;3) Bandwidth limitations. &amp;nbsp;Use ntop, ganglia, and other monitoring tools to determine how much bandwidth you are using. &amp;nbsp;Account for both inbound and outbound bandwidth. &amp;nbsp;A simple test: from a server inside the fetching network that is not itself fetching, connect to or download content while fetching is occurring. &amp;nbsp;If that is very slow, it is a good bet you are maxing out bandwidth. &amp;nbsp;If you set http timeout as we describe later and are maxing your bandwidth, you will start seeing many http timeout errors.
&lt;br&gt;&lt;br&gt;4) Politeness along with uneven distribution of urls is probably the biggest limiting factor. &amp;nbsp;If one thread is processing a single site and there are a lot of urls from that site to fetch, all other threads will sit idle while that one thread finishes. &amp;nbsp;Some solutions: use fetcher.server.delay to shorten the time between page fetches, and use fetcher.threads.per.host to increase the number of threads fetching for a single site (this would still be in the same map task though and hence the same JVM ChildTask process). &amp;nbsp;If you increase fetcher.threads.per.host above 1, you could also set fetcher.server.min.delay to some value &amp;gt; 0 for politeness, to min and max bound the process.
&lt;br&gt;&lt;br&gt;5) Fetching a lot of pages from a single site or a lot of pages from a few sites will slow down fetching dramatically. &amp;nbsp;For full web crawls you want an even distribution so all fetching threads can be active. &amp;nbsp;Setting generate.max.per.host to a value &amp;gt; 0 will limit the number of pages from a single host/domain to fetch.
&lt;br&gt;&lt;br&gt;6) Crawl-delay can be set in robots.txt and is obeyed by nutch. &amp;nbsp;Most sites don't use this setting but a few (some malicious) do. &amp;nbsp;I have seen crawl-delays as high as 2 days in seconds. &amp;nbsp;The fetcher.max.crawl.delay variable will ignore pages with crawl delays &amp;gt; x. &amp;nbsp;I usually set this to 10 seconds, default is 30. &amp;nbsp;Even at 10 seconds, if you have a lot of pages from a site from which you can only crawl 1 page every 10 seconds, it is going to be slow. &amp;nbsp;On the flip side, setting this to a low value means pages with longer crawl delays are ignored and not fetched.
&lt;br&gt;&lt;br&gt;7) Sometimes, many times, websites are just slow. &amp;nbsp;Setting a low value for http.timeout helps. &amp;nbsp;The default is 10 seconds. &amp;nbsp;If you don't care and want as many pages as fast as possible, set it lower. &amp;nbsp;Some websites, digg for instance, will bandwidth limit you on their side, only allowing x connections per given time frame. &amp;nbsp;So even if you only have, say, 50 pages from a single site (which I still think is too many), it may be waiting 10 seconds on each page. &amp;nbsp;The ftp.timeout can also be set if fetching ftp content.
&lt;br&gt;&lt;br&gt;8) Lots of content means slower fetching. &amp;nbsp;If downloading PDFs and other non-html documents this is especially true. &amp;nbsp;To avoid non-html content you can use the url filters. &amp;nbsp;I prefer the prefix and suffix filters. &amp;nbsp;The http.content.limit and ftp.content.limit can be used to limit the amount of content downloaded for a single document.
&lt;br&gt;&lt;br&gt;9) Other things that could be causing slow fetching:
&lt;br&gt;&lt;br&gt;&amp;nbsp;* Maxing out the number of open sockets/files on a machine. &amp;nbsp;You will start seeing IO errors or can't open socket errors.
&lt;br&gt;&amp;nbsp;* Poor routing. &amp;nbsp;Bad routers or home routers might not be able to handle the number of connections going through at once. &amp;nbsp;An incorrect routing setup could also be causing problems but those are usually much more complex to diagnose. &amp;nbsp;Use network trace and mapping tools if you think this is happening. &amp;nbsp;Upstream routing can also be a problem from your network provider.
&lt;br&gt;&amp;nbsp;* Bad network cards. &amp;nbsp;I have seen network cards flip once they reach a certain bandwidth point. &amp;nbsp;This was more prevalent on, at the time, newer gigabit cards. &amp;nbsp;Not usually my first thought but always a possibility. &amp;nbsp;Use tcpdump and network monitoring tools on the single interface.
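&lt;br&gt;&lt;br&gt;To put rough numbers on points 2) and 7) above, here is a minimal sketch in Java. &amp;nbsp;The class and method names are made up for illustration and are not part of Nutch; the code only repeats the arithmetic from the examples above (10 map tasks * 20 threads, and 50 pages from one site each waiting 10 seconds).
&lt;br&gt;&lt;br&gt;

```java
// Hypothetical helper, not a Nutch class: back-of-the-envelope fetch estimates.
class FetchEstimate {

    // Fetcher threads running at once across the cluster:
    // number of map tasks * fetcher.threads.fetch.
    public static int concurrentFetchers(int mapTasks, int threadsPerTask) {
        return mapTasks * threadsPerTask;
    }

    // Total wait, in seconds, when every page from a single site
    // waits the full per-page time (crawl delay or timeout).
    public static long worstCaseWaitSeconds(long pages, long secondsPerPage) {
        return pages * secondsPerPage;
    }

    public static void main(String[] args) {
        // 10 map tasks with 20 threads each.
        System.out.println(concurrentFetchers(10, 20));
        // 50 pages from one site, 10 seconds on each page.
        System.out.println(worstCaseWaitSeconds(50, 10));
    }
}
```

&lt;br&gt;&lt;br&gt;With these example values the program prints 200 and 500. &amp;nbsp;In other words, one slow or heavily delayed site can hold a thread for over 8 minutes even with only 50 pages queued, which is why an even distribution of urls matters so much.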
&lt;br&gt;&lt;br&gt;That is about it from my perspective. &amp;nbsp;Feel free to add anything if anybody else thinks of other things.
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22OptimizingCrawls%22-by-DennisKubes-tp26498507p26498507.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26498016</id>
	<title>Re: Plugin Developement Help</title>
	<published>2009-11-24T07:32:43Z</published>
	<updated>2009-11-24T07:32:43Z</updated>
	<author>
		<name>david.stuart@progressivealliance.co.uk</name>
	</author>
	<content type="html">&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;

&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;
  &lt;head&gt;
    &lt;meta content=&quot;text/html; charset=UTF-8&quot; http-equiv=&quot;Content-Type&quot; /&gt;
    &lt;title&gt;&lt;/title&gt;
  &lt;/head&gt;

  &lt;body&gt;
    Sorry its suppose to say &amp;quot;would be stored somewhere it DOESN&amp;#39;T seem to be getting to the doc.add bit which&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    On 24 November 2009 at 12:27 &amp;quot;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26498016&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt;&amp;quot; &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26498016&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt;&amp;gt; wrote:&lt;br /&gt;
    &lt;br /&gt;
    &amp;gt; I thought I did but I thought before I did a bin/nutch index (or solrindex) it&lt;br /&gt;
    &amp;gt; would be stored somewhere it does seems to be getting to the doc.add bit which&lt;br /&gt;
    &amp;gt; makes me think the variable is empty&lt;br /&gt;
    &amp;gt; {code}&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; public void addIndexBackendOptions(Configuration conf) {&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(&amp;quot;+_+_You called me _+_+&amp;quot;);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; LuceneWriter.addFieldOptions(&amp;quot;html_filter_data&amp;quot;, STORE.YES,&lt;br /&gt;
    &amp;gt; INDEX.UNTOKENIZED, conf);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; }&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; public NutchDocument filter(NutchDocument doc, Parse parse, Text url,&lt;br /&gt;
    &amp;gt; CrawlDatum datum, Inlinks inlinks) throws IndexingException {&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(&amp;quot;________________________FILTER_______________________&amp;quot;);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; String html_filter_data = parse.getData().getMeta(&amp;quot;html_filter_data&amp;quot;);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; if (html_filter_data != null){&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(&amp;quot;________________________Adding filter&lt;br /&gt;
    &amp;gt; data_______________________&amp;quot;);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; &amp;#160; doc.add(&amp;quot;html_filter_data&amp;quot;, html_filter_data);&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; }&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; &amp;#160; return doc;&lt;br /&gt;
    &amp;gt; &amp;#160;&amp;#160;&amp;#160; }&lt;br /&gt;
    &amp;gt; {code}&lt;br /&gt;
    &amp;gt; On 24 November 2009 at 12:05 Andrzej Bialecki &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26498016&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt; wrote:&lt;br /&gt;
    &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26498016&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt; wrote:&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160;Hi All,&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; I think I am just about finished my plugin (nutch 1.0) which adds extra &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; metadata to during parsing the problem I am having is it doesn&amp;#39;t seem to &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; be adding the data to the system (via luke or readseg). I looked at in &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; the wiki but it seems to be for 0.9 and the syntax looks different.&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; {code}&amp;#160; &amp;#160; &amp;#160; &amp;#160;&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160;public ParseResult filter(Content content, ParseResult parseResult, &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; HTMLMetaTags metaTags, DocumentFragment doc) {&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;Metadata metadata = new Metadata();&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;// parse the content&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;DocumentFragment root;&amp;#160; &amp;#160;&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;String docTrans;&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;try {&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;byte[] contentInOctets = content.getContent();&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;String input = new String(contentInOctets);&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;XSLTSimpleTransform DocTransform = new XSLTSimpleTransform();&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;docTrans = DocTransform.doTransform(input);&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;Parse parse = parseResult.get(content.getUrl());&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;metadata = parse.getData().getParseMeta();&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;metadata.add(&amp;quot;filter_html_data&amp;quot;, docTrans);&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;} catch (Exception e) {&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;e.printStackTrace(LogUtil.getWarnStream(LOG));&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;}&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160;return parseResult;&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt;&amp;#160; &amp;#160;}&lt;br /&gt;
    &amp;gt; &amp;gt; &amp;gt; {code}&lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; Did you declare that you are adding this field in the &lt;br /&gt;
    &amp;gt; &amp;gt; IndexingFilter.addIndexBackendOptions(..) ? See how other indexing &lt;br /&gt;
    &amp;gt; &amp;gt; plugins do this.&lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; -- &lt;br /&gt;
    &amp;gt; &amp;gt; Best regards,&lt;br /&gt;
    &amp;gt; &amp;gt; Andrzej Bialecki&amp;#160; &amp;#160; &amp;#160;&amp;lt;&amp;gt;&amp;lt;&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160;___. ___ ___ ___ _ _&amp;#160; &amp;#160;__________________________________&lt;br /&gt;
    &amp;gt; &amp;gt; [__ || __|__/|__||\/|&amp;#160; Information Retrieval, Semantic Web&lt;br /&gt;
    &amp;gt; &amp;gt; ___|||__||&amp;#160; \|&amp;#160; ||&amp;#160; |&amp;#160; Embedded Unix, System Integration&lt;br /&gt;
    &amp;gt; &amp;gt; http://www.sigram.com&amp;#160; Contact: info at sigram dot com&lt;br /&gt;
    &amp;gt; &amp;gt;
  &lt;/body&gt;
&lt;/html&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Plugin-Developement-Help-tp26493932p26498016.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26494634</id>
	<title>Re: Plugin Developement Help</title>
	<published>2009-11-24T03:53:20Z</published>
	<updated>2009-11-24T03:53:20Z</updated>
	<author>
		<name>david.stuart@progressivealliance.co.uk</name>
	</author>
	<content type="html">Sorry I meant doesn't get to doc.add
&lt;br&gt;&lt;br&gt;David
&lt;br&gt;&lt;br&gt;On 24 Nov 2009, at 11:27, &amp;quot;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26494634&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt;&amp;quot; &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26494634&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt; 
&lt;br&gt;&amp;nbsp;&amp;gt; wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; I thought I did but I thought before I did a bin/nutch index (or &amp;nbsp;
&lt;br&gt;&amp;gt; solrindex) it would be stored somewhere it does seems to be getting &amp;nbsp;
&lt;br&gt;&amp;gt; to the doc.add bit which makes me think the variable is empty
&lt;br&gt;&amp;gt; {code}
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; public void addIndexBackendOptions(Configuration conf) {
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; LOG.warn(&amp;quot;+_+_You called me _+_+&amp;quot;);
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; LuceneWriter.addFieldOptions(&amp;quot;html_filter_data&amp;quot;, STORE.YES, &amp;nbsp;
&lt;br&gt;&amp;gt; INDEX.UNTOKENIZED, conf);
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; }
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; public NutchDocument filter(NutchDocument doc, Parse parse, Text &amp;nbsp;
&lt;br&gt;&amp;gt; url, CrawlDatum datum, Inlinks inlinks) throws IndexingException {
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; LOG.warn 
&lt;br&gt;&amp;gt; (&amp;quot;________________________FILTER_______________________&amp;quot;);
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; String html_filter_data = parse.getData().getMeta 
&lt;br&gt;&amp;gt; (&amp;quot;html_filter_data&amp;quot;);
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; if (html_filter_data != null){
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; LOG.warn(&amp;quot;________________________Adding filter &amp;nbsp;
&lt;br&gt;&amp;gt; data_______________________&amp;quot;);
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; doc.add(&amp;quot;html_filter_data&amp;quot;, html_filter_data);
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; }
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; return doc;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; }
&lt;br&gt;&amp;gt; {code}
&lt;br&gt;&amp;gt; On 24 November 2009 at 12:05 Andrzej Bialecki &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26494634&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26494634&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt; wrote:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; Hi All,
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; I think I am just about finished my plugin (nutch 1.0) which &amp;nbsp;
&lt;br&gt;&amp;gt; adds extra
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; metadata to during parsing the problem I am having is it doesn't &amp;nbsp;
&lt;br&gt;&amp;gt; seem to
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; be adding the data to the system (via luke or readseg). I looked &amp;nbsp;
&lt;br&gt;&amp;gt; at in
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; the wiki but it seems to be for 0.9 and the syntax looks &amp;nbsp;
&lt;br&gt;&amp;gt; different.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; {code}
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; public ParseResult filter(Content content, ParseResult &amp;nbsp;
&lt;br&gt;&amp;gt; parseResult,
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; HTMLMetaTags metaTags, DocumentFragment doc) {
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; Metadata metadata = new Metadata();
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; // parse the content
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; DocumentFragment root;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; String docTrans;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; try {
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; byte[] contentInOctets = content.getContent();
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; String input = new String(contentInOctets);
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; XSLTSimpleTransform DocTransform = new &amp;nbsp;
&lt;br&gt;&amp;gt; XSLTSimpleTransform();
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; docTrans = DocTransform.doTransform(input);
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Parse parse = parseResult.get(content.getUrl());
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; metadata = parse.getData().getParseMeta();
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; metadata.add(&amp;quot;filter_html_data&amp;quot;, docTrans);
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; } catch (Exception e) {
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; e.printStackTrace(LogUtil.getWarnStream(LOG));
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; }
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; return parseResult;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;nbsp; }
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; {code}
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Did you declare that you are adding this field in the
&lt;br&gt;&amp;gt; &amp;gt; IndexingFilter.addIndexBackendOptions(..) ? See how other indexing
&lt;br&gt;&amp;gt; &amp;gt; plugins do this.
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; --
&lt;br&gt;&amp;gt; &amp;gt; Best regards,
&lt;br&gt;&amp;gt; &amp;gt; Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;nbsp; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;&amp;gt; &amp;gt; [__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt; &amp;gt; ___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&amp;gt; &amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Plugin-Developement-Help-tp26493932p26494634.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26494307</id>
	<title>Re: Plugin Developement Help</title>
	<published>2009-11-24T03:27:08Z</published>
	<updated>2009-11-24T03:27:08Z</updated>
	<author>
		<name>david.stuart@progressivealliance.co.uk</name>
	</author>
	<content type="html">&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;

&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;
  &lt;head&gt;
    &lt;meta content=&quot;text/html; charset=UTF-8&quot; http-equiv=&quot;Content-Type&quot; /&gt;
    &lt;title&gt;&lt;/title&gt;
  &lt;/head&gt;

  &lt;body&gt;
    I thought I did but I thought before I did a bin/nutch index (or solrindex) it would be stored somewhere it does seems to be getting to the doc.add bit which makes me think the variable is empty&lt;br /&gt;
    {code}&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; public void addIndexBackendOptions(Configuration conf) {&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(&amp;quot;+_+_You called me _+_+&amp;quot;);&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160; LuceneWriter.addFieldOptions(&amp;quot;html_filter_data&amp;quot;, STORE.YES, INDEX.UNTOKENIZED, conf);&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; }&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException {&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(&amp;quot;________________________FILTER_______________________&amp;quot;);&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160; String html_filter_data = parse.getData().getMeta(&amp;quot;html_filter_data&amp;quot;);&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160; if (html_filter_data != null){&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; &amp;#160; LOG.warn(&amp;quot;________________________Adding filter data_______________________&amp;quot;);&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160;&amp;#160;&amp;#160; &amp;#160; doc.add(&amp;quot;html_filter_data&amp;quot;, html_filter_data);&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160; }&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; &amp;#160; return doc;&lt;br /&gt;
    &amp;#160;&amp;#160;&amp;#160; }&lt;br /&gt;
    {code}&lt;br /&gt;
    On 24 November 2009 at 12:05 Andrzej Bialecki &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26494307&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt; wrote:&lt;br /&gt;
    &lt;br /&gt;
    &amp;gt; &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26494307&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;david.stuart@...&lt;/a&gt; wrote:&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160;Hi All,&lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; I think I am just about finished my plugin (nutch 1.0) which adds extra &lt;br /&gt;
    &amp;gt; &amp;gt; metadata to during parsing the problem I am having is it doesn&amp;#39;t seem to &lt;br /&gt;
    &amp;gt; &amp;gt; be adding the data to the system (via luke or readseg). I looked at in &lt;br /&gt;
    &amp;gt; &amp;gt; the wiki but it seems to be for 0.9 and the syntax looks different.&lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt; {code}&amp;#160; &amp;#160; &amp;#160; &amp;#160;&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160;public ParseResult filter(Content content, ParseResult parseResult, &lt;br /&gt;
    &amp;gt; &amp;gt; HTMLMetaTags metaTags, DocumentFragment doc) {&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;Metadata metadata = new Metadata();&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;// parse the content&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;DocumentFragment root;&amp;#160; &amp;#160;&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;String docTrans;&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;try {&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;byte[] contentInOctets = content.getContent();&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;String input = new String(contentInOctets);&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;XSLTSimpleTransform DocTransform = new XSLTSimpleTransform();&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;docTrans = DocTransform.doTransform(input);&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;Parse parse = parseResult.get(content.getUrl());&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;metadata = parse.getData().getParseMeta();&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;metadata.add(&amp;quot;filter_html_data&amp;quot;, docTrans);&lt;br /&gt;
    &amp;gt; &amp;gt; &lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;} catch (Exception e) {&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160; &amp;#160;e.printStackTrace(LogUtil.getWarnStream(LOG));&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &amp;#160;}&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160; &lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160; &amp;#160;return parseResult;&lt;br /&gt;
    &amp;gt; &amp;gt;&amp;#160; &amp;#160;}&lt;br /&gt;
    &amp;gt; &amp;gt; {code}&lt;br /&gt;
    &amp;gt; &lt;br /&gt;
    &amp;gt; Did you declare that you are adding this field in &lt;br /&gt;
    &amp;gt; IndexingFilter.addIndexBackendOptions(..)? See how other indexing &lt;br /&gt;
    &amp;gt; plugins do this.&lt;br /&gt;
    &amp;gt; &lt;br /&gt;
    &amp;gt; &lt;br /&gt;
    &amp;gt; -- &lt;br /&gt;
    &amp;gt; Best regards,&lt;br /&gt;
    &amp;gt; Andrzej Bialecki&amp;#160; &amp;#160; &amp;#160;&amp;lt;&amp;gt;&amp;lt;&lt;br /&gt;
    &amp;gt;&amp;#160; &amp;#160;___. ___ ___ ___ _ _&amp;#160; &amp;#160;__________________________________&lt;br /&gt;
    &amp;gt; [__ || __|__/|__||\/|&amp;#160; Information Retrieval, Semantic Web&lt;br /&gt;
    &amp;gt; ___|||__||&amp;#160; \|&amp;#160; ||&amp;#160; |&amp;#160; Embedded Unix, System Integration&lt;br /&gt;
    &amp;gt; http://www.sigram.com&amp;#160; Contact: info at sigram dot com&lt;br /&gt;
    &amp;gt; &lt;br /&gt;
  &lt;/body&gt;
&lt;/html&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Plugin-Developement-Help-tp26493932p26494307.html" />
</entry>

</feed>
