<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<id>tag:old.nabble.com,2006:forum-362</id>
	<title>Nabble - Nutch</title>
	<updated>2009-12-05T21:26:15Z</updated>
	<link rel="self" type="application/atom+xml" href="http://old.nabble.com/Nutch-f362.xml" />
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch-f362.html" />
	<subtitle type="html">Nutch is web search software. It builds on the Apache Lucene search library, adding a crawler, web database (including full link graph), plugins for various document formats, user interface, etc. Nutch home is &lt;a href=&quot;http://lucene.apache.org/nutch/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;here&lt;/a&gt;.</subtitle>
	
<entry>
	<id>tag:old.nabble.com,2006:post-26662396</id>
	<title>RE: Indexing with solrindexer -&gt; OutOfMemoryError</title>
	<published>2009-12-05T21:26:15Z</published>
	<updated>2009-12-05T21:26:15Z</updated>
	<author>
		<name>miagomiago</name>
	</author>
	<content type="html">&lt;br&gt;hi,
&lt;br&gt;You have to make your segments smaller than that; just split every segment into smaller pieces.
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Subject: Indexing with solrindexer -&amp;gt; OutOfMemoryError
&lt;br&gt;&amp;gt; From: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26662396&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;felizimm@...&lt;/a&gt;
&lt;br&gt;&amp;gt; To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26662396&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;&amp;gt; Date: Sun, 6 Dec 2009 01:35:04 +0100
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Hi,
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; When trying to index four segments (~5 GB) with solrindexer, I get this
&lt;br&gt;&amp;gt; error in hadoop.log. There is no error in the logs of Tomcat, where I
&lt;br&gt;&amp;gt; deployed Solr. I crawled with the &amp;quot;crawl&amp;quot; command.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I've read that increasing the Hadoop heap space will change nothing.
&lt;br&gt;&amp;gt; What can I do?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Thanks for your help!
&lt;br&gt;&amp;gt; Felix.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 2009-12-06 00:21:51,061 WARN &amp;nbsp;mapred.LocalJobRunner - job_local_0001
&lt;br&gt;&amp;gt; java.lang.OutOfMemoryError: Java heap space
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.util.Arrays.copyOf(Arrays.java:2882)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.StringBuffer.append(StringBuffer.java:320)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.io.StringWriter.write(StringWriter.java:60)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.common.util.XML.escape(XML.java:180)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.common.util.XML.escapeCharData(XML.java:78)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.common.util.XML.writeXML(XML.java:148)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:117)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:169)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(UpdateRequest.java:160)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:191)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:48)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:58)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
&lt;br&gt;&amp;gt; 2009-12-06 00:21:51,650 FATAL solr.SolrIndexer - SolrIndexer:
&lt;br&gt;&amp;gt; java.io.IOException: Job failed!
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.solr.SolrIndexer.indexSolr(SolrIndexer.java:73)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.solr.SolrIndexer.run(SolrIndexer.java:95)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.solr.SolrIndexer.main(SolrIndexer.java:104)
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;/div&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Indexing-with-solrindexer--%3E-OutOfMemoryError-tp26660972p26662396.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26660972</id>
	<title>Indexing with solrindexer -&gt; OutOfMemoryError</title>
	<published>2009-12-05T16:35:04Z</published>
	<updated>2009-12-05T16:35:04Z</updated>
	<author>
		<name>Felix Zimmermann-2</name>
	</author>
	<content type="html">Hi,
&lt;br&gt;&lt;br&gt;When trying to index four segments (~5 GB) with solrindexer, I get this
&lt;br&gt;error in hadoop.log. There is no error in the logs of Tomcat, where I
&lt;br&gt;deployed Solr. I crawled with the &amp;quot;crawl&amp;quot; command.
&lt;br&gt;&lt;br&gt;I've read that increasing the Hadoop heap space will change nothing.
&lt;br&gt;What can I do?
&lt;br&gt;&lt;br&gt;Thanks for your help!
&lt;br&gt;Felix.
&lt;br&gt;&lt;br&gt;&lt;br&gt;2009-12-06 00:21:51,061 WARN &amp;nbsp;mapred.LocalJobRunner - job_local_0001
&lt;br&gt;java.lang.OutOfMemoryError: Java heap space
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.util.Arrays.copyOf(Arrays.java:2882)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.StringBuffer.append(StringBuffer.java:320)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.io.StringWriter.write(StringWriter.java:60)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.common.util.XML.escape(XML.java:180)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.common.util.XML.escapeCharData(XML.java:78)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.common.util.XML.writeXML(XML.java:148)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:117)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:169)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(UpdateRequest.java:160)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:191)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:48)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:58)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
&lt;br&gt;2009-12-06 00:21:51,650 FATAL solr.SolrIndexer - SolrIndexer:
&lt;br&gt;java.io.IOException: Job failed!
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.solr.SolrIndexer.indexSolr(SolrIndexer.java:73)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.solr.SolrIndexer.run(SolrIndexer.java:95)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.indexer.solr.SolrIndexer.main(SolrIndexer.java:104)
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Indexing-with-solrindexer--%3E-OutOfMemoryError-tp26660972p26660972.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26657756</id>
	<title>Re: HTTP Header problem</title>
	<published>2009-12-05T09:54:48Z</published>
	<updated>2009-12-05T09:54:48Z</updated>
	<author>
		<name>Kirk Gillock</name>
	</author>
	<content type="html">Thank you for the quick reply, Dennis. It was worth a shot. :-)
&lt;br&gt;&lt;br&gt;People are not typically searching for our own name on our own site but, 
&lt;br&gt;in case it did happen, we wanted to have the results be as clean as 
&lt;br&gt;possible. For our next crawls we'll change the agent name and version to 
&lt;br&gt;something else.
&lt;br&gt;&lt;br&gt;Thanks again,
&lt;br&gt;Kirk
&lt;br&gt;&lt;br&gt;&lt;br&gt;Dennis Kubes wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; There isn't a way to stop this from happening really except to change 
&lt;br&gt;&amp;gt; the agent name in the Nutch configuration. &amp;nbsp;When an HTTP request is 
&lt;br&gt;&amp;gt; made, the agent name is sent as a header. &amp;nbsp;There are many pages, as you 
&lt;br&gt;&amp;gt; say, that simply have logs of different user-agents hitting their sites 
&lt;br&gt;&amp;gt; or have a script to spit back the user agent when a crawler is detected.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Kirk Gillock wrote:
&lt;br&gt;&amp;gt;&amp;gt; Hi fellow Nutch users.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Long time crawler, first time poster. :-)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; We're 23m pages into a 100m page crawl and our preliminary tests have 
&lt;br&gt;&amp;gt;&amp;gt; shown that a lot of pages contain our agent name, description, etc., 
&lt;br&gt;&amp;gt;&amp;gt; in their page content. Meaning, sites that have a script which show 
&lt;br&gt;&amp;gt;&amp;gt; http headers (typically to show browser information) causes the Nutch 
&lt;br&gt;&amp;gt;&amp;gt; crawler to store its own header information within the content of 
&lt;br&gt;&amp;gt;&amp;gt; that page. So when we search our index for &amp;quot;Isara&amp;quot; (our agent name) 
&lt;br&gt;&amp;gt;&amp;gt; we get thousands of results and they all have &amp;quot;Isara/Isara-1.0 (A 
&lt;br&gt;&amp;gt;&amp;gt; non-profit search engine benefiting charity.; &lt;a href=&quot;http://www.isara.org;&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.isara.org;&lt;/a&gt;&amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26657756&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;e-mail@...&lt;/a&gt;&amp;quot;, which is the content of our nutch-default.xml 
&lt;br&gt;&amp;gt;&amp;gt; file: http.agent.name, http.agent.description, http.agent.url, 
&lt;br&gt;&amp;gt;&amp;gt; http.agent.email, and http.agent.version .
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I've searched around and haven't found any information on how to stop 
&lt;br&gt;&amp;gt;&amp;gt; this from happening. Is there a solution and, if so, will it mean we 
&lt;br&gt;&amp;gt;&amp;gt; need to recrawl all those pages again or can we filter the current 
&lt;br&gt;&amp;gt;&amp;gt; database? Any suggestions would be greatly appreciated.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Thank you for developing such an important open-source application,
&lt;br&gt;&amp;gt;&amp;gt; Kirk Gillock
&lt;br&gt;&amp;gt;&amp;gt; Isara Charity Foundation
&lt;br&gt;&amp;gt;&amp;gt; Nong Khai, Thailand
&lt;br&gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://www.isara.org&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.isara.org&lt;/a&gt;
&lt;/div&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---Agent-f372.html&quot; embed=&quot;fixTarget[372]&quot; target=&quot;_top&quot; &gt;Nutch - Agent&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/HTTP-Header-problem-tp26656109p26657756.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26657229</id>
	<title>[jira] Commented: (NUTCH-770) Timebomb for Fetcher</title>
	<published>2009-12-05T08:51:21Z</published>
	<updated>2009-12-05T08:51:21Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12786443#action_12786443&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12786443#action_12786443&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;MilleBii commented on NUTCH-770:
&lt;br&gt;--------------------------------
&lt;br&gt;&lt;br&gt;Tried it successfully on a Windows platform.
&lt;br&gt;&lt;br&gt;It does not work on an Ubuntu, pseudo-distributed Hadoop configuration with mappers running in parallel ????
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Timebomb for Fetcher
&lt;br&gt;&amp;gt; --------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-770
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-770&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-770&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: log-770, NUTCH-770-v2.patch, NUTCH-770-v3.patch, NUTCH-770.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This makes it possible to keep the Fetch step under control and works well in combination with NUTCH-769.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---Dev-f373.html&quot; embed=&quot;fixTarget[373]&quot; target=&quot;_top&quot; &gt;Nutch - Dev&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-770%29-Timebomb-for-Fetcher-tp26476267p26657229.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26657230</id>
	<title>[jira] Issue Comment Edited: (NUTCH-770) Timebomb for Fetcher</title>
	<published>2009-12-05T08:51:21Z</published>
	<updated>2009-12-05T08:51:21Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12786443#action_12786443&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12786443#action_12786443&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;MilleBii edited comment on NUTCH-770 at 12/5/09 4:50 PM:
&lt;br&gt;---------------------------------------------------------
&lt;br&gt;&lt;br&gt;Tried it successfully on a Windows platform.
&lt;br&gt;&lt;br&gt;It does not work on an Ubuntu, pseudo-distributed Hadoop configuration with two mappers running in parallel ????
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; was (Author: millebii):
&lt;br&gt;&amp;nbsp; &amp;nbsp; Tried it successfully on a Windows platform.
&lt;br&gt;&lt;br&gt;It does not work on an Ubuntu, pseudo-distributed Hadoop configuration with mappers running in parallel ????
&lt;br&gt;&lt;br&gt;&lt;br&gt;&amp;nbsp; 
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Timebomb for Fetcher
&lt;br&gt;&amp;gt; --------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-770
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-770&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-770&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: log-770, NUTCH-770-v2.patch, NUTCH-770-v3.patch, NUTCH-770.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This makes it possible to keep the Fetch step under control and works well in combination with NUTCH-769.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---Dev-f373.html&quot; embed=&quot;fixTarget[373]&quot; target=&quot;_top&quot; &gt;Nutch - Dev&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-770%29-Timebomb-for-Fetcher-tp26476267p26657230.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26656906</id>
	<title>Re: HTTP Header problem</title>
	<published>2009-12-05T07:53:01Z</published>
	<updated>2009-12-05T07:53:01Z</updated>
	<author>
		<name>Dennis Kubes-2</name>
	</author>
	<content type="html">There isn't a way to stop this from happening really except to change 
&lt;br&gt;the agent name in the Nutch configuration. &amp;nbsp;When an HTTP request is 
&lt;br&gt;made, the agent name is sent as a header. &amp;nbsp;There are many pages, as you 
&lt;br&gt;say, that simply have logs of different user-agents hitting their sites 
&lt;br&gt;or have a script to spit back the user agent when a crawler is detected.
&lt;br&gt;&lt;br&gt;Dennis
&lt;br&gt;&lt;br&gt;Kirk Gillock wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hi fellow Nutch users.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Long time crawler, first time poster. :-)
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; We're 23m pages into a 100m page crawl and our preliminary tests have 
&lt;br&gt;&amp;gt; shown that a lot of pages contain our agent name, description, etc., in 
&lt;br&gt;&amp;gt; their page content. Meaning, sites that have a script which show http 
&lt;br&gt;&amp;gt; headers (typically to show browser information) causes the Nutch crawler 
&lt;br&gt;&amp;gt; to store its own header information within the content of that page. So 
&lt;br&gt;&amp;gt; when we search our index for &amp;quot;Isara&amp;quot; (our agent name) we get thousands 
&lt;br&gt;&amp;gt; of results and they all have &amp;quot;Isara/Isara-1.0 (A non-profit search 
&lt;br&gt;&amp;gt; engine benefiting charity.; &lt;a href=&quot;http://www.isara.org;&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.isara.org;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26656906&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;e-mail@...&lt;/a&gt;&amp;quot;, 
&lt;br&gt;&amp;gt; which is the content of our nutch-default.xml file: http.agent.name, 
&lt;br&gt;&amp;gt; http.agent.description, http.agent.url, http.agent.email, and 
&lt;br&gt;&amp;gt; http.agent.version .
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I've searched around and haven't found any information on how to stop 
&lt;br&gt;&amp;gt; this from happening. Is there a solution and, if so, will it mean we 
&lt;br&gt;&amp;gt; need to recrawl all those pages again or can we filter the current 
&lt;br&gt;&amp;gt; database? Any suggestions would be greatly appreciated.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Thank you for developing such an important open-source application,
&lt;br&gt;&amp;gt; Kirk Gillock
&lt;br&gt;&amp;gt; Isara Charity Foundation
&lt;br&gt;&amp;gt; Nong Khai, Thailand
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.isara.org&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.isara.org&lt;/a&gt;&lt;br&gt;&lt;/div&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---Agent-f372.html&quot; embed=&quot;fixTarget[372]&quot; target=&quot;_top&quot; &gt;Nutch - Agent&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/HTTP-Header-problem-tp26656109p26656906.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26656318</id>
	<title>State of nutchbase</title>
	<published>2009-12-05T06:56:49Z</published>
	<updated>2009-12-05T06:56:49Z</updated>
	<author>
		<name>Alban Mouton</name>
	</author>
	<content type="html">Hello,&lt;br&gt;&lt;br&gt;I have looked a little into the Nutch code and mailing lists. I think the nutchbase branch (&lt;a href=&quot;http://issues.apache.org/jira/browse/NUTCH-650&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://issues.apache.org/jira/browse/NUTCH-650&lt;/a&gt;) is very interesting, with good potential to improve code clarity and flexibility (I find the data structures quite obscure in the current version). The issue has been untouched since last August, so my question is: can nutchbase really be part of Nutch 1.1? Is there still much work to do, or is it almost ready? Is it a worthwhile issue for an interested developer with a (still!) limited knowledge of the project?&lt;br&gt;
&lt;br&gt;So far I have only tried to run nutchbase in Eclipse by following the tutorial (&lt;a href=&quot;http://wiki.apache.org/nutch/RunNutchInEclipse1.0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/RunNutchInEclipse1.0&lt;/a&gt;), but I run into errors when building, mostly from Parser and tests. I may start by cleaning this up.&lt;br&gt;
&lt;br&gt;Eclipse build errors:&lt;br&gt;&lt;br&gt;Description    Resource    Path    Location    Type&lt;br&gt;FetcherOutputFormat cannot be resolved to a type    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 362    Java Problem&lt;br&gt;
Generator.GENERATE_MAX_PER_HOST_BY_IP cannot be resolved    TestGenerator.java    /nutchbase/src/test/org/apache/nutch/crawl    line 202    Java Problem&lt;br&gt;ParseImpl cannot be resolved to a type    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 229    Java Problem&lt;br&gt;
ParseImpl cannot be resolved to a type    BasicFields.java    /nutchbase/src/java/org/apache/nutch/indexer/field    line 335    Java Problem&lt;br&gt;ParseImpl cannot be resolved to a type    ExtParser.java    /nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext    line 138    Java Problem&lt;br&gt;
ParseImpl cannot be resolved to a type    MSBaseParser.java    /nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms    line 108    Java Problem&lt;br&gt;ParseImpl cannot be resolved to a type    OOParser.java    /nutchbase/src/plugin/parse-oo/src/java/org/apache/nutch/parse/oo    line 103    Java Problem&lt;br&gt;
ParseImpl cannot be resolved to a type    PdfParser.java    /nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf    line 155    Java Problem&lt;br&gt;ParseImpl cannot be resolved to a type    RSSParser.java    /nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss    line 187    Java Problem&lt;br&gt;
ParseImpl cannot be resolved to a type    SWFParser.java    /nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf    line 113    Java Problem&lt;br&gt;ParseImpl cannot be resolved to a type    TestIndexingFilters.java    /nutchbase/src/test/org/apache/nutch/indexer    line 45    Java Problem&lt;br&gt;
ParseImpl cannot be resolved to a type    TestMoreIndexingFilter.java    /nutchbase/src/plugin/index-more/src/test/org/apache/nutch/indexer/more    line 61    Java Problem&lt;br&gt;ParseImpl cannot be resolved to a type    TextParser.java    /nutchbase/src/plugin/parse-text/src/java/org/apache/nutch/parse/text    line 55    Java Problem&lt;br&gt;
ParseImpl cannot be resolved to a type    ZipParser.java    /nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line 105    Java Problem&lt;br&gt;ParseResult cannot be resolved    ExtParser.java    /nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext    line 137    Java Problem&lt;br&gt;
ParseResult cannot be resolved    MSBaseParser.java    /nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms    line 107    Java Problem&lt;br&gt;ParseResult cannot be resolved    OOParser.java    /nutchbase/src/plugin/parse-oo/src/java/org/apache/nutch/parse/oo    line 103    Java Problem&lt;br&gt;
ParseResult cannot be resolved    PdfParser.java    /nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf    line 155    Java Problem&lt;br&gt;ParseResult cannot be resolved    RSSParser.java    /nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss    line 187    Java Problem&lt;br&gt;
ParseResult cannot be resolved    SWFParser.java    /nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf    line 113    Java Problem&lt;br&gt;ParseResult cannot be resolved    TextParser.java    /nutchbase/src/plugin/parse-text/src/java/org/apache/nutch/parse/text    line 55    Java Problem&lt;br&gt;
ParseResult cannot be resolved    ZipParser.java    /nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line 105    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 159    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    CCParseFilter.java    /nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch    line 267    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    CCParseFilter.java    /nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch    line 267    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    ExtParser.java    /nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext    line 69    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    FeedParser.java    /nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line 106    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    FeedParser.java    /nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line 108    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    FeedParser.java    /nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line 108    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    FeedParser.java    /nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line 211    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    FeedParser.java    /nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line 221    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    HTMLLanguageParser.java    /nutchbase/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang    line 90    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    HTMLLanguageParser.java    /nutchbase/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang    line 90    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    MSBaseParser.java    /nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms    line 64    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    MSExcelParser.java    /nutchbase/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel    line 40    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    MSPowerPointParser.java    /nutchbase/src/plugin/parse-mspowerpoint/src/java/org/apache/nutch/parse/mspowerpoint    line 44    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    MSWordParser.java    /nutchbase/src/plugin/parse-msword/src/java/org/apache/nutch/parse/msword    line 43    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    OOParser.java    /nutchbase/src/plugin/parse-oo/src/java/org/apache/nutch/parse/oo    line 63    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    PdfParser.java    /nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf    line 69    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    RSSParser.java    /nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss    line 80    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    RelTagParser.java    /nutchbase/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag    line 68    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    RelTagParser.java    /nutchbase/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag    line 68    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    SWFParser.java    /nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf    line 64    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    SWFParser.java    /nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf    line 125    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    TestFeedParser.java    /nutchbase/src/plugin/feed/src/test/org/apache/nutch/parse/feed    line 94    Java Problem&lt;br&gt;
ParseResult cannot be resolved to a type    TextParser.java    /nutchbase/src/plugin/parse-text/src/java/org/apache/nutch/parse/text    line 41    Java Problem&lt;br&gt;ParseResult cannot be resolved to a type    ZipParser.java    /nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line 55    Java Problem&lt;br&gt;
The constructor Fetcher(Configuration) is undefined    TestFetcher.java    /nutchbase/src/test/org/apache/nutch/fetcher    line 100    Java Problem&lt;br&gt;The constructor Fetcher(Configuration) is undefined    TestFetcher.java    /nutchbase/src/test/org/apache/nutch/fetcher    line 177    Java Problem&lt;br&gt;
The constructor Generator(Configuration) is undefined    TestFetcher.java    /nutchbase/src/test/org/apache/nutch/fetcher    line 94    Java Problem&lt;br&gt;The constructor Generator(Configuration) is undefined    TestGenerator.java    /nutchbase/src/test/org/apache/nutch/crawl    line 312    Java Problem&lt;br&gt;
The constructor Injector(Configuration) is undefined    TestFetcher.java    /nutchbase/src/test/org/apache/nutch/fetcher    line 90    Java Problem&lt;br&gt;The constructor Injector(Configuration) is undefined    TestInjector.java    /nutchbase/src/test/org/apache/nutch/crawl    line 70    Java Problem&lt;br&gt;
The constructor NutchWritable(ParseImpl) is undefined    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 229    Java Problem&lt;br&gt;The import org.apache.nutch.fetcher.FetcherOutputFormat cannot be resolved    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 44    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseImpl cannot be resolved    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 50    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseImpl cannot be resolved    BasicFields.java    /nutchbase/src/java/org/apache/nutch/indexer/field    line 61    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseImpl cannot be resolved    ExtParser.java    /nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext    line 26    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseImpl cannot be resolved    MSBaseParser.java    /nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms    line 39    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseImpl cannot be resolved    PdfParser.java    /nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf    line 41    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseImpl cannot be resolved    RSSParser.java    /nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss    line 41    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseImpl cannot be resolved    TestExtParser.java    /nutchbase/src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext    line 26    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseImpl cannot be resolved    TestIndexingFilters.java    /nutchbase/src/test/org/apache/nutch/indexer    line 26    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseImpl cannot be resolved    TestMSWordParser.java    /nutchbase/src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword    line 26    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseImpl cannot be resolved    TestMoreIndexingFilter.java    /nutchbase/src/plugin/index-more/src/test/org/apache/nutch/indexer/more    line 29    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseImpl cannot be resolved    TestZipParser.java    /nutchbase/src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip    line 26    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseImpl cannot be resolved    ZipParser.java    /nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line 33    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseImpl cannot be resolved    ZipTextExtractor.java    /nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line 41    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseResult cannot be resolved    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 51    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseResult cannot be resolved    ExtParser.java    /nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext    line 21    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseResult cannot be resolved    FeedParser.java    /nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line 43    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseResult cannot be resolved    HTMLLanguageParser.java    /nutchbase/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang    line 33    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseResult cannot be resolved    MSBaseParser.java    /nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms    line 40    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseResult cannot be resolved    MSExcelParser.java    /nutchbase/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel    line 20    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseResult cannot be resolved    MSPowerPointParser.java    /nutchbase/src/plugin/parse-mspowerpoint/src/java/org/apache/nutch/parse/mspowerpoint    line 20    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseResult cannot be resolved    MSWordParser.java    /nutchbase/src/plugin/parse-msword/src/java/org/apache/nutch/parse/msword    line 21    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseResult cannot be resolved    PdfParser.java    /nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf    line 37    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseResult cannot be resolved    RSSParser.java    /nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss    line 36    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseResult cannot be resolved    RelTagParser.java    /nutchbase/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag    line 38    Java Problem&lt;br&gt;
The import org.apache.nutch.parse.ParseResult cannot be resolved    TestFeedParser.java    /nutchbase/src/plugin/feed/src/test/org/apache/nutch/parse/feed    line 32    Java Problem&lt;br&gt;The import org.apache.nutch.parse.ParseResult cannot be resolved    ZipParser.java    /nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line 34    Java Problem&lt;br&gt;
The method calculate(WebTableRow, Parse) in the type Signature is not applicable for the arguments (Content, Parse)    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 187    Java Problem&lt;br&gt;
The method calculate(WebTableRow, Parse) in the type Signature is not applicable for the arguments (Content, Parse)    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 208    Java Problem&lt;br&gt;
The method fetch(String, int, boolean) from the type Fetcher is not visible    TestFetcher.java    /nutchbase/src/test/org/apache/nutch/fetcher    line 178    Java Problem&lt;br&gt;The method fetch(String, int, boolean) in the type Fetcher is not applicable for the arguments (Path, int, boolean)    TestFetcher.java    /nutchbase/src/test/org/apache/nutch/fetcher    line 101    Java Problem&lt;br&gt;
The method generate(String, long, long, boolean) in the type Generator is not applicable for the arguments (Path, Path, int, int, long, boolean, boolean)    TestGenerator.java    /nutchbase/src/test/org/apache/nutch/crawl    line 313    Java Problem&lt;br&gt;
The method generate(String, long, long, boolean) in the type Generator is not applicable for the arguments (Path, Path, int, long, long, boolean, boolean)    TestFetcher.java    /nutchbase/src/test/org/apache/nutch/fetcher    line 95    Java Problem&lt;br&gt;
The method getData() is undefined for the type Parse    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 200    Java Problem&lt;br&gt;The method getData() is undefined for the type Parse    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 211    Java Problem&lt;br&gt;
The method getData() is undefined for the type Parse    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 213    Java Problem&lt;br&gt;The method getData() is undefined for the type Parse    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 216    Java Problem&lt;br&gt;
The method getData() is undefined for the type Parse    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 230    Java Problem&lt;br&gt;The method getData() is undefined for the type Parse    ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc    line 244    Java Problem&lt;br&gt;
The method getData() is undefined for the type Parse    BasicFields.java    /nutchbase/src/java/org/apache/nutch/indexer/field    line 386    Java Problem&lt;br&gt;The method getData() is undefined for the type Parse    BasicFields.java    /nutchbase/src/java/org/apache/nutch/indexer/field    line 395    Java Problem&lt;br&gt;
The method getData() is undefined for the type Parse    CCIndexingFilter.java    /nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch    line 55    Java Problem&lt;br&gt;The method getData() is undefined for the type Parse    CCParseFilter.java    /nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch    line 280    Java Problem&lt;br&gt;
The method getData() is undefined for the type Parse    CCParseFilter.java    /nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch    line 286    Java Problem&lt;br&gt;The method getData() is undefined for the type Parse    CCParseFilter.java    /nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch    line 291    Java Problem&lt;br&gt;
The method getData() is undefined for the type Parse    FeedIndexingFilter.java    /nutchbase/src/plugin/feed/src/java/org/apache/nutch/indexer/feed    line 76    Java Problem&lt;br&gt;&lt;br&gt;
&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---Dev-f373.html&quot; embed=&quot;fixTarget[373]&quot; target=&quot;_top&quot; &gt;Nutch - Dev&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/State-of-nutchbase-tp26656318p26656318.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26656109</id>
	<title>HTTP Header problem</title>
	<published>2009-12-05T06:29:20Z</published>
	<updated>2009-12-05T06:29:20Z</updated>
	<author>
		<name>Kirk Gillock</name>
	</author>
	<content type="html">Hi fellow Nutch users.
&lt;br&gt;&lt;br&gt;Long time crawler, first time poster. :-)
&lt;br&gt;&lt;br&gt;We're 23m pages into a 100m page crawl and our preliminary tests have 
&lt;br&gt;shown that a lot of pages contain our agent name, description, etc., in 
&lt;br&gt;their page content. Meaning, sites that have a script which shows HTTP 
&lt;br&gt;headers (typically to show browser information) cause the Nutch crawler 
&lt;br&gt;to store its own header information within the content of that page. So 
&lt;br&gt;when we search our index for &amp;quot;Isara&amp;quot; (our agent name) we get thousands 
&lt;br&gt;of results and they all have &amp;quot;Isara/Isara-1.0 (A non-profit search 
&lt;br&gt;engine benefiting charity.; &lt;a href=&quot;http://www.isara.org;&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.isara.org;&lt;/a&gt;&amp;nbsp;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26656109&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;e-mail@...&lt;/a&gt;&amp;quot;, 
&lt;br&gt;which is the content of our nutch-default.xml file: http.agent.name, 
&lt;br&gt;http.agent.description, http.agent.url, http.agent.email, and 
&lt;br&gt;http.agent.version .
&lt;br&gt;&lt;br&gt;I've searched around and haven't found any information on how to stop 
&lt;br&gt;this from happening. Is there a solution and, if so, will it mean we 
&lt;br&gt;need to recrawl all those pages again or can we filter the current 
&lt;br&gt;database? Any suggestions would be greatly appreciated.
&lt;br&gt;&lt;br&gt;Thank you for developing such an important open-source application,
&lt;br&gt;Kirk Gillock
&lt;br&gt;Isara Charity Foundation
&lt;br&gt;Nong Khai, Thailand
&lt;br&gt;&lt;a href=&quot;http://www.isara.org&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.isara.org&lt;/a&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---Agent-f372.html&quot; embed=&quot;fixTarget[372]&quot; target=&quot;_top&quot; &gt;Nutch - Agent&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/HTTP-Header-problem-tp26656109p26656109.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26656071</id>
	<title>[jira] Updated: (NUTCH-767) Update Tika to v0.5  for the MimeType detection</title>
	<published>2009-12-05T06:25:20Z</published>
	<updated>2009-12-05T06:25:20Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Julien Nioche updated NUTCH-767:
&lt;br&gt;--------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: NUTCH-767-part3.patch
&lt;br&gt;&lt;br&gt;The problems with the test come from the fact that Tika's content-based detection of MIME types returns &amp;quot;text/plain&amp;quot; when no MIME type can be identified, e.g. in our case because we have an empty byte array as content.
&lt;br&gt;&lt;br&gt;Tika's MimeTypes used to have a default value which was used in MimeUtil to determine when to use the type guessed by Tika, but it has since been removed. The best course of action is probably to take Tika's guess into account only if it is not &amp;quot;text/plain&amp;quot; or &amp;quot;application/octet-stream&amp;quot;, which is what this patch implements.
&lt;br&gt;&lt;br&gt;The expected MIME types in the test class are set to their original values (pre-patch v2), apart from the one which used Tika's default MIME type.
&lt;br&gt;&lt;br&gt;J.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Update Tika to v0.5 &amp;nbsp;for the MimeType detection
&lt;br&gt;&amp;gt; -----------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-767
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-767&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-767&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-767-part2.patch, NUTCH-767-part3.patch, NUTCH-767.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; Original Estimate: 0h
&lt;br&gt;&amp;gt; &amp;nbsp;Remaining Estimate: 0h
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split into several jars, so we need to place tika-core.jar in the main Nutch lib.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---Dev-f373.html&quot; embed=&quot;fixTarget[373]&quot; target=&quot;_top&quot; &gt;Nutch - Dev&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-767%29-Update-version-of-Tika-for-the-MimeType-detection-tp26409413p26656071.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26656014</id>
	<title>[jira] Reopened: (NUTCH-767) Update Tika to v0.5  for the MimeType detection</title>
	<published>2009-12-05T06:19:20Z</published>
	<updated>2009-12-05T06:19:20Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Julien Nioche reopened NUTCH-767:
&lt;br&gt;---------------------------------
&lt;br&gt;&lt;br&gt;&lt;br&gt;The problem with the test class has been investigated. I am reopening the issue so that we can mark it as definitively fixed.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Update Tika to v0.5 &amp;nbsp;for the MimeType detection
&lt;br&gt;&amp;gt; -----------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-767
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-767&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-767&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-767-part2.patch, NUTCH-767.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; Original Estimate: 0h
&lt;br&gt;&amp;gt; &amp;nbsp;Remaining Estimate: 0h
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split into several jars, so we need to place tika-core.jar in the main Nutch lib.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---Dev-f373.html&quot; embed=&quot;fixTarget[373]&quot; target=&quot;_top&quot; &gt;Nutch - Dev&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-767%29-Update-version-of-Tika-for-the-MimeType-detection-tp26409413p26656014.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26655130</id>
	<title>Re: Fetch failing ?</title>
	<published>2009-12-05T04:17:45Z</published>
	<updated>2009-12-05T04:17:45Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">Thx again Julien,
&lt;br&gt;&lt;br&gt;Yes, I'm going to buy myself the Hadoop book. I thought I could do
&lt;br&gt;without it, but I realize that I need to make good use of Hadoop.
&lt;br&gt;&lt;br&gt;Didn't know you could split fetching &amp; parsing: &amp;nbsp;so I suppose you just issue
&lt;br&gt;nutch fetch &amp;lt;segment&amp;gt; -noParsing, followed by nutch parse &amp;lt;segment&amp;gt;. I will
&lt;br&gt;try on my next run.
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;2009/12/5 Julien Nioche &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26655130&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;lists.digitalpebble@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
&lt;br&gt;&amp;gt; does NOT affect the memory used for the map/ reduce jobs. Maybe you should
&lt;br&gt;&amp;gt; invest a bit of time reading about Hadoop first?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; As for your memory problem it could be due to the parsing and not the
&lt;br&gt;&amp;gt; fetching. If you don't already do so I suggest that you separate the
&lt;br&gt;&amp;gt; fetching from the parsing. First that will tell you which part fails + if
&lt;br&gt;&amp;gt; it
&lt;br&gt;&amp;gt; does fail in the parsing then you would not need to refetch the content
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; J.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009/12/5 MilleBii &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26655130&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;millebii@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; My fetch cycle failed on the following initial error :
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; java.io.IOException: Task process exit with nonzero status of 65.
&lt;br&gt;&amp;gt; &amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Then it makes a second attempt, and after 3 hours I hit this error
&lt;br&gt;&amp;gt; &amp;gt; (although I had doubled HADOOP_HEAPSIZE):
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; java.lang.OutOfMemoryError: GC overhead limit exceeded
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Any idea what the initial error is or could be ?
&lt;br&gt;&amp;gt; &amp;gt; For the second one, I'm going to reduce number of threads... but I'm
&lt;br&gt;&amp;gt; &amp;gt; wondering if there could be a memory leak ? And I don't know how to trace
&lt;br&gt;&amp;gt; that.
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; --
&lt;br&gt;&amp;gt; &amp;gt; -MilleBii-
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; DigitalPebble Ltd
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Fetch-failing---tp26653874p26655130.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26654970</id>
	<title>Re: Fetch failing ?</title>
	<published>2009-12-05T03:56:50Z</published>
	<updated>2009-12-05T03:56:50Z</updated>
	<author>
		<name>Julien Nioche-4</name>
	</author>
	<content type="html">HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
&lt;br&gt;does NOT affect the memory used for the map/ reduce jobs. Maybe you should
&lt;br&gt;invest a bit of time reading about Hadoop first?
&lt;br&gt;&lt;br&gt;As for your memory problem it could be due to the parsing and not the
&lt;br&gt;fetching. If you don't already do so I suggest that you separate the
&lt;br&gt;fetching from the parsing. First, that will tell you which part fails; and if it
&lt;br&gt;does fail in the parsing, then you would not need to refetch the content.
&lt;br&gt;&lt;br&gt;J.
&lt;br&gt;&lt;br&gt;2009/12/5 MilleBii &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26654970&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;millebii@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; My fetch cycle failed on the following initial error :
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; java.io.IOException: Task process exit with nonzero status of 65.
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Then it makes a second attempt, and after 3 hours I hit this error
&lt;br&gt;&amp;gt; (although I had doubled HADOOP_HEAPSIZE):
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; java.lang.OutOfMemoryError: GC overhead limit exceeded
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Any idea what the initial error is or could be ?
&lt;br&gt;&amp;gt; For the second one, I'm going to reduce number of threads... but I'm
&lt;br&gt;&amp;gt; wondering if there could be a memory leak ? And I don't know how to trace that.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; -MilleBii-
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;DigitalPebble Ltd
&lt;br&gt;&lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Fetch-failing---tp26653874p26654970.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26653874</id>
	<title>Fetch failing ?</title>
	<published>2009-12-05T00:50:34Z</published>
	<updated>2009-12-05T00:50:34Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">My fetch cycle failed on the following initial error :
&lt;br&gt;&lt;br&gt;java.io.IOException: Task process exit with nonzero status of 65.
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
&lt;br&gt;&lt;br&gt;Then it makes a second attempt, and after 3 hours I hit this error
&lt;br&gt;(although I had doubled HADOOP_HEAPSIZE):
&lt;br&gt;&lt;br&gt;java.lang.OutOfMemoryError: GC overhead limit exceeded
&lt;br&gt;&lt;br&gt;&lt;br&gt;Any idea what the initial error is or could be?
&lt;br&gt;For the second one, I'm going to reduce the number of threads... but I'm
&lt;br&gt;wondering if there could be a memory leak? And I don't know how to trace that.
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Fetch-failing---tp26653874p26653874.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26653828</id>
	<title>Re: How to drop page content at fetch stages ?</title>
	<published>2009-12-05T00:42:11Z</published>
	<updated>2009-12-05T00:42:11Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">Thanks, but that is a bit too complex for me right now. I don't yet fully understand this
&lt;br&gt;map/reduce technique.
&lt;br&gt;But I'll keep the idea for future development.
&lt;br&gt;&lt;br&gt;2009/12/4 Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26653828&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Sorry, segments, not indexes.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Dennis Kubes wrote:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; You would need to write a custom MapReduce job to run through the indexes
&lt;br&gt;&amp;gt;&amp;gt; and keep only the ones identified by your plugin. &amp;nbsp;Be sure to update the
&lt;br&gt;&amp;gt;&amp;gt; CrawlDb with the extracted URLs before you drop the content from the
&lt;br&gt;&amp;gt;&amp;gt; segments.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Hi guys,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; I'm looking at whether I can reduce the disk space occupied by my segments.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; I have implemented a topical-scoring plugin... this means I know at that
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; step whether I should keep a page's content or not.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Is there a way to drop some pages' content after parsing it, but of course
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; keep the links because I want to follow the graph?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; PS: Prune is not an option for me because it only cleans up the indexes, not
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; the segments, and my indexer does that clean-up very well.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-to-drop-page-content-at-fetch-stages---tp26650126p26653828.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26653825</id>
	<title>Nutch - create my own repository</title>
	<published>2009-12-05T00:41:02Z</published>
	<updated>2009-12-05T00:41:02Z</updated>
	<author>
		<name>zzeran</name>
	</author>
	<content type="html">Hi,
&lt;br&gt;&lt;br&gt;I'm developing my own set of tools, plugins and some minor code changes to
&lt;br&gt;Nutch.
&lt;br&gt;&lt;br&gt;I still want to get updates from the main Nutch repository, but I would
&lt;br&gt;like to keep my own SVN for tracking my local code changes.
&lt;br&gt;&lt;br&gt;I'm using plain command-line SVN (I have no experience with Git) to track my
&lt;br&gt;changes.
&lt;br&gt;&lt;br&gt;My question is: can I create a branch from the main repository in my own
&lt;br&gt;repository, which will track only my changes and keep getting updates from
&lt;br&gt;the main Nutch repository with an easy merge?
&lt;br&gt;&lt;br&gt;Thanks,
&lt;br&gt;Eran
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch---create-my-own-repository-tp26653825p26653825.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26653645</id>
	<title>Re: unsubscribe from nutch-user</title>
	<published>2009-12-04T23:59:55Z</published>
	<updated>2009-12-04T23:59:55Z</updated>
	<author>
		<name>M S Ram</name>
	</author>
	<content type="html">I did it many times. But I am still receiving these mails.
&lt;br&gt;&lt;br&gt;prashant ullegaddi wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Take a look at it:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://lucene.apache.org/nutch/mailing_lists.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/nutch/mailing_lists.html&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; or
&lt;br&gt;&amp;gt; probably sending a blank mail to:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26653645&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user-unsubscribe@...&lt;/a&gt; also
&lt;br&gt;&amp;gt; work.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Thanks,
&lt;br&gt;&amp;gt; Prashant.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; On Fri, Dec 4, 2009 at 8:30 PM, M S Ram &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26653645&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;msram@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;br&gt;&amp;gt;&amp;gt; Same here. Please remove my ID also from the mailing list.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Thanks,
&lt;br&gt;&amp;gt;&amp;gt; MSR
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; rengan xu wrote:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; To whom it may concern,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Hello! Because I will use this E-mail for special purpose. I will use
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; another E-mail to subscribe in nutch-user. So I want to unsubscribe from
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; nutch-user.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Thank you!
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/unsubscribe-from-nutch-user-tp26643628p26653645.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26653542</id>
	<title>Nutch image extraction</title>
	<published>2009-12-04T23:36:18Z</published>
	<updated>2009-12-04T23:36:18Z</updated>
	<author>
		<name>manishkbawne</name>
	</author>
	<content type="html">Hi, 
&lt;br&gt;I am using Nutch to crawl data from the web. Now I want to extract images using Nutch. Can somebody please suggest a way to do that, or point me to a URL?
&lt;br&gt;&lt;br&gt;Regards,
&lt;br&gt;Manish Bawne
&lt;br&gt;Software Engineer
&lt;br&gt;Biz Integra Systems Pvt Ltd
&lt;br&gt;&lt;a href=&quot;http://www.bizhandel.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.bizhandel.com&lt;/a&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch-image-extraction-tp26653542p26653542.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26653021</id>
	<title>Hudson build is back to normal: Nutch-trunk #1002</title>
	<published>2009-12-04T20:54:23Z</published>
	<updated>2009-12-04T20:54:23Z</updated>
	<author>
		<name>Apache Hudson Server</name>
	</author>
	<content type="html">See &amp;lt;&lt;a href=&quot;http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1002/changes&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1002/changes&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---Dev-f373.html&quot; embed=&quot;fixTarget[373]&quot; target=&quot;_top&quot; &gt;Nutch - Dev&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Build-failed-in-Hudson%3A-Nutch-trunk--998-tp26586289p26653021.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26653020</id>
	<title>[jira] Commented: (NUTCH-767) Update Tika to v0.5  for the MimeType detection</title>
	<published>2009-12-04T20:54:20Z</published>
	<updated>2009-12-04T20:54:20Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12786339#action_12786339&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12786339#action_12786339&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Hudson commented on NUTCH-767:
&lt;br&gt;------------------------------
&lt;br&gt;&lt;br&gt;Integrated in Nutch-trunk #1002 (See [&lt;a href=&quot;http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1002/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1002/&lt;/a&gt;])
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;Fix a failing test - still needs more work.
&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Update Tika to v0.5 &amp;nbsp;for the MimeType detection
&lt;br&gt;&amp;gt; -----------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-767
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-767&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-767&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Andrzej Bialecki 
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 1.1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-767-part2.patch, NUTCH-767.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; Original Estimate: 0h
&lt;br&gt;&amp;gt; &amp;nbsp;Remaining Estimate: 0h
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split into several jars, so we need to place tika-core.jar in the main Nutch lib.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---Dev-f373.html&quot; embed=&quot;fixTarget[373]&quot; target=&quot;_top&quot; &gt;Nutch - Dev&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-767%29-Update-version-of-Tika-for-the-MimeType-detection-tp26409413p26653020.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26650632</id>
	<title>Re: How to drop page content at fetch stages ?</title>
	<published>2009-12-04T14:55:50Z</published>
	<updated>2009-12-04T14:55:50Z</updated>
	<author>
		<name>Dennis Kubes-2</name>
	</author>
	<content type="html">Sorry, segments, not indexes.
&lt;br&gt;&lt;br&gt;Dennis Kubes wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; You would need to write a custom MapReduce job to run through the 
&lt;br&gt;&amp;gt; indexes and keep only the ones identified by your plugin. &amp;nbsp;Be sure to
&lt;br&gt;&amp;gt; update the CrawlDb with the extracted URLs before you drop the content
&lt;br&gt;&amp;gt; from the segments.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt;&amp;gt; Hi guys,
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I'm looking at whether I can reduce the disk space occupied by my segments.
&lt;br&gt;&amp;gt;&amp;gt; I have implemented a topical-scoring plugin... this means I know at that
&lt;br&gt;&amp;gt;&amp;gt; step whether I should keep a page's content or not.
&lt;br&gt;&amp;gt;&amp;gt; Is there a way to drop some pages' content after parsing it, but of course
&lt;br&gt;&amp;gt;&amp;gt; keep the links because I want to follow the graph?
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; PS: Prune is not an option for me because it only cleans up the indexes, not
&lt;br&gt;&amp;gt;&amp;gt; the segments, and my indexer does that clean-up very well.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&lt;/div&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-to-drop-page-content-at-fetch-stages---tp26650126p26650632.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26650506</id>
	<title>Re: How to drop page content at fetch stages ?</title>
	<published>2009-12-04T14:47:31Z</published>
	<updated>2009-12-04T14:47:31Z</updated>
	<author>
		<name>Dennis Kubes-2</name>
	</author>
	<content type="html">You would need to write a custom MapReduce job to run through the 
&lt;br&gt;indexes and keep only the ones identified by your plugin. &amp;nbsp;Be sure to
&lt;br&gt;update the CrawlDb with the extracted URLs before you drop the content
&lt;br&gt;from the segments.
&lt;br&gt;&lt;br&gt;Dennis
&lt;br&gt;&lt;br&gt;MilleBii wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hi guys,
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I'm looking at whether I can reduce the disk space occupied by my segments.
&lt;br&gt;&amp;gt; I have implemented a topical-scoring plugin... this means I know at that
&lt;br&gt;&amp;gt; step whether I should keep a page's content or not.
&lt;br&gt;&amp;gt; Is there a way to drop some pages' content after parsing it, but of course
&lt;br&gt;&amp;gt; keep the links because I want to follow the graph?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; PS: Prune is not an option for me because it only cleans up the indexes, not the
&lt;br&gt;&amp;gt; segments, and my indexer does that clean-up very well.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&lt;/div&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-to-drop-page-content-at-fetch-stages---tp26650126p26650506.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26650126</id>
	<title>How to drop page content at fetch stages ?</title>
	<published>2009-12-04T14:18:23Z</published>
	<updated>2009-12-04T14:18:23Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">Hi guys,
&lt;br&gt;&lt;br&gt;I'm looking at whether I can reduce the disk space occupied by my segments.
&lt;br&gt;I have implemented a topical-scoring plugin... this means I know at that
&lt;br&gt;step whether I should keep a page's content or not.
&lt;br&gt;Is there a way to drop some pages' content after parsing it, but of course
&lt;br&gt;keep the links because I want to follow the graph?
&lt;br&gt;&lt;br&gt;PS: Prune is not an option for me because it only cleans up the indexes, not the
&lt;br&gt;segments, and my indexer does that clean-up very well.
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-to-drop-page-content-at-fetch-stages---tp26650126p26650126.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26648656</id>
	<title>Re: What is the best choice: nutch/lucene or nutch/solr?</title>
	<published>2009-12-04T12:20:26Z</published>
	<updated>2009-12-04T12:20:26Z</updated>
	<author>
		<name>Otis Gospodnetic-2</name>
	</author>
	<content type="html">Sounds like Nutch for crawling to gather the data, then custom tools to read the gathered data, call the KV store, construct SolrInputDocuments, and index those to Solr. &amp;nbsp;That is if you want Solr and not Lucene, which is a bigger question that I can't answer without knowing the details.
&lt;br&gt;&lt;br&gt;&amp;nbsp;Otis
&lt;br&gt;--
&lt;br&gt;Sematext -- &lt;a href=&quot;http://sematext.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://sematext.com/&lt;/a&gt;&amp;nbsp;-- Solr - Lucene - Nutch
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;----- Original Message ----
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; From: Mr Hadoop &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26648656&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mrhadoop@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt; To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26648656&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;&amp;gt; Sent: Fri, December 4, 2009 2:51:47 PM
&lt;br&gt;&amp;gt; Subject: What is the best choice: nutch/lucene or nutch/solr?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I have been going over the mailing list and still haven't found an answer.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; For a project, I need to crawl the web, index it and merge that content with
&lt;br&gt;&amp;gt; another site's content, which is stored in a key-value storage system.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; What is the best approach: merge these two sets of content into a Lucene index
&lt;br&gt;&amp;gt; or a Solr index, or keep the indexes separate and merge the results at query
&lt;br&gt;&amp;gt; time?
&lt;/div&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/What-is-the-best-choice%3A-nutch-lucene-or-nutch-solr--tp26648251p26648656.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26648251</id>
	<title>What is the best choice: nutch/lucene or nutch/solr?</title>
	<published>2009-12-04T11:51:47Z</published>
	<updated>2009-12-04T11:51:47Z</updated>
	<author>
		<name>clusterboy</name>
	</author>
	<content type="html">I have been going over the mailing list and still haven't found an answer.
&lt;br&gt;&lt;br&gt;For a project, I need to crawl the web, index it and merge that content with
&lt;br&gt;another site's content, which is stored in a key-value storage system.
&lt;br&gt;&lt;br&gt;What is the best approach: merge these two sets of content into a Lucene index
&lt;br&gt;or a Solr index, or keep the indexes separate and merge the results at query
&lt;br&gt;time?
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/What-is-the-best-choice%3A-nutch-lucene-or-nutch-solr--tp26648251p26648251.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26647740</id>
	<title>RE: Problems with a new Installation of Nutch</title>
	<published>2009-12-04T11:15:57Z</published>
	<updated>2009-12-04T11:15:57Z</updated>
	<author>
		<name>Tom Landvoigt</name>
	</author>
	<content type="html">Hi,
&lt;br&gt;&lt;br&gt;Does anyone know which packages I have to install on SUSE to get Nutch running?
&lt;br&gt;&lt;br&gt;I have another installation of Nutch where everything is fine, so I copied the whole installation. That machine also runs SUSE Linux, but it is 64-bit and I didn't install it myself.
&lt;br&gt;&lt;br&gt;But I get the same problem.
&lt;br&gt;So far I have installed the following packages:
&lt;br&gt;Tomcat 6
&lt;br&gt;Openjdk devel 1.6
&lt;br&gt;Sun java devel 1.6
&lt;br&gt;Ant 1.7
&lt;br&gt;&lt;br&gt;That is enough for today.
&lt;br&gt;&lt;br&gt;Hope someone can help.
&lt;br&gt;&lt;br&gt;Tom
&lt;br&gt;&lt;br&gt;-----Original Message-----
&lt;br&gt;From: MilleBii [mailto:&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26647740&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;millebii@...&lt;/a&gt;] 
&lt;br&gt;Sent: Freitag, 4. Dezember 2009 17:31
&lt;br&gt;To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26647740&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;Subject: Re: Problems with a new Installation of Nutch
&lt;br&gt;&lt;br&gt;I'm not aware that Hadoop uses Tomcat... I think it uses Jetty
&lt;br&gt;instead. The nodes communicate via HTTP, so you need some kind of web
&lt;br&gt;server... And for monitoring it's the best way.
&lt;br&gt;&lt;br&gt;2009/12/4, Tom Landvoigt &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26647740&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;tom.landvoigt@...&lt;/a&gt;&amp;gt;:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hi,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I don't have Tomcat on this system because I don't want to use the
&lt;br&gt;&amp;gt; web search. But if it is necessary for Hadoop, which I don't think it is, I will
&lt;br&gt;&amp;gt; install it.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; nutch@ip-10-224-113-210:/nutch/search&amp;gt; ./bin/hadoop fs -ls /
&lt;br&gt;&amp;gt; Found 1 items
&lt;br&gt;&amp;gt; -rw-r--r-- &amp;nbsp; 2 nutch supergroup &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 2009-12-04 14:04 /url.txt
&lt;br&gt;&amp;gt; nutch@ip-10-224-113-210:/nutch/search&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I get the normal answer but the file is empty.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; -----Original Message-----
&lt;br&gt;&amp;gt; From: MilleBii [mailto:&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26647740&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;millebii@...&lt;/a&gt;]
&lt;br&gt;&amp;gt; Sent: Freitag, 4. Dezember 2009 15:06
&lt;br&gt;&amp;gt; To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26647740&amp;i=4&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;&amp;gt; Subject: Re: Problems with a new Installation of Nutch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Did you check with the web interface? It gives a lot of info; you can
&lt;br&gt;&amp;gt; even browse the file system.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Try hadoop fs -ls to see what it gives you.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009/12/4, Tom Landvoigt &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26647740&amp;i=5&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;tom.landvoigt@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt;&amp;gt; Hallo,
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I hope someone can help me.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I installed nutch on 2 Amazon EC2 computers. Everything is fine but I
&lt;br&gt;&amp;gt;&amp;gt; can't put data in the hdfs.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I formatted the namenode and started HDFS with start all.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; All Java processes start properly, but when I run hadoop fs
&lt;br&gt;&amp;gt;&amp;gt; -put something / I get these logs:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; nutch@bla:/nutch/search&amp;gt; ./bin/hadoop fs -put
&lt;br&gt;&amp;gt;&amp;gt; /tmp/hadoop-nutch-tasktracker.pid blub
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; put: Protocol not available
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; DATA NODE LOG on the master:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:15,566 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:15,582 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:16,483 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:16,614 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:16,882 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@1284fd4
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:16,883 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:17,827 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@39c8c1
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:17,849 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:18,485 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt;&amp;gt; SocketListener on 0.0.0.0:50075
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:18,485 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.Server@36527f
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,745 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,746 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; NAME NODE LOG
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:11,539 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:11,573 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:12,488 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@19fe451
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:12,565 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:12,891 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@1570945
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:12,891 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:13,569 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@11410e5
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:13,582 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:13,613 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt;&amp;gt; SocketListener on 0.0.0.0:50070
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:13,613 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.Server@173ec72
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; SECONDARY NAMENODE LOG
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:19,163 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:19,207 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:20,365 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@174d93a
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:20,454 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:21,396 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@31f2a7
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:21,396 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:21,533 INFO &amp;nbsp;servlet.XMLConfiguration - No
&lt;br&gt;&amp;gt;&amp;gt; WEB-INF/web.xml in file:/mnt/nutch/nutch-1.0/webapps/secondary. Serving
&lt;br&gt;&amp;gt;&amp;gt; files and default/dynamic servlets only
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,206 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@383118
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,785 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,787 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt;&amp;gt; SocketListener on 0.0.0.0:50090
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,787 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.Server@297ffb
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,787 WARN &amp;nbsp;namenode.SecondaryNameNode - Checkpoint
&lt;br&gt;&amp;gt;&amp;gt; Period &amp;nbsp; :3600 secs (60 min)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,787 WARN &amp;nbsp;namenode.SecondaryNameNode - Log Size
&lt;br&gt;&amp;gt;&amp;gt; Trigger &amp;nbsp; &amp;nbsp;:67108864 bytes (65536 KB)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:55:23,908 WARN &amp;nbsp;namenode.SecondaryNameNode - Checkpoint
&lt;br&gt;&amp;gt;&amp;gt; done. New Image Size: 1056
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; HADOOP LOG
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,708 WARN &amp;nbsp;hdfs.DFSClient - DataStreamer Exception:
&lt;br&gt;&amp;gt;&amp;gt; java.io.IOException: Unable to create new block.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,709 WARN &amp;nbsp;hdfs.DFSClient - Error Recovery for block
&lt;br&gt;&amp;gt;&amp;gt; blk_5506837520665828594_1002 bad datanode[0] nodes == null
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,709 WARN &amp;nbsp;hdfs.DFSClient - Could not get block
&lt;br&gt;&amp;gt;&amp;gt; locations. Source file &amp;quot;/user/nutch/blub/hadoop-nutch-tasktracker.pid&amp;quot; -
&lt;br&gt;&amp;gt;&amp;gt; Aborting...
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; DATA NODE LOG on the slave
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:49,433 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:49,438 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,288 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,357 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,555 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@2016b0
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,555 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,816 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@118278a
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,820 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,849 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt;&amp;gt; SocketListener on 0.0.0.0:50075
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,849 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.Server@b02928
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; HADOOP SITE XML
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;hdfs://(yes here is the right ip):9000&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; The name of the default file system. Either the literal string
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;quot;local&amp;quot; or a host:port for NDFS.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!-- Specifies where the JobTracker (which coordinates the MapReduce jobs)
&lt;br&gt;&amp;gt;&amp;gt; can be found. --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.job.tracker&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;hdfs://(here too):9001&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; The host and port that the MapReduce job tracker runs at. If
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;quot;local&amp;quot;, then jobs are run in-process as a single map and
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; reduce task.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!-- Specifies how many map jobs may run concurrently --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.tasktracker.map.tasks.maximum&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; define mapred.map tasks to be number of slave hosts
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!-- Specifies how many reduce jobs may run concurrently --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.tasktracker.reduce.tasks.maximum&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; define mapred.reduce tasks to be number of slave hosts
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.child.java.opts&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;-Xmx1500m&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.jobtracker.restart.recover&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!-- The next settings specify where the Hadoop file system stores its files
&lt;br&gt;&amp;gt;&amp;gt; on the disk of each instance. --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.name.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/name&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.data.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/data&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.system.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/mapreduce/system&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.local.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/mapreduce/local&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!-- Specifies how many replicas of a file must exist in the file system
&lt;br&gt;&amp;gt;&amp;gt; for it to be available. Initially 1 --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.replication&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I hope someone can help me.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Thanks
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Tom
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; -MilleBii-
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Problems-with-a-new-Installation-of-Nutch-tp26641822p26647740.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26645189</id>
	<title>Re: Problems with a new Installation of Nutch</title>
	<published>2009-12-04T08:30:58Z</published>
	<updated>2009-12-04T08:30:58Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">As far as I know, Hadoop doesn't use Tomcat; I think it uses Jetty
&lt;br&gt;instead. The nodes communicate via HTTP, so you need some kind of web
&lt;br&gt;server. It's also the best way to monitor the cluster.
&lt;br&gt;&lt;br&gt;2009/12/4, Tom Landvoigt &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26645189&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;tom.landvoigt@...&lt;/a&gt;&amp;gt;:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hi,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I don't have Tomcat on this system because I don't want to use the
&lt;br&gt;&amp;gt; web search. If it is necessary for Hadoop, which I don't think it is,
&lt;br&gt;&amp;gt; I will install it.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; nutch@ip-10-224-113-210:/nutch/search&amp;gt; ./bin/hadoop fs -ls /
&lt;br&gt;&amp;gt; Found 1 items
&lt;br&gt;&amp;gt; -rw-r--r-- &amp;nbsp; 2 nutch supergroup &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 2009-12-04 14:04 /url.txt
&lt;br&gt;&amp;gt; nutch@ip-10-224-113-210:/nutch/search&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I get the normal answer but the file is empty.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; -----Original Message-----
&lt;br&gt;&amp;gt; From: MilleBii [mailto:&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26645189&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;millebii@...&lt;/a&gt;]
&lt;br&gt;&amp;gt; Sent: Friday, 4 December 2009 15:06
&lt;br&gt;&amp;gt; To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26645189&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;&amp;gt; Subject: Re: Problems with a new Installation of Nutch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Did you check the web interface? It gives a lot of info, and you can
&lt;br&gt;&amp;gt; even browse the file system.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Try hadoop fs -ls and see what it gives you.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009/12/4, Tom Landvoigt &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26645189&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;tom.landvoigt@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt;&amp;gt; Hello,
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I hope someone can help me.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I installed nutch on 2 Amazon EC2 computers. Everything is fine but I
&lt;br&gt;&amp;gt;&amp;gt; can't put data in the hdfs.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I formatted the namenode and started HDFS with start-all.sh.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; All Java processes start properly, but when I run hadoop fs
&lt;br&gt;&amp;gt;&amp;gt; -put something / I get these logs:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; nutch@bla:/nutch/search&amp;gt; ./bin/hadoop fs -put
&lt;br&gt;&amp;gt;&amp;gt; /tmp/hadoop-nutch-tasktracker.pid blub
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; put: Protocol not available
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; DATA NODE LOG on the master:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:15,566 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:15,582 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:16,483 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:16,614 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:16,882 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@1284fd4
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:16,883 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:17,827 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@39c8c1
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:17,849 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:18,485 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt;&amp;gt; SocketListener on 0.0.0.0:50075
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:18,485 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.Server@36527f
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,745 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,746 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; NAME NODE LOG
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:11,539 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:11,573 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:12,488 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@19fe451
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:12,565 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:12,891 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@1570945
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:12,891 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:13,569 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@11410e5
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:13,582 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:13,613 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt;&amp;gt; SocketListener on 0.0.0.0:50070
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:13,613 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.Server@173ec72
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; SECONDARY NAMENODE LOG
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:19,163 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:19,207 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:20,365 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@174d93a
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:20,454 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:21,396 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@31f2a7
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:21,396 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:21,533 INFO &amp;nbsp;servlet.XMLConfiguration - No
&lt;br&gt;&amp;gt;&amp;gt; WEB-INF/web.xml in file:/mnt/nutch/nutch-1.0/webapps/secondary. Serving
&lt;br&gt;&amp;gt;&amp;gt; files and default/dynamic servlets only
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,206 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@383118
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,785 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,787 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt;&amp;gt; SocketListener on 0.0.0.0:50090
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,787 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.Server@297ffb
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,787 WARN &amp;nbsp;namenode.SecondaryNameNode - Checkpoint
&lt;br&gt;&amp;gt;&amp;gt; Period &amp;nbsp; :3600 secs (60 min)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:50:22,787 WARN &amp;nbsp;namenode.SecondaryNameNode - Log Size
&lt;br&gt;&amp;gt;&amp;gt; Trigger &amp;nbsp; &amp;nbsp;:67108864 bytes (65536 KB)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:55:23,908 WARN &amp;nbsp;namenode.SecondaryNameNode - Checkpoint
&lt;br&gt;&amp;gt;&amp;gt; done. New Image Size: 1056
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; HADOOP LOG
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,708 WARN &amp;nbsp;hdfs.DFSClient - DataStreamer Exception:
&lt;br&gt;&amp;gt;&amp;gt; java.io.IOException: Unable to create new block.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,709 WARN &amp;nbsp;hdfs.DFSClient - Error Recovery for block
&lt;br&gt;&amp;gt;&amp;gt; blk_5506837520665828594_1002 bad datanode[0] nodes == null
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:54:20,709 WARN &amp;nbsp;hdfs.DFSClient - Could not get block
&lt;br&gt;&amp;gt;&amp;gt; locations. Source file &amp;quot;/user/nutch/blub/hadoop-nutch-tasktracker.pid&amp;quot; -
&lt;br&gt;&amp;gt;&amp;gt; Aborting...
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; DATA NODE LOG on the slave
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:49,433 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:49,438 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,288 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,357 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,555 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@2016b0
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,555 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,816 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@118278a
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,820 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,849 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt;&amp;gt; SocketListener on 0.0.0.0:50075
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-12-04 12:49:50,849 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt;&amp;gt; org.mortbay.jetty.Server@b02928
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; HADOOP SITE XML
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;hdfs://(yes here is the right ip):9000&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; The name of the default file system. Either the literal string
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;quot;local&amp;quot; or a host:port for NDFS.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!-- Specifies where the JobTracker (which coordinates the MapReduce jobs)
&lt;br&gt;&amp;gt;&amp;gt; can be found. --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.job.tracker&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;hdfs://(here to):9001&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; The host and port that the MapReduce job tracker runs at. If
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;quot;local&amp;quot;, then jobs are run in-process as a single map and
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; reduce task.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!-- Specifies how many map jobs may run at the same time --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.tasktracker.map.tasks.maximum&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; define mapred.map tasks to be number of slave hosts
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!-- Specifies how many reduce jobs may run at the same time --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.tasktracker.reduce.tasks.maximum&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; define mapred.reduce tasks to be number of slave hosts
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.child.java.opts&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;-Xmx1500m&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.jobtracker.restart.recover&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!-- The next settings specify where the Hadoop FS stores its files
&lt;br&gt;&amp;gt;&amp;gt; on the hard disk of each instance. --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.name.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/name&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.data.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/data&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.system.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/mapreduce/system&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.local.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/mapreduce/local&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!-- Specifies how many replicas of a file must exist in the file system
&lt;br&gt;&amp;gt;&amp;gt; for it to be reachable. Initially 1 --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.replication&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I hope someone can help me.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Thanks
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Tom
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; -MilleBii-
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Problems-with-a-new-Installation-of-Nutch-tp26641822p26645189.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26644288</id>
	<title>RE: How to force recrawl of everything</title>
	<published>2009-12-04T07:36:09Z</published>
	<updated>2009-12-04T07:36:09Z</updated>
	<author>
		<name>Peters, Vijaya</name>
	</author>
	<content type="html">&lt;br&gt;Running:
&lt;br&gt;bin/nutch readdb crawldb -url &amp;lt;url&amp;gt; I got the following exception.
&lt;br&gt;Also, how do I force a recrawl in Nutch 1.0?
&lt;br&gt;&lt;br&gt;&lt;br&gt;Exception in thread &amp;quot;main&amp;quot; java.lang.ArithmeticException: / by zero
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.lib.HashPartitioner.getPartition(HashPartitioner.java:32)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:104)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:380)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:386)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:511)
&lt;br&gt;&lt;br&gt;Vijaya Peters
&lt;br&gt;SRA International, Inc.
&lt;br&gt;4350 Fair Lakes Court North
&lt;br&gt;Room 4004
&lt;br&gt;Fairfax, VA &amp;nbsp;22033
&lt;br&gt;Tel: &amp;nbsp;703-502-1184
&lt;br&gt;&lt;br&gt;www.sra.com
&lt;br&gt;Named to FORTUNE's &amp;quot;100 Best Companies to Work For&amp;quot; list for 10
&lt;br&gt;consecutive years
&lt;br&gt;Please consider the environment before printing this e-mail
&lt;br&gt;This electronic message transmission contains information from SRA
&lt;br&gt;International, Inc. which may be confidential, privileged or
&lt;br&gt;proprietary. &amp;nbsp;The information is intended for the use of the individual
&lt;br&gt;or entity named above. &amp;nbsp;If you are not the intended recipient, be aware
&lt;br&gt;that any disclosure, copying, distribution, or use of the contents of
&lt;br&gt;this information is strictly prohibited. &amp;nbsp;If you have received this
&lt;br&gt;electronic information in error, please notify us immediately by
&lt;br&gt;telephone at 866-584-2143.
&lt;br&gt;-----Original Message-----
&lt;br&gt;From: reinhard schwab [mailto:&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26644288&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;reinhard.schwab@...&lt;/a&gt;] 
&lt;br&gt;Sent: Friday, December 04, 2009 8:32 AM
&lt;br&gt;To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26644288&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;Subject: Re: How to force recrawl of everything
&lt;br&gt;&lt;br&gt;Peters, Vijaya wrote:
&lt;br&gt;&amp;gt; I am using Nutch 1.0. &amp;nbsp;I want to perform a 'clean' crawl. &amp;nbsp;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I see the force option in this patch: &amp;nbsp;NUTCH-601v1.0.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;lt;&lt;a href=&quot;https://issues.apache.org/jira/secure/attachment/12375717/NUTCH-601v1.0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/secure/attachment/12375717/NUTCH-601v1.0&lt;/a&gt;&lt;br&gt;&amp;gt; .patch&amp;gt; 
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Do I have to make those code changes, or does Nutch 1.0 have another way
&lt;br&gt;&amp;gt; to do this?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Also, every time I do another crawl, I see the same file being fetched
&lt;br&gt;&amp;gt; over and over again. Is it appending the same url over and over to the
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;br&gt;which file?
&lt;br&gt;you can check the crawl date of this file with
&lt;br&gt;&lt;br&gt;reinhard@thord:&amp;gt;bin/nutch readdb &amp;nbsp;&amp;lt;crawldb&amp;gt; &amp;nbsp; -url &amp;lt;url&amp;gt;
&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; fetch list?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Thanks,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; - Vijaya
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Vijaya Peters
&lt;br&gt;&amp;gt; SRA International, Inc.
&lt;br&gt;&amp;gt; 4350 Fair Lakes Court North
&lt;br&gt;&amp;gt; Room 4004
&lt;br&gt;&amp;gt; Fairfax, VA &amp;nbsp;22033
&lt;br&gt;&amp;gt; Tel: &amp;nbsp;703-502-1184
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; www.sra.com &amp;lt;&lt;a href=&quot;http://www.sra.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sra.com/&lt;/a&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Named to FORTUNE's &amp;quot;100 Best Companies to Work For&amp;quot; list for 10
&lt;br&gt;&amp;gt; consecutive years
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Please consider the environment before printing this e-mail
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; This electronic message transmission contains information from SRA
&lt;br&gt;&amp;gt; International, Inc. which may be confidential, privileged or
&lt;br&gt;&amp;gt; proprietary. &amp;nbsp;The information is intended for the use of the
&lt;/div&gt;individual
&lt;br&gt;&amp;gt; or entity named above. &amp;nbsp;If you are not the intended recipient, be
&lt;br&gt;aware
&lt;br&gt;&amp;gt; that any disclosure, copying, distribution, or use of the contents of
&lt;br&gt;&amp;gt; this information is strictly prohibited. &amp;nbsp;If you have received this
&lt;br&gt;&amp;gt; electronic information in error, please notify us immediately by
&lt;br&gt;&amp;gt; telephone at 866-584-2143.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-to-force-recrawl-of-everything-tp26642378p26644288.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26644263</id>
	<title>RE: Problems with a new Installation of Nutch</title>
	<published>2009-12-04T07:35:08Z</published>
	<updated>2009-12-04T07:35:08Z</updated>
	<author>
		<name>Tom Landvoigt</name>
	</author>
	<content type="html">Hi,
&lt;br&gt;&lt;br&gt;I don't have Tomcat on this system because I don't want to use the web search. If it is necessary for Hadoop, which I don't think it is, I will install it.
&lt;br&gt;&lt;br&gt;nutch@ip-10-224-113-210:/nutch/search&amp;gt; ./bin/hadoop fs -ls /
&lt;br&gt;Found 1 items
&lt;br&gt;-rw-r--r-- &amp;nbsp; 2 nutch supergroup &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 2009-12-04 14:04 /url.txt
&lt;br&gt;nutch@ip-10-224-113-210:/nutch/search&amp;gt;
&lt;br&gt;&lt;br&gt;I get the normal answer but the file is empty.
&lt;br&gt;&lt;br&gt;-----Original Message-----
&lt;br&gt;From: MilleBii [mailto:&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26644263&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;millebii@...&lt;/a&gt;] 
&lt;br&gt;Sent: Friday, 4 December 2009 15:06
&lt;br&gt;To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26644263&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;Subject: Re: Problems with a new Installation of Nutch
&lt;br&gt;&lt;br&gt;Did you check the web interface? It gives a lot of info; you can
&lt;br&gt;even browse the file system.
&lt;br&gt;&lt;br&gt;Try hadoop fs -ls and see what it gives you.
&lt;br&gt;&lt;br&gt;2009/12/4, Tom Landvoigt &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26644263&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;tom.landvoigt@...&lt;/a&gt;&amp;gt;:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hello,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I hope someone can help me.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I installed nutch on 2 Amazon EC2 computers. Everything is fine but I
&lt;br&gt;&amp;gt; can't put data in the hdfs.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I formatted the namenode and started HDFS with start all.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; All Java processes start properly, but when I run hadoop fs
&lt;br&gt;&amp;gt; -put something / I get these logs:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; nutch@bla:/nutch/search&amp;gt; ./bin/hadoop fs -put
&lt;br&gt;&amp;gt; /tmp/hadoop-nutch-tasktracker.pid blub
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; put: Protocol not available
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; DATA NODE LOG on the master:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:15,566 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:15,582 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:16,483 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:16,614 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:16,882 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@1284fd4
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:16,883 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:17,827 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@39c8c1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:17,849 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:18,485 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt; SocketListener on 0.0.0.0:50075
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:18,485 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.Server@36527f
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,745 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,746 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; NAME NODE LOG
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:11,539 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:11,573 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:12,488 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@19fe451
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:12,565 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:12,891 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@1570945
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:12,891 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:13,569 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@11410e5
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:13,582 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:13,613 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt; SocketListener on 0.0.0.0:50070
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:13,613 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.Server@173ec72
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; SECONDARY NAMENODE LOG
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:19,163 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:19,207 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:20,365 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@174d93a
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:20,454 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:21,396 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@31f2a7
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:21,396 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:21,533 INFO &amp;nbsp;servlet.XMLConfiguration - No
&lt;br&gt;&amp;gt; WEB-INF/web.xml in file:/mnt/nutch/nutch-1.0/webapps/secondary. Serving
&lt;br&gt;&amp;gt; files and default/dynamic servlets only
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,206 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@383118
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,785 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,787 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt; SocketListener on 0.0.0.0:50090
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,787 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.Server@297ffb
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,787 WARN &amp;nbsp;namenode.SecondaryNameNode - Checkpoint
&lt;br&gt;&amp;gt; Period &amp;nbsp; :3600 secs (60 min)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,787 WARN &amp;nbsp;namenode.SecondaryNameNode - Log Size
&lt;br&gt;&amp;gt; Trigger &amp;nbsp; &amp;nbsp;:67108864 bytes (65536 KB)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:55:23,908 WARN &amp;nbsp;namenode.SecondaryNameNode - Checkpoint
&lt;br&gt;&amp;gt; done. New Image Size: 1056
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; HADOOP LOG
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,708 WARN &amp;nbsp;hdfs.DFSClient - DataStreamer Exception:
&lt;br&gt;&amp;gt; java.io.IOException: Unable to create new block.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,709 WARN &amp;nbsp;hdfs.DFSClient - Error Recovery for block
&lt;br&gt;&amp;gt; blk_5506837520665828594_1002 bad datanode[0] nodes == null
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,709 WARN &amp;nbsp;hdfs.DFSClient - Could not get block
&lt;br&gt;&amp;gt; locations. Source file &amp;quot;/user/nutch/blub/hadoop-nutch-tasktracker.pid&amp;quot; -
&lt;br&gt;&amp;gt; Aborting...
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; DATA NODE LOG on the slave
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:49,433 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:49,438 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,288 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,357 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,555 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@2016b0
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,555 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,816 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@118278a
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,820 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,849 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt; SocketListener on 0.0.0.0:50075
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,849 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.Server@b02928
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; HADOOP SITE XML
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;hdfs://(yes here is the right ip):9000&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; The name of the default file system. Either the literal string
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;quot;local&amp;quot; or a host:port for NDFS.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;!-- Specifies where the JobTracker (which coordinates the MapReduce jobs)
&lt;br&gt;&amp;gt; can be found. --&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.job.tracker&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;hdfs://(here too):9001&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; The host and port that the MapReduce job tracker runs at. If
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;quot;local&amp;quot;, then jobs are run in-process as a single map and
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; reduce task.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;!-- Specifies how many map jobs may run concurrently --&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.tasktracker.map.tasks.maximum&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; define mapred.map tasks to be number of slave hosts
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;!-- Specifies how many reduce jobs may run concurrently --&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.tasktracker.reduce.tasks.maximum&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; define mapred.reduce tasks to be number of slave hosts
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.child.java.opts&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;-Xmx1500m&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.jobtracker.restart.recover&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;!-- The next settings specify where the Hadoop FS stores its files
&lt;br&gt;&amp;gt; on the disk of each instance. --&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.name.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/name&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.data.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/data&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.system.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/mapreduce/system&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.local.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/mapreduce/local&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;!-- Specifies how many replicas of a file must exist in the file system
&lt;br&gt;&amp;gt; for it to be reachable. Initially 1 --&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.replication&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I hope someone can help me.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Thanks
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Tom
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Problems-with-a-new-Installation-of-Nutch-tp26641822p26644263.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26643852</id>
	<title>unsubscribe from nutch-user</title>
	<published>2009-12-04T07:07:02Z</published>
	<updated>2009-12-04T07:07:02Z</updated>
	<author>
		<name>Lukas, Ray</name>
	</author>
	<content type="html">Well, three is a charm. I need to move these to a different email as
&lt;br&gt;well. If you could, please also remove this email address from the
&lt;br&gt;list.
&lt;br&gt;Thanks 
&lt;br&gt;ray
&lt;br&gt;&lt;br&gt;-----Original Message-----
&lt;br&gt;From: M S Ram [mailto:&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26643852&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;msram@...&lt;/a&gt;] 
&lt;br&gt;Sent: Friday, December 04, 2009 10:01 AM
&lt;br&gt;To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26643852&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;Subject: Re: unsubscribe from nutch-user
&lt;br&gt;&lt;br&gt;Same here. Please remove my ID also from the mailing list.
&lt;br&gt;&lt;br&gt;Thanks,
&lt;br&gt;MSR
&lt;br&gt;&lt;br&gt;rengan xu wrote:
&lt;br&gt;&amp;gt; To whom it may concern,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Hello! I will be using this e-mail address for a special purpose and will use
&lt;br&gt;&amp;gt; another e-mail address to subscribe to nutch-user, so I want to unsubscribe from
&lt;br&gt;&amp;gt; nutch-user.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Thank you!
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/unsubscribe-from-nutch-user-tp26643628p26643852.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26643844</id>
	<title>Re: unsubscribe from nutch-user</title>
	<published>2009-12-04T07:06:36Z</published>
	<updated>2009-12-04T07:06:36Z</updated>
	<author>
		<name>prashant ullegaddi-2</name>
	</author>
	<content type="html">Take a look at it:
&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://lucene.apache.org/nutch/mailing_lists.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/nutch/mailing_lists.html&lt;/a&gt;&lt;br&gt;&lt;br&gt;or
&lt;br&gt;probably sending a blank mail to:
&lt;br&gt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26643844&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user-unsubscribe@...&lt;/a&gt; should also
&lt;br&gt;work.
&lt;br&gt;&lt;br&gt;Thanks,
&lt;br&gt;Prashant.
&lt;br&gt;&lt;br&gt;On Fri, Dec 4, 2009 at 8:30 PM, M S Ram &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26643844&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;msram@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Same here. Please remove my ID also from the mailing list.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Thanks,
&lt;br&gt;&amp;gt; MSR
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; rengan xu wrote:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; To whom it may concern,
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Hello! I will be using this e-mail address for a special purpose and will use
&lt;br&gt;&amp;gt;&amp;gt; another e-mail address to subscribe to nutch-user, so I want to unsubscribe from
&lt;br&gt;&amp;gt;&amp;gt; nutch-user.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Thank you!
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;Thanks,
&lt;br&gt;Prashant Ullegaddi,
&lt;br&gt;Search and Information Extraction Lab,
&lt;br&gt;IIIT-Hyderabad, INDIA.
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/unsubscribe-from-nutch-user-tp26643628p26643844.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26643804</id>
	<title>Re: unsubscribe from nutch-user</title>
	<published>2009-12-04T07:00:51Z</published>
	<updated>2009-12-04T07:00:51Z</updated>
	<author>
		<name>M S Ram</name>
	</author>
	<content type="html">Same here. Please remove my ID also from the mailing list.
&lt;br&gt;&lt;br&gt;Thanks,
&lt;br&gt;MSR
&lt;br&gt;&lt;br&gt;rengan xu wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; To whom it may concern,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Hello! I will be using this e-mail address for a special purpose and will use
&lt;br&gt;&amp;gt; another e-mail address to subscribe to nutch-user, so I want to unsubscribe from
&lt;br&gt;&amp;gt; nutch-user.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Thank you!
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/unsubscribe-from-nutch-user-tp26643628p26643804.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26643628</id>
	<title>unsubscribe from nutch-user</title>
	<published>2009-12-04T06:50:48Z</published>
	<updated>2009-12-04T06:50:48Z</updated>
	<author>
		<name>rengan xu</name>
	</author>
	<content type="html">To whom it may concern,
&lt;br&gt;&lt;br&gt;Hello! I will be using this e-mail address for a special purpose and will use
&lt;br&gt;another e-mail address to subscribe to nutch-user, so I want to unsubscribe from
&lt;br&gt;nutch-user.
&lt;br&gt;&lt;br&gt;Thank you!
&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;Regards,
&lt;br&gt;Xu, Rengan
&lt;br&gt;&lt;br&gt;School of Computer Science and Information Engineering,
&lt;br&gt;Hefei University of Technology, Hefei 230009, China
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/unsubscribe-from-nutch-user-tp26643628p26643628.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26643195</id>
	<title>Re: Can nutch pause, stop and start where it left off?</title>
	<published>2009-12-04T06:19:16Z</published>
	<updated>2009-12-04T06:19:16Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">Nutch behaves ...
&lt;br&gt;By default it will not fetch more than 1 URL every 5 s (the setting is
&lt;br&gt;changeable) from a given host (identified by name or IP, depending on the Nutch
&lt;br&gt;conf file).
&lt;br&gt;So you will actually find the opposite: it is very slow for a single
&lt;br&gt;site. Speed comes when you fetch several sites in parallel.
&lt;br&gt;&lt;br&gt;&lt;br&gt;2009/12/4, Jesse Hires &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26643195&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;jhires@...&lt;/a&gt;&amp;gt;:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Use the -topN flag to grab only a small number of URLs.
&lt;br&gt;&amp;gt; I believe there is also a setting you can put in nutch-site.xml that
&lt;br&gt;&amp;gt; can be used to slow down how many URLs you grab over time.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Jesse
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; int GetRandomNumber()
&lt;br&gt;&amp;gt; {
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;return 4; // Chosen by fair roll of dice
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; // Guaranteed to be random
&lt;br&gt;&amp;gt; } // xkcd.com
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; On Fri, Dec 4, 2009 at 4:10 AM, Mr Hadoop &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26643195&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mrhadoop@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I am just starting to learn Nutch. &amp;nbsp;One thing I wanted to know: can
&lt;br&gt;&amp;gt;&amp;gt; Nutch pause, stop and start indexing a site on an incremental daily basis?
&lt;br&gt;&amp;gt;&amp;gt; My concern is that Nutch will behave like a hog, crawling
&lt;br&gt;&amp;gt;&amp;gt; everything with huge bandwidth consumption and pissing off many site
&lt;br&gt;&amp;gt;&amp;gt; owners.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Can some experts shed some light on this?
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Can-nutch-pause%2C-stop-and-start-where-it-left-off--tp26641693p26643195.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26643017</id>
	<title>Re: Problems with a new Installation of Nutch</title>
	<published>2009-12-04T06:05:50Z</published>
	<updated>2009-12-04T06:05:50Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">Did you check the web interface? It gives a lot of info, and you can
&lt;br&gt;even browse the file system.
&lt;br&gt;&lt;br&gt;Try hadoop fs -ls and see what it gives you.
&lt;br&gt;&lt;br&gt;2009/12/4, Tom Landvoigt &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26643017&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;tom.landvoigt@...&lt;/a&gt;&amp;gt;:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hello,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I hope someone can help me.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I installed Nutch on 2 Amazon EC2 machines. Everything is fine, but I
&lt;br&gt;&amp;gt; can't put data into HDFS.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I formatted the namenode and started HDFS with start-all.sh.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; All Java processes start properly, but when I run hadoop fs
&lt;br&gt;&amp;gt; -put something / I get these logs:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; nutch@bla:/nutch/search&amp;gt; ./bin/hadoop fs -put
&lt;br&gt;&amp;gt; /tmp/hadoop-nutch-tasktracker.pid blub
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; put: Protocol not available
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; DATA NODE LOG on the master:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:15,566 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:15,582 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:16,483 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:16,614 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:16,882 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@1284fd4
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:16,883 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:17,827 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@39c8c1
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:17,849 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:18,485 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt; SocketListener on 0.0.0.0:50075
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:18,485 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.Server@36527f
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,745 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at
&lt;br&gt;&amp;gt; org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:
&lt;br&gt;&amp;gt; 79)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,746 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at
&lt;br&gt;&amp;gt; org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:
&lt;br&gt;&amp;gt; 79)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at
&lt;br&gt;&amp;gt; org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:
&lt;br&gt;&amp;gt; 79)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,747 ERROR datanode.DataNode -
&lt;br&gt;&amp;gt; DatanodeRegistration(10.224.113.210:50010,
&lt;br&gt;&amp;gt; storageID=DS-1135263253-10.224.113.210-50010-1259926637370,
&lt;br&gt;&amp;gt; infoPort=50075, ipcPort=50020):DataXceiver
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; java.io.EOFException
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readShort(DataInputStream.java:315)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at
&lt;br&gt;&amp;gt; org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:
&lt;br&gt;&amp;gt; 79)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:636)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; NAME NODE LOG
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:11,539 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:11,573 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:12,488 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@19fe451
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:12,565 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:12,891 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@1570945
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:12,891 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:13,569 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@11410e5
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:13,582 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:13,613 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt; SocketListener on 0.0.0.0:50070
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:13,613 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.Server@173ec72
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; SECONDARY NAMENODE LOG
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:19,163 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:19,207 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:20,365 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@174d93a
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:20,454 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:21,396 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@31f2a7
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:21,396 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:21,533 INFO &amp;nbsp;servlet.XMLConfiguration - No
&lt;br&gt;&amp;gt; WEB-INF/web.xml in file:/mnt/nutch/nutch-1.0/webapps/secondary. Serving
&lt;br&gt;&amp;gt; files and default/dynamic servlets only
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,206 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@383118
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,785 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,787 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt; SocketListener on 0.0.0.0:50090
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,787 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.Server@297ffb
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,787 WARN &amp;nbsp;namenode.SecondaryNameNode - Checkpoint
&lt;br&gt;&amp;gt; Period &amp;nbsp; :3600 secs (60 min)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:50:22,787 WARN &amp;nbsp;namenode.SecondaryNameNode - Log Size
&lt;br&gt;&amp;gt; Trigger &amp;nbsp; &amp;nbsp;:67108864 bytes (65536 KB)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:55:23,908 WARN &amp;nbsp;namenode.SecondaryNameNode - Checkpoint
&lt;br&gt;&amp;gt; done. New Image Size: 1056
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; HADOOP LOG
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,708 WARN &amp;nbsp;hdfs.DFSClient - DataStreamer Exception:
&lt;br&gt;&amp;gt; java.io.IOException: Unable to create new block.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at
&lt;br&gt;&amp;gt; org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(D
&lt;br&gt;&amp;gt; FSClient.java:2722)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at
&lt;br&gt;&amp;gt; org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.j
&lt;br&gt;&amp;gt; ava:1996)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; at
&lt;br&gt;&amp;gt; org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSCli
&lt;br&gt;&amp;gt; ent.java:2183)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,709 WARN &amp;nbsp;hdfs.DFSClient - Error Recovery for block
&lt;br&gt;&amp;gt; blk_5506837520665828594_1002 bad datanode[0] nodes == null
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:54:20,709 WARN &amp;nbsp;hdfs.DFSClient - Could not get block
&lt;br&gt;&amp;gt; locations. Source file &amp;quot;/user/nutch/blub/hadoop-nutch-tasktracker.pid&amp;quot; -
&lt;br&gt;&amp;gt; Aborting...
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; DATA NODE LOG on the slave
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:49,433 INFO &amp;nbsp;http.HttpServer - Version Jetty/5.1.4
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:49,438 INFO &amp;nbsp;util.Credential - Checking Resource
&lt;br&gt;&amp;gt; aliases
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,288 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@e45b5e
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,357 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/static,/static]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,555 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@2016b0
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,555 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/logs,/logs]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,816 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.servlet.WebApplicationHandler@118278a
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,820 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; WebApplicationContext[/,/]
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,849 INFO &amp;nbsp;http.SocketListener - Started
&lt;br&gt;&amp;gt; SocketListener on 0.0.0.0:50075
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-12-04 12:49:50,849 INFO &amp;nbsp;util.Container - Started
&lt;br&gt;&amp;gt; org.mortbay.jetty.Server@b02928
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; HADOOP SITE XML
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;hdfs://(yes here is the right ip):9000&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; The name of the default file system. Either the literal string
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;quot;local&amp;quot; or a host:port for NDFS.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;!-- Specifies where the JobTracker (which coordinates the MapReduce jobs)
&lt;br&gt;&amp;gt; can be found. --&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.job.tracker&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;hdfs://(here too):9001&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; The host and port that the MapReduce job tracker runs at. If
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;quot;local&amp;quot;, then jobs are run in-process as a single map and
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; reduce task.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;!-- Specifies how many map jobs may run at the same time --&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.tasktracker.map.tasks.maximum&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; define mapred.map tasks to be number of slave hosts
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;!-- Specifies how many reduce jobs may run at the same time --&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.tasktracker.reduce.tasks.maximum&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; define mapred.reduce tasks to be number of slave hosts
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.child.java.opts&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;-Xmx1500m&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.jobtracker.restart.recover&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;!-- The next settings specify where the Hadoop FS stores its files
&lt;br&gt;&amp;gt; on the hard disk of each instance. --&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.name.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/name&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.data.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/data&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.system.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/mapreduce/system&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;mapred.local.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/mapreduce/local&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;!-- Specifies how many replicas of a file must exist in the file system
&lt;br&gt;&amp;gt; for it to be reachable. Initially 1 --&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;dfs.replication&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I hope someone can help me.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Thanks
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Tom
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Problems-with-a-new-Installation-of-Nutch-tp26641822p26643017.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26642876</id>
	<title>Re: Can nutch pause, stop and start where it left off?</title>
	<published>2009-12-04T05:56:53Z</published>
	<updated>2009-12-04T05:56:53Z</updated>
	<author>
		<name>Jesse Hires</name>
	</author>
	<content type="html">use the -topN flag to only grab a small number of URLs.
&lt;br&gt;I believe there is also a setting you can put in nutch-site.xml that
&lt;br&gt;can be used to slow down the rate at which you fetch URLs.
&lt;br&gt;&lt;br&gt;Jesse
&lt;br&gt;&lt;br&gt;int GetRandomNumber()
&lt;br&gt;{
&lt;br&gt;&amp;nbsp; &amp;nbsp;return 4; // Chosen by fair roll of dice
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; // Guaranteed to be random
&lt;br&gt;} // xkcd.com
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;On Fri, Dec 4, 2009 at 4:10 AM, Mr Hadoop &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26642876&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mrhadoop@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&lt;br&gt;&amp;gt; I am just starting to learn nutch. &amp;nbsp;One thing I want to know is whether
&lt;br&gt;&amp;gt; nutch can pause, stop and start indexing a site on an incremental daily basis.
&lt;br&gt;&amp;gt; My concern is that nutch will behave like a hog, crawling
&lt;br&gt;&amp;gt; everything with huge bandwidth consumption and pissing off many site
&lt;br&gt;&amp;gt; owners.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Can some experts shed some light on this?
&lt;br&gt;&amp;gt;
&lt;br&gt;&lt;p&gt;From forum: &lt;a href=&quot;http://old.nabble.com/Nutch---User-f375.html&quot; embed=&quot;fixTarget[375]&quot; target=&quot;_top&quot; &gt;Nutch - User&lt;/a&gt;&lt;/p&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Can-nutch-pause%2C-stop-and-start-where-it-left-off--tp26641693p26642876.html" />
</entry>

</feed>
