<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<id>tag:old.nabble.com,2006:forum-375</id>
	<title>Nabble - Nutch - User</title>
	<updated>2009-11-27T05:56:30Z</updated>
	<link rel="self" type="application/atom+xml" href="http://old.nabble.com/Nutch---User-f375.xml" />
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch---User-f375.html" />
	<subtitle type="html"></subtitle>
	
<entry>
	<id>tag:old.nabble.com,2006:post-26542690</id>
	<title>Re: Nutch indexes less pages, then it fetches</title>
	<published>2009-11-27T05:56:30Z</published>
	<updated>2009-11-27T05:56:30Z</updated>
	<author>
		<name>J. Smith</name>
	</author>
	<content type="html">Does anybody know how to solve this problem?
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26542690.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26539901</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-27T01:35:28Z</published>
	<updated>2009-11-27T01:35:28Z</updated>
	<author>
		<name>Andrzej Bialecki</name>
	</author>
	<content type="html">MilleBii wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Interesting updates on the current run of 450K URLs:
&lt;br&gt;&amp;gt; + 30 minutes @ 3 Mbit/s
&lt;br&gt;&amp;gt; + drop to 1 Mbit/s (1/X shape)
&lt;br&gt;&amp;gt; + gradual improvement to 1.5 Mbit/s and steady for 7 hours
&lt;br&gt;&amp;gt; + sudden drop to 0.9 Mbit/s and steady for 4 hours
&lt;br&gt;&amp;gt; + up to 1.7 Mbit/s for 1 hour
&lt;br&gt;&amp;gt; + staircasing down to 0.5 Mbit/s in steps of 1 hour
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I don't know what conclusion to draw, but it is quite strange to have
&lt;br&gt;&amp;gt; those sudden variations in bandwidth, and overall it is very slow.
&lt;br&gt;&amp;gt; I can post the graph if people are interested.
&lt;/div&gt;&lt;br&gt;This most likely comes from the allocation of urls to map tasks, and the 
&lt;br&gt;maximum number of map tasks that you can run on your cluster. When tasks 
&lt;br&gt;finish their run, you see a sudden drop in speed, until the next task 
&lt;br&gt;starts running. Initially, I suspect that you have more tasks available 
&lt;br&gt;than the capacity of your cluster, so it's easy to fill the slots and 
&lt;br&gt;max the speed. Later on, slow map tasks tend to hang around, but still 
&lt;br&gt;some of them finish and make space for new tasks. As time goes on, 
&lt;br&gt;the majority of your tasks become slow tasks, so the overall speed 
&lt;br&gt;continues to drop.
&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;Best regards,
&lt;br&gt;Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;nbsp; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;[__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26539901.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26539695</id>
	<title>Re: Encoding the content got from Fetcher</title>
	<published>2009-11-27T01:16:55Z</published>
	<updated>2009-11-27T01:16:55Z</updated>
	<author>
		<name>Santiago Pérez</name>
	</author>
	<content type="html">I had already tried with: 
&lt;br&gt;&lt;br&gt;&amp;lt;property&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;name&amp;gt;parser.character.encoding.default&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;value&amp;gt;UTF-8&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;description&amp;gt;The character encoding to fall back to when no other information
&lt;br&gt;&amp;nbsp; is available&amp;lt;/description&amp;gt;
&lt;br&gt;&amp;lt;/property&amp;gt;
&lt;br&gt;&lt;br&gt;and the output of System.out.println(content.toString());
&lt;br&gt;is still the HTML code with the incorrect encoding...</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26539695.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26539406</id>
	<title>Re: Encoding the content got from Fetcher</title>
	<published>2009-11-27T00:45:46Z</published>
	<updated>2009-11-27T00:45:46Z</updated>
	<author>
		<name>Andrzej Bialecki</name>
	</author>
	<content type="html">Santiago Pérez wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Yes, I tried setting it in that configuration file to the Latin encoding
&lt;br&gt;&amp;gt; Windows-1250, but the value of this property does not affect the encoding
&lt;br&gt;&amp;gt; of the content (I also tried a nonexistent encoding and the result is the
&lt;br&gt;&amp;gt; same...)
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;name&amp;gt;parser.character.encoding.default&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;value&amp;gt;Windows-1250&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;lt;description&amp;gt;The character encoding to fall back to when no other
&lt;br&gt;&amp;gt; information
&lt;br&gt;&amp;gt; &amp;nbsp; is available&amp;lt;/description&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Has anyone had the same problem? (Hungarian or Polish people surely have...)
&lt;/div&gt;&lt;br&gt;The appearance of characters that you quoted in your other email 
&lt;br&gt;indicates that the problem may be the opposite - your pages seem to use 
&lt;br&gt;UTF-8, and you are trying to convert them using Windows-1250 ... Try 
&lt;br&gt;putting UTF-8 in this property, and see what happens.
&lt;br&gt;&lt;br&gt;Generally speaking, pages should declare their encoding, either in HTTP 
&lt;br&gt;headers or in &amp;lt;meta&amp;gt; tags, but often this declaration is either missing 
&lt;br&gt;or completely wrong. Nutch uses ICU4J CharsetDetector plus its own 
&lt;br&gt;heuristic (in util.EncodingDetector and in HtmlParser) that tries to 
&lt;br&gt;detect character encoding if it's missing or even if it's wrong - but 
&lt;br&gt;this is a tricky issue and sometimes results are unpredictable.
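&lt;br&gt;&lt;br&gt;The mismatch described above can be reproduced outside Nutch. The following is only a minimal sketch, assuming the JVM ships the windows-1250 charset; the class MojibakeDemo and its misdecode method are hypothetical names, not Nutch API. It encodes text as UTF-8 and then decodes the bytes with a chosen charset, so a wrong charset yields two garbled characters for each accented letter.
&lt;pre&gt;
```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with the named charset.
    static String misdecode(String text, String charsetName) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String original = "\u00e1\u00e9\u00f1"; // the letters a-acute, e-acute, n-tilde
        // Decoding with the matching charset restores the text.
        System.out.println(misdecode(original, "UTF-8").equals(original));
        // Decoding with windows-1250 turns each two-byte letter into two characters.
        System.out.println(misdecode(original, "windows-1250").equals(original));
        System.out.println(misdecode(original, "windows-1250").length());
    }
}
```
&lt;/pre&gt;
&lt;br&gt;If the first comparison holds and the second does not, the fetched bytes were valid UTF-8 all along and only the decoding step used the wrong charset.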
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26539406.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26536269</id>
	<title>Re: Encoding the content got from Fetcher</title>
	<published>2009-11-27T00:17:04Z</published>
	<updated>2009-11-27T00:17:04Z</updated>
	<author>
		<name>Santiago Pérez</name>
	</author>
	<content type="html">Yes, I tried setting it in that configuration file to the Latin encoding Windows-1250, but the value of this property does not affect the encoding of the content (I also tried a nonexistent encoding and the result is the same...)
&lt;br&gt;&lt;br&gt;&amp;lt;property&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;name&amp;gt;parser.character.encoding.default&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;value&amp;gt;Windows-1250&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;description&amp;gt;The character encoding to fall back to when no other information
&lt;br&gt;&amp;nbsp; is available&amp;lt;/description&amp;gt;
&lt;br&gt;&amp;lt;/property&amp;gt;
&lt;br&gt;&lt;br&gt;Has anyone had the same problem? (Hungarian or Polish people surely have...)
&lt;br&gt;&lt;br&gt;Thanks</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26536269.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26538898</id>
	<title>Re: Nutch near future - strategic directions</title>
	<published>2009-11-26T23:50:38Z</published>
	<updated>2009-11-26T23:50:38Z</updated>
	<author>
		<name>Sami Siren-2</name>
	</author>
	<content type="html">Andrzej Bialecki wrote:
&lt;br&gt;&amp;gt; Sami Siren wrote:
&lt;br&gt;&amp;gt;&amp;gt; Lots of good thoughts and ideas, easy to agree with.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Something for the &amp;quot;ease of use&amp;quot; category:
&lt;br&gt;&amp;gt;&amp;gt; -allow running on top of plain vanilla hadoop
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; What does it mean &amp;quot;plain vanilla&amp;quot; here? Do you mean the current DB 
&lt;br&gt;&amp;gt; implementation? That's the idea, we should aim for an abstract layer 
&lt;br&gt;&amp;gt; that can accommodate both HBase and plain MapFile-s.
&lt;br&gt;&lt;br&gt;I was simply trying to say that we should not bundle Hadoop anymore with 
&lt;br&gt;Nutch and instead just mention the specific version it should run on top 
&lt;br&gt;of as a requirement. I am not totally sure anymore if this is a good idea...
&lt;br&gt;&lt;br&gt;I do not know the details of the HBase branch. Would using HBase allow us 
&lt;br&gt;easy migration from one data model to another (without the complex code we 
&lt;br&gt;now have in our datums)? How easy is HBase to manage/setup/configure?
&lt;br&gt;&lt;br&gt;I think Avro looks promising as a data storage technology: has some 
&lt;br&gt;support for data model evolution, can be accessed &amp;quot;natively&amp;quot; from many 
&lt;br&gt;programming languages, is relatively well performing... The downside at 
&lt;br&gt;the moment is that it is not yet fully supported by hadoop mapred (I think).
&lt;br&gt;&lt;br&gt;&amp;gt;&amp;gt; -split into reusable components with nice and clean public api
&lt;br&gt;&amp;gt;&amp;gt; -publish mvn artifacts so developers can directly use mvn, ivy etc to 
&lt;br&gt;&amp;gt;&amp;gt; pull required dependencies for their specific crawler
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; +1, with slight preference towards ivy.
&lt;br&gt;&lt;br&gt;I was not clear here; I think I was referring to users of Nutch rather 
&lt;br&gt;than developers. In that case the choice of tool would be up to the 
&lt;br&gt;user once the artifacts are in the repo.
&lt;br&gt;&lt;br&gt;Also, I think what I wanted to say is more about the model of how 
&lt;br&gt;people who want to do some customization would operate, rather than a 
&lt;br&gt;technology choice.
&lt;br&gt;&lt;br&gt;Creating new plugin:
&lt;br&gt;-create your own build configuration (or use a template we provide)
&lt;br&gt;-implement plugin code
&lt;br&gt;-publish to m2 repository
&lt;br&gt;&lt;br&gt;Creating your custom crawler:
&lt;br&gt;-create your own build configuration (or use a template we might 
&lt;br&gt;provide), specify the dependencies you need (plugins basically, from 
&lt;br&gt;apache or from anybody else as long as they are available through some 
&lt;br&gt;repository)
&lt;br&gt;-potentially write some custom code
&lt;br&gt;&lt;br&gt;We could still provide a &amp;quot;default&amp;quot; Nutch crawler as well, as a build 
&lt;br&gt;configuration (basically just an XML file + some config) if we wanted.
&lt;br&gt;&lt;br&gt;The new Hadoop Maven artifacts also help with this vision, since we could 
&lt;br&gt;access the Hadoop APIs (and dependencies) through a similar mechanism.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt;&amp;gt; My biggest concern is in execution of this (or any other) plan.
&lt;br&gt;&amp;gt;&amp;gt; Some of the changes or improvements that have been proposed are quite 
&lt;br&gt;&amp;gt;&amp;gt; &amp;quot;heavy&amp;quot; in nature and would require large changes. I am just thinking 
&lt;br&gt;&amp;gt;&amp;gt; that would it still be better to take a fresh start instead of trying 
&lt;br&gt;&amp;gt;&amp;gt; to do this incrementally on top of existing code base.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Well ... that's (almost) what Dogacan did with the HBase port. I agree 
&lt;br&gt;&amp;gt; that we should not feel too constrained by the existing code base, but 
&lt;br&gt;&amp;gt; it would be silly to throw everything away and start from scratch - we 
&lt;br&gt;&amp;gt; need to find a middle ground. The crawler-commons and Tika projects 
&lt;br&gt;&amp;gt; should help us to get rid of the ballast and significantly reduce the 
&lt;br&gt;&amp;gt; size of our code.
&lt;/div&gt;&lt;br&gt;I am not aiming to throw everything away, just trying to relax the 
&lt;br&gt;backward-compatibility burden and give &amp;quot;innovation&amp;quot; a chance.
&lt;br&gt;&lt;br&gt;&amp;gt;&amp;gt; In the history of Nutch this approach is not something new (remember 
&lt;br&gt;&amp;gt;&amp;gt; map reduce?) and in my opinion it worked nicely then. Perhaps it is 
&lt;br&gt;&amp;gt;&amp;gt; different this time since the changes we are discussing now have many 
&lt;br&gt;&amp;gt;&amp;gt; abstract things hanging in the air, even fundamental ones.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Nutch 0.7 to 0.8 reused a lot of the existing code.
&lt;br&gt;&lt;br&gt;I am hoping that this time it will not be different.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Of course the rewrite approach means that it will take some time 
&lt;br&gt;&amp;gt;&amp;gt; before we actually get into the point where we can start adding real 
&lt;br&gt;&amp;gt;&amp;gt; substance (meaning new features etc).
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; So to summarize, I would go ahead and put together a branch &amp;quot;nutch 
&lt;br&gt;&amp;gt;&amp;gt; N.0&amp;quot; that would consist of (a.k.a my wish list, hope I am not being 
&lt;br&gt;&amp;gt;&amp;gt; too aggressive here):
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; -runs on top of plain hadoop
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; See above - what do you mean by that?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; -use osgi (or some other more optimal extension mechanism that fits 
&lt;br&gt;&amp;gt;&amp;gt; and is easy to use)
&lt;br&gt;&amp;gt;&amp;gt; -basic http/https crawling functionality (with &amp;quot;db abstraction&amp;quot; or 
&lt;br&gt;&amp;gt;&amp;gt; hbase directly and smart data structures that allow flexible and 
&lt;br&gt;&amp;gt;&amp;gt; efficient usage of the data)
&lt;br&gt;&amp;gt;&amp;gt; -basic solr integration for indexing/search
&lt;br&gt;&amp;gt;&amp;gt; -basic parsing with tika
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; After the basics are ok we would start adding and promoting any of the 
&lt;br&gt;&amp;gt;&amp;gt; hidden gems we might have, or some solutions for the interesting 
&lt;br&gt;&amp;gt;&amp;gt; challenges.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I believe that's more or less where Dogacan's port is right now, except 
&lt;br&gt;&amp;gt; it's not merged with the OSGI port.
&lt;/div&gt;&lt;br&gt;Are you sure OSGI is the way to go? I know it has all these nice 
&lt;br&gt;features and all, but for some reason I feel that we could live with 
&lt;br&gt;something simpler. From a functional point of view: just drop your jars into 
&lt;br&gt;the classpath and you're all set. So 2 changes here: 1. plugins are jars 2. 
&lt;br&gt;no individual classloaders for plugins.
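&lt;br&gt;&lt;br&gt;One way to picture the &amp;quot;jars on the classpath&amp;quot; model is the JDK's own java.util.ServiceLoader. It is named here only as a stand-in; the thread does not propose it. In this sketch the Parser interface and the ClasspathPlugins class are hypothetical. A plugin jar would register its implementation in a META-INF/services file named after the interface, and every implementation would be loaded by the one application classloader.
&lt;pre&gt;
```java
import java.util.ServiceLoader;

class ClasspathPlugins {
    // Hypothetical extension point that plugin jars would implement.
    interface Parser {
        String name();
    }

    // Count the Parser implementations registered on the classpath.
    static int countParsers() {
        int found = 0;
        for (Parser p : ServiceLoader.load(Parser.class)) {
            System.out.println("found plugin: " + p.name());
            found++;
        }
        return found;
    }

    public static void main(String[] args) {
        // With no plugin jars present this prints 0.
        System.out.println(countParsers());
    }
}
```
&lt;/pre&gt;
&lt;br&gt;Under these assumptions, adding a jar that carries such a registration to the classpath would be enough for countParsers() to see it, with no per-plugin classloader involved.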
&lt;br&gt;&lt;br&gt;--
&lt;br&gt;&amp;nbsp; Sami Siren
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch-near-future---strategic-directions-tp26269228p26538898.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26538506</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-26T22:52:47Z</published>
	<updated>2009-11-26T22:52:47Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">Interesting updates on the current run of 450K URLs:
&lt;br&gt;+ 30 minutes @ 3 Mbit/s
&lt;br&gt;+ drop to 1 Mbit/s (1/X shape)
&lt;br&gt;+ gradual improvement to 1.5 Mbit/s and steady for 7 hours
&lt;br&gt;+ sudden drop to 0.9 Mbit/s and steady for 4 hours
&lt;br&gt;+ up to 1.7 Mbit/s for 1 hour
&lt;br&gt;+ staircasing down to 0.5 Mbit/s in steps of 1 hour
&lt;br&gt;&lt;br&gt;I don't know what conclusion to draw, but it is quite strange to have
&lt;br&gt;those sudden variations in bandwidth, and overall it is very slow.
&lt;br&gt;I can post the graph if people are interested.
&lt;br&gt;&lt;br&gt;&lt;br&gt;2009/11/26 MilleBii &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26538506&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;millebii@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Yep, I will try right after this run ends... Which is likely tomorrow
&lt;br&gt;&amp;gt; by the sound of it.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Still how come there is a factor 6+ difference from one run to the
&lt;br&gt;&amp;gt; next ... Timing hosts blocking the queue maybe, but the probability to
&lt;br&gt;&amp;gt; get one in the queue can not be so different from one run to run.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009/11/26, Otis Gospodnetic &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26538506&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ogjunk-nutch@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt; &amp;gt; I think in the end what Ken Krugler did with Bixo (limiting crawl time)
&lt;br&gt;&amp;gt; and
&lt;br&gt;&amp;gt; &amp;gt; what Julien added in &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-770&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-770&lt;/a&gt; (plus&lt;br&gt;&amp;gt; &amp;gt; &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-769&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-769&lt;/a&gt;) are solutions to this
&lt;br&gt;&amp;gt; &amp;gt; problem, in addition to what Andrzej described below.
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Can you try &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-770&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-770&lt;/a&gt;&amp;nbsp;and
&lt;br&gt;&amp;gt; &amp;gt; &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-769&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-769&lt;/a&gt;&amp;nbsp;?
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Otis
&lt;br&gt;&amp;gt; &amp;gt; --
&lt;br&gt;&amp;gt; &amp;gt; Sematext is hiring -- &lt;a href=&quot;http://sematext.com/about/jobs.html?mls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://sematext.com/about/jobs.html?mls&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;gt; Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; ----- Original Message ----
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; From: Andrzej Bialecki &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26538506&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26538506&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; Sent: Wed, November 25, 2009 6:13:07 PM
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; Subject: Re: 100 fetches per second?
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; I have to say that I'm still puzzled. Here is the latest. I just
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; restarted a
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; run and then guess what :
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; got ultra-high speed : 8Mbits/s sustained for 1 hour where I could
&lt;br&gt;&amp;gt; only
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; get
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; 3Mbit/s max before (nota bits and not bytes as I said before).
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; A few samples show that I was running at 50 Fetches/sec ... not bad.
&lt;br&gt;&amp;gt; But
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; why
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; this high-speed on this run I haven't got the faintest idea.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; Then it drops and I get this kind of log output
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; 2009-11-25 23:28:28,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; 2009-11-25 23:28:29,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; 2009-11-25 23:28:29,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; 2009-11-25 23:28:30,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; 2009-11-25 23:28:30,585 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; Don't fully understand why it is oscillating between two queue size
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; never
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; mind.... but it is likely the end of the run since hadoop shows 99.99%
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; percent complete for the 2 map it generated.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; Would that be explained by a better URL mix ????
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; I suspect that you have a bunch of hosts that slowly trickle the
&lt;br&gt;&amp;gt; content,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; i.e.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; requests don't time out, crawl-delay is low, but the download speed is
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; very very
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; low due to the limits at their end (either physical or artificial).
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; The solution in that case would be to track a minimum avg. speed per
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; FetchQueue,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; and lock-out the queue if this number crosses the threshold (similarly
&lt;br&gt;&amp;gt; to
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; what
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; we do when we discover a crawl-delay that is too high).
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; In the meantime, you could add the number of FetchQueue-s to that
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; diagnostic
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; output, to see how many unique hosts are in the current working set.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; -- Best regards,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; [__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; ___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; -MilleBii-
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26538506.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26536875</id>
	<title>add parse-wml plugin to Nutch!</title>
	<published>2009-11-26T17:22:54Z</published>
	<updated>2009-11-26T17:22:54Z</updated>
	<author>
		<name>杨丰</name>
	</author>
	<content type="html">Hi,
&lt;br&gt;&amp;nbsp; I have to add a parse-wml plugin to Nutch. If one has already been written, please
&lt;br&gt;give me some advice.
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp;Thanks!
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/add-parse-wml-plugin-to-Nutch%21-tp26536875p26536875.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26536092</id>
	<title>Re: Encoding the content got from Fetcher</title>
	<published>2009-11-26T15:11:11Z</published>
	<updated>2009-11-26T15:11:11Z</updated>
	<author>
		<name>Fadzi Ushewokunze-2</name>
	</author>
	<content type="html">Hi,
&lt;br&gt;&lt;br&gt;Have you tried changing this property:
&lt;br&gt;&lt;br&gt;parser.character.encoding.default
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Hej,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I am a newbie in Nutch and I need some help with a problem because I do
&lt;br&gt;&amp;gt; not
&lt;br&gt;&amp;gt; find clear documentation.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; In the crawling process, when each FetcherThread gets the content,
&lt;br&gt;&amp;gt; it is formatted in a way that deletes the newline characters (&amp;quot;\n&amp;quot;)
&lt;br&gt;&amp;gt; and transforms useful Spanish characters such as á, é, í, ó, ú, ñ, ü in the
&lt;br&gt;&amp;gt; default
&lt;br&gt;&amp;gt; encoding like: Ã?ÃÂ¡, Ã?ÃÂ³, Ã?ÃÂ­, Ã?ÃÂ³, Ã?ÃÂº, Ã?ÃÂ±,
&lt;br&gt;&amp;gt; Ã?ÃÂ¼.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I would like to know if it is possible to set this default encoding (is
&lt;br&gt;&amp;gt; UTF-8?) to the one that I need (ASCII I guess).
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Thanks in advance ;)
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; View this message in context:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html&lt;/a&gt;&lt;br&gt;&amp;gt; Sent from the Nutch - User mailing list archive at Nabble.com.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26536092.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26534598</id>
	<title>Re: Broken segments ?</title>
	<published>2009-11-26T12:34:03Z</published>
	<updated>2009-11-26T12:34:03Z</updated>
	<author>
		<name>Andrzej Bialecki</name>
	</author>
	<content type="html">Mischa Tuffield wrote:
&lt;br&gt;&amp;gt; Hello All,
&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://people.apache.org/~hossman/#threadhijack&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://people.apache.org/~hossman/#threadhijack&lt;/a&gt;&lt;br&gt;&lt;br&gt;&amp;quot;When starting a new discussion on a mailing list, please do not reply 
&lt;br&gt;to an existing message, instead start a fresh email. &amp;nbsp;Even if you change 
&lt;br&gt;the subject line of your email, other mail headers still track which 
&lt;br&gt;thread you replied to and your question is &amp;quot;hidden&amp;quot; in that thread and 
&lt;br&gt;gets less attention. &amp;nbsp; It makes following discussions in the mailing 
&lt;br&gt;list archives particularly difficult.&amp;quot;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26534598.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26534239</id>
	<title>Broken segments ?</title>
	<published>2009-11-26T11:54:48Z</published>
	<updated>2009-11-26T11:54:48Z</updated>
	<author>
		<name>Mischa@Garlik</name>
	</author>
	<content type="html">Hello All, 
&lt;br&gt;&lt;br&gt;I was wondering if there is any way to check the integrity of a segment? As it stands, I can't create the index I want because a number of my segments fail as shown below: 
&lt;br&gt;&lt;br&gt;Is there any way to check whether my segments are OK? I guess I could always re-fetch them if need be.
&lt;br&gt;&lt;br&gt;Regards, and thanks in advance :)
&lt;br&gt;&lt;br&gt;Mischa
&lt;br&gt;&lt;br&gt;&lt;br&gt;&amp;lt;!--
&lt;br&gt;java.io.IOException: Could not obtain block: blk_8431627671702898365_95075 file=/user/nutch/crawl/segments/20091012145602/crawl_generate/part-00000
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readFully(DataInputStream.java:178)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:166)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:161)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.Child.main(Child.java:158)
&lt;br&gt;&lt;br&gt;...
&lt;br&gt;&lt;br&gt;java.io.IOException: Could not obtain block: blk_7970643458650610887_21674 file=/user/nutch/crawl/segments/20090618111426/content/part-00003/data
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readFully(DataInputStream.java:178)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.io.DataInputStream.readFully(DataInputStream.java:152)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.io.SequenceFile$Reader.&amp;lt;init&amp;gt;(SequenceFile.java:1428)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.io.SequenceFile$Reader.&amp;lt;init&amp;gt;(SequenceFile.java:1417)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.io.SequenceFile$Reader.&amp;lt;init&amp;gt;(SequenceFile.java:1412)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat.getRecordReader(SegmentMerger.java:150)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.Child.main(Child.java:158)
&lt;br&gt;--&amp;gt;
&lt;br&gt;&lt;br&gt;&lt;br&gt;On 26 Nov 2009, at 12:03, Santiago Pérez wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Hej,
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I am a newbie in Nutch and I need some help with a problem because I do not
&lt;br&gt;&amp;gt; find clear documentation.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; In the crawling process, when each FetcherThread gets the content,
&lt;br&gt;&amp;gt; it is formatted in a way which deletes the newline characters (&amp;quot;\n&amp;quot;)
&lt;br&gt;&amp;gt; and transforms useful Spanish characters such as á,é,í,ó,ú,ñ,ü in the default
&lt;br&gt;&amp;gt; encoding like: Ã?Â¡, Ã?Â³, Ã?Â , Ã?Â³, Ã?Âº, Ã?Â±, Ã?Â¼.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I would like to know if it is possible to set this default encoding (is
&lt;br&gt;&amp;gt; UTF-8?) to the one that I need (ASCII I guess).
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Thanks in advance ;)
&lt;br&gt;&amp;gt; -- 
&lt;br&gt;&amp;gt; View this message in context: &lt;a href=&quot;http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html&lt;/a&gt;&lt;br&gt;&amp;gt; Sent from the Nutch - User mailing list archive at Nabble.com.
&lt;br&gt;&amp;gt; 
&lt;/div&gt;&lt;br&gt;___________________________________
&lt;br&gt;Mischa Tuffield
&lt;br&gt;Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26534239&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mischa.tuffield@...&lt;/a&gt;
&lt;br&gt;Homepage - &lt;a href=&quot;http://mmt.me.uk/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://mmt.me.uk/&lt;/a&gt;&lt;br&gt;Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
&lt;br&gt;+44(0)20 8973 2465 &amp;nbsp;&lt;a href=&quot;http://www.garlik.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.garlik.com/&lt;/a&gt;&lt;br&gt;Registered in England and Wales 535 7233 VAT # 849 0517 11
&lt;br&gt;Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26534239.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26534032</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-26T11:32:56Z</published>
	<updated>2009-11-26T11:32:56Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">Yep, I will try right after this run ends... Which is likely tomorrow
&lt;br&gt;by the sound of it.
&lt;br&gt;&lt;br&gt;Still, how come there is a factor of 6+ difference from one run to the
&lt;br&gt;next? Timing hosts blocking the queue maybe, but the probability of
&lt;br&gt;getting one in the queue cannot be so different from one run to the next.
&lt;br&gt;&lt;br&gt;2009/11/26, Otis Gospodnetic &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26534032&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ogjunk-nutch@...&lt;/a&gt;&amp;gt;:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; I think in the end what Ken Krugler did with Bixo (limiting crawl time) and
&lt;br&gt;&amp;gt; what Julien added in &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-770&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-770&lt;/a&gt;&amp;nbsp;(plus
&lt;br&gt;&amp;gt; &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-769&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-769&lt;/a&gt;) are solutions to this
&lt;br&gt;&amp;gt; problem, in addition to what Andrzej described below.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Can you try &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-770&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-770&lt;/a&gt;&amp;nbsp;and
&lt;br&gt;&amp;gt; &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-769&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-769&lt;/a&gt;&amp;nbsp;?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Otis
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; Sematext is hiring -- &lt;a href=&quot;http://sematext.com/about/jobs.html?mls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://sematext.com/about/jobs.html?mls&lt;/a&gt;&lt;br&gt;&amp;gt; Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; ----- Original Message ----
&lt;br&gt;&amp;gt;&amp;gt; From: Andrzej Bialecki &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26534032&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26534032&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;&amp;gt;&amp;gt; Sent: Wed, November 25, 2009 6:13:07 PM
&lt;br&gt;&amp;gt;&amp;gt; Subject: Re: 100 fetches per second?
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; I have to say that I'm still puzzled. Here is the latest. I just
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; restarted a
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; run and then guess what :
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; got ultra-high speed : 8Mbits/s sustained for 1 hour where I could only
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; get
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; 3Mbit/s max before (nota bits and not bytes as I said before).
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; A few samples show that I was running at 50 Fetches/sec ... not bad. But
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; why
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; this high-speed on this run I haven't got the faintest idea.
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; Then it drops and I get this kind of log
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; 2009-11-25 23:28:28,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; 2009-11-25 23:28:29,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; 2009-11-25 23:28:29,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; 2009-11-25 23:28:30,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; 2009-11-25 23:28:30,585 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; Don't fully understand why it is oscillating between two queue size
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; never
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; mind.... but it is likely the end of the run since hadoop shows 99.99%
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; percent complete for the 2 map it generated.
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &amp;gt; Would that be explained by a better URL mix ????
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I suspect that you have a bunch of hosts that slowly trickle the content,
&lt;br&gt;&amp;gt;&amp;gt; i.e.
&lt;br&gt;&amp;gt;&amp;gt; requests don't time out, crawl-delay is low, but the download speed is
&lt;br&gt;&amp;gt;&amp;gt; very very
&lt;br&gt;&amp;gt;&amp;gt; low due to the limits at their end (either physical or artificial).
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; The solution in that case would be to track a minimum avg. speed per
&lt;br&gt;&amp;gt;&amp;gt; FetchQueue,
&lt;br&gt;&amp;gt;&amp;gt; and lock-out the queue if this number crosses the threshold (similarly to
&lt;br&gt;&amp;gt;&amp;gt; what
&lt;br&gt;&amp;gt;&amp;gt; we do when we discover a crawl-delay that is too high).
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; In the meantime, you could add the number of FetchQueue-s to that
&lt;br&gt;&amp;gt;&amp;gt; diagnostic
&lt;br&gt;&amp;gt;&amp;gt; output, to see how many unique hosts are in the current working set.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; -- Best regards,
&lt;br&gt;&amp;gt;&amp;gt; Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt;&amp;gt; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;&amp;gt;&amp;gt; [__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt;&amp;gt; ___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26534032.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26528468</id>
	<title>Encoding the content got from Fetcher</title>
	<published>2009-11-26T04:03:52Z</published>
	<updated>2009-11-26T04:03:52Z</updated>
	<author>
		<name>Santiago Pérez</name>
	</author>
	<content type="html">Hej,
&lt;br&gt;&lt;br&gt;I am a newbie in Nutch and I need some help with a problem because I do not find clear documentation.
&lt;br&gt;&lt;br&gt;In the crawling process, when each FetcherThread gets the content, it is formatted in a way which deletes the newline characters (&amp;quot;\n&amp;quot;) and transforms useful Spanish characters such as á,é,í,ó,ú,ñ,ü in the default encoding into sequences like: Ã?Â¡, Ã?Â³, Ã?Â­, Ã?Â³, Ã?Âº, Ã?Â±, Ã?Â¼.
&lt;br&gt;&lt;br&gt;I would like to know if it is possible to change this default encoding (is it UTF-8?) to the one that I need (ASCII, I guess).
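&lt;br&gt;&lt;br&gt;One possible reading of the garbled output above, offered as an assumption rather than a confirmed diagnosis of Nutch: the pattern looks like UTF-8 bytes that were decoded as ISO-8859-1 (Latin-1) twice, with the unprintable middle character displayed as &amp;quot;?&amp;quot;. If that is the case, ASCII would not help, since it cannot represent these letters at all. A minimal Python sketch of that round trip (plain Python, not Nutch code; all names are illustrative):
&lt;pre&gt;
```python
# Illustrative only: reproduces a double mis-decoding of UTF-8 text.
original = "\u00e1"  # the letter a with an acute accent

utf8_bytes = original.encode("utf-8")           # two bytes: C3 A1
once = utf8_bytes.decode("latin-1")             # UTF-8 bytes misread as Latin-1
twice = once.encode("utf-8").decode("latin-1")  # the same mistake applied again

print(utf8_bytes.hex())
print(repr(once))
print(repr(twice))  # four characters; the second one is unprintable

# Undoing the two wrong steps in reverse order recovers the original text.
repaired = (
    twice.encode("latin-1").decode("utf-8")
    .encode("latin-1").decode("utf-8")
)
print(repaired == original)
```
&lt;/pre&gt;
If this reading is right, the place to look would be wherever the fetched bytes are turned into text, so that they are decoded as UTF-8 exactly once.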
&lt;br&gt;&lt;br&gt;Thanks in advance ;)</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26528145</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-26T03:36:57Z</published>
	<updated>2009-11-26T03:36:57Z</updated>
	<author>
		<name>Otis Gospodnetic-2</name>
	</author>
	<content type="html">I think in the end what Ken Krugler did with Bixo (limiting crawl time) and what Julien added in &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-770&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-770&lt;/a&gt;&amp;nbsp;(plus &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-769&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-769&lt;/a&gt;) are solutions to this problem, in addition to what Andrzej described below.
&lt;br&gt;&lt;br&gt;Can you try &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-770&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-770&lt;/a&gt;&amp;nbsp;and &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-769&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-769&lt;/a&gt;&amp;nbsp;?
&lt;br&gt;&lt;br&gt;Otis
&lt;br&gt;--
&lt;br&gt;Sematext is hiring -- &lt;a href=&quot;http://sematext.com/about/jobs.html?mls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://sematext.com/about/jobs.html?mls&lt;/a&gt;&lt;br&gt;Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;----- Original Message ----
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; From: Andrzej Bialecki &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26528145&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt; To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26528145&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;&amp;gt; Sent: Wed, November 25, 2009 6:13:07 PM
&lt;br&gt;&amp;gt; Subject: Re: 100 fetches per second?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt; &amp;gt; I have to say that I'm still puzzled. Here is the latest. I just restarted a
&lt;br&gt;&amp;gt; &amp;gt; run and then guess what :
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; got ultra-high speed : 8Mbits/s sustained for 1 hour where I could only get
&lt;br&gt;&amp;gt; &amp;gt; 3Mbit/s max before (nota bits and not bytes as I said before).
&lt;br&gt;&amp;gt; &amp;gt; A few samples show that I was running at 50 Fetches/sec ... not bad. But why
&lt;br&gt;&amp;gt; &amp;gt; this high-speed on this run I haven't got the faintest idea.
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; Then it drops and I get this kind of log
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; 2009-11-25 23:28:28,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt; &amp;gt; 2009-11-25 23:28:29,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt; &amp;gt; 2009-11-25 23:28:29,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt; &amp;gt; 2009-11-25 23:28:30,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt; &amp;gt; 2009-11-25 23:28:30,585 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; &amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; Don't fully understand why it is oscillating between two queue size never
&lt;br&gt;&amp;gt; &amp;gt; mind.... but it is likely the end of the run since hadoop shows 99.99%
&lt;br&gt;&amp;gt; &amp;gt; percent complete for the 2 map it generated.
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; Would that be explained by a better URL mix ????
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I suspect that you have a bunch of hosts that slowly trickle the content, i.e. 
&lt;br&gt;&amp;gt; requests don't time out, crawl-delay is low, but the download speed is very very 
&lt;br&gt;&amp;gt; low due to the limits at their end (either physical or artificial).
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; The solution in that case would be to track a minimum avg. speed per FetchQueue, 
&lt;br&gt;&amp;gt; and lock-out the queue if this number crosses the threshold (similarly to what 
&lt;br&gt;&amp;gt; we do when we discover a crawl-delay that is too high).
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; In the meantime, you could add the number of FetchQueue-s to that diagnostic 
&lt;br&gt;&amp;gt; output, to see how many unique hosts are in the current working set.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; -- Best regards,
&lt;br&gt;&amp;gt; Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;&amp;gt; [__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt; ___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;/div&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26528145.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26527378</id>
	<title>remove fields</title>
	<published>2009-11-26T02:31:05Z</published>
	<updated>2009-11-26T02:31:05Z</updated>
	<author>
		<name>Fadzi Ushewokunze-2</name>
	</author>
	<content type="html">hi all,
&lt;br&gt;&lt;br&gt;there are 4 document fields in my index that i am not indexing anymore;
&lt;br&gt;&lt;br&gt;then i have 4 new fields i need to add to my index, so i created a new
&lt;br&gt;indexing filter.
&lt;br&gt;&lt;br&gt;how can i add these new fields while preserving the removed fields in
&lt;br&gt;the existing docs?
&lt;br&gt;&lt;br&gt;at the moment when i run bin/index all non-indexed fields get removed
&lt;br&gt;from the index;
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/remove-fields-tp26527378p26527378.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26525161</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T23:11:01Z</published>
	<updated>2009-11-25T23:11:01Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">Dennis,
&lt;br&gt;&lt;br&gt;Interesting info. I don't use the standard OPIC scorer but a slightly
&lt;br&gt;modified version which boosts pages with content that I'm looking for... so
&lt;br&gt;it could be that my pages are generally on slow servers.
&lt;br&gt;&lt;br&gt;Now heads-up, just started a new run with 450k URLs and it looks like I'm
&lt;br&gt;back to the previous behaviour :
&lt;br&gt;+ 4 Mb/s for a few minutes
&lt;br&gt;+ steady 1.9 Mb/s ... for ages probably since it really means around 10-15
&lt;br&gt;Fetch/s
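&lt;br&gt;&lt;br&gt;As a rough cross-check of these figures, the sketch below converts a bandwidth and a fetch rate into an implied average page size. It assumes that Mb/s means megabits per second in decimal units, and the function name is purely illustrative:
&lt;pre&gt;
```python
def avg_page_bytes(bits_per_s, fetches_per_s):
    """Average bytes per fetch implied by a bandwidth and a fetch rate."""
    return bits_per_s / 8 / fetches_per_s

# This run: steady 1.9 Mbit/s at roughly 10 to 15 fetches per second.
slow_low = avg_page_bytes(1_900_000, 15)
slow_high = avg_page_bytes(1_900_000, 10)

# Previous run (quoted below): 8 Mbit/s at about 50 fetches per second.
fast = avg_page_bytes(8_000_000, 50)

print(round(slow_low), round(slow_high), round(fast))
```
&lt;/pre&gt;
Under those assumptions both runs come out at roughly 16 to 24 kB per page (20 kB for the fast run), so the average page size looks about the same and the difference would lie in how many fetches per second get through.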
&lt;br&gt;&lt;br&gt;Why did the previous run go so fast ???? I'm still wondering
&lt;br&gt;&lt;br&gt;2009/11/26 Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26525161&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; One interesting thing we were seeing a while back on large crawls where we
&lt;br&gt;&amp;gt; were fetching the best scoring pages first, then next best, and so on, is
&lt;br&gt;&amp;gt; that lower scoring pages typically had worse response time rates and worse
&lt;br&gt;&amp;gt; timeout rates.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; So while the best scoring pages would respond very quickly and would have &amp;lt;
&lt;br&gt;&amp;gt; 1% timeout rate, the worst scoring pages would take x times as long (don't
&lt;br&gt;&amp;gt; remember the exact ratio but it was multiples) and could have as high as a
&lt;br&gt;&amp;gt; 50% timeout rate. &amp;nbsp;Just something to think about.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Dennis Kubes
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Andrzej Bialecki wrote:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; I have to say that I'm still puzzled. Here is the latest. I just
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; restarted a
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; run and then guess what :
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; got ultra-high speed : 8Mbits/s sustained for 1 hour where I could only
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; get
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 3Mbit/s max before (nota bits and not bytes as I said before).
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; A few samples show that I was running at 50 Fetches/sec ... not bad. But
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; why
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; this high-speed on this run I haven't got the faintest idea.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Then it drops and I get this kind of log
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 2009-11-25 23:28:28,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 2009-11-25 23:28:29,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 2009-11-25 23:28:29,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 2009-11-25 23:28:30,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 2009-11-25 23:28:30,585 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Don't fully understand why it is oscillating between two queue size never
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; mind.... but it is likely the end of the run since hadoop shows 99.99%
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; percent complete for the 2 map it generated.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Would that be explained by a better URL mix ????
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I suspect that you have a bunch of hosts that slowly trickle the content,
&lt;br&gt;&amp;gt;&amp;gt; i.e. requests don't time out, crawl-delay is low, but the download speed is
&lt;br&gt;&amp;gt;&amp;gt; very very low due to the limits at their end (either physical or
&lt;br&gt;&amp;gt;&amp;gt; artificial).
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; The solution in that case would be to track a minimum avg. speed per
&lt;br&gt;&amp;gt;&amp;gt; FetchQueue, and lock-out the queue if this number crosses the threshold
&lt;br&gt;&amp;gt;&amp;gt; (similarly to what we do when we discover a crawl-delay that is too high).
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; In the meantime, you could add the number of FetchQueue-s to that
&lt;br&gt;&amp;gt;&amp;gt; diagnostic output, to see how many unique hosts are in the current working
&lt;br&gt;&amp;gt;&amp;gt; set.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26525161.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26525059</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T23:01:24Z</published>
	<updated>2009-11-25T23:01:24Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">Did not think of that one... interesting
&lt;br&gt;&lt;br&gt;how &amp; where do you control the number of FetchQueues I only use the default
&lt;br&gt;so I assume there is only one.
&lt;br&gt;&lt;br&gt;How I should do if I want to analyze the content of a generated fetchlist ?
&lt;br&gt;&lt;br&gt;Is it possible to increase the number of fetcher on a single node
&lt;br&gt;configuration ? If not than I may turn to a configuration with two low specs
&lt;br&gt;servers vs one middle range... I will get more out my bucks ;-)
&lt;br&gt;&lt;br&gt;2009/11/26 Andrzej Bialecki &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26525059&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I have to say that I'm still puzzled. Here is the latest. I just restarted
&lt;br&gt;&amp;gt;&amp;gt; a
&lt;br&gt;&amp;gt;&amp;gt; run and then guess what :
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; got ultra-high speed : 8Mbits/s sustained for 1 hour where I could only
&lt;br&gt;&amp;gt;&amp;gt; get
&lt;br&gt;&amp;gt;&amp;gt; 3Mbit/s max before (nota bits and not bytes as I said before).
&lt;br&gt;&amp;gt;&amp;gt; A few samples show that I was running at 50 Fetches/sec ... not bad. But
&lt;br&gt;&amp;gt;&amp;gt; why
&lt;br&gt;&amp;gt;&amp;gt; this high-speed on this run I haven't got the faintest idea.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Then it drops and I get this kind of log
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 23:28:28,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 23:28:29,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 23:28:29,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 23:28:30,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 23:28:30,585 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Don't fully understand why it is oscillating between two queue size never
&lt;br&gt;&amp;gt;&amp;gt; mind.... but it is likely the end of the run since hadoop shows 99.99%
&lt;br&gt;&amp;gt;&amp;gt; percent complete for the 2 map it generated.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Would that be explained by a better URL mix ????
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I suspect that you have a bunch of hosts that slowly trickle the content,
&lt;br&gt;&amp;gt; i.e. requests don't time out, crawl-delay is low, but the download speed is
&lt;br&gt;&amp;gt; very very low due to the limits at their end (either physical or
&lt;br&gt;&amp;gt; artificial).
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; The solution in that case would be to track a minimum avg. speed per
&lt;br&gt;&amp;gt; FetchQueue, and lock-out the queue if this number crosses the threshold
&lt;br&gt;&amp;gt; (similarly to what we do when we discover a crawl-delay that is too high).
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; In the meantime, you could add the number of FetchQueue-s to that
&lt;br&gt;&amp;gt; diagnostic output, to see how many unique hosts are in the current working
&lt;br&gt;&amp;gt; set.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; Best regards,
&lt;br&gt;&amp;gt; Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt; &amp;nbsp;___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;&amp;gt; [__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt; ___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
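&lt;br&gt;&lt;br&gt;The alternating queue sizes in the log quoted above can be looked at with a small script. The sketch below groups the five sample lines by the millisecond part of their timestamp, on the assumption (not confirmed in this thread) that each fetcher task reports once per second at its own fixed offset. The regular expression and the names are illustrative and are not part of Nutch:
&lt;pre&gt;
```python
import re
from collections import defaultdict

# The five status lines from the quoted log, with the mail wrapping undone.
LOG_LINES = [
    "2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516",
    "2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120",
    "2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516",
    "2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120",
    "2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516",
]

STATUS = re.compile(
    r",(\d{3}) INFO\s+fetcher\.Fetcher - -activeThreads=(\d+), "
    r"spinWaiting=(\d+), fetchQueues\.totalSize=(\d+)"
)

def sizes_by_offset(lines):
    """Map each timestamp offset (in units of 10 ms) to the queue sizes seen there."""
    grouped = defaultdict(set)
    for line in lines:
        match = STATUS.search(line)
        if match is None:
            continue  # not a fetcher status line
        millis, _active, _spinning, total = (int(g) for g in match.groups())
        grouped[millis // 10].add(total)
    return dict(grouped)

print(sizes_by_offset(LOG_LINES))
```
&lt;/pre&gt;
On these five lines each offset maps to a single queue size, which would be consistent with two separate fetcher tasks (the 2 maps mentioned in the quote) each reporting its own queue, rather than one queue oscillating.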
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26525059.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26524482</id>
	<title>Re: Exception while slicing and parsing old segments without fetching</title>
	<published>2009-11-25T21:17:29Z</published>
	<updated>2009-11-25T21:17:29Z</updated>
	<author>
		<name>srinivasarao v</name>
	</author>
	<content type="html">Hi Vishal,
&lt;br&gt;&lt;br&gt;I got the same problem while running updatedb and invertlinks.
&lt;br&gt;Have you found a solution to the problem?
&lt;br&gt;Please let me know if you get the solution.
&lt;br&gt;&lt;br&gt;Thank You,
&lt;br&gt;Srinivas
&lt;br&gt;&lt;br&gt;On Mon, Aug 24, 2009 at 2:00 PM, vishal vachhani &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26524482&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;vishal.ce@...&lt;/a&gt;&amp;gt;wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hi All,
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; I had a big segment (size = 25 GB). Using the &amp;quot;mergesegs&amp;quot; utility with
&lt;br&gt;&amp;gt; &amp;quot;slice=20000&amp;quot;, I divided the segment into around 400 small segments. I
&lt;br&gt;&amp;gt; re-parsed (using the parse command) all the segments because we have made
&lt;br&gt;&amp;gt; changes to the parsing modules of Nutch. Parsing completed
&lt;br&gt;&amp;gt; successfully for all segments. The linkdb was also generated successfully.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I have the following questions.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 1. Do I need to run &amp;quot;updatedb&amp;quot; on the parsed segments again? When I run
&lt;br&gt;&amp;gt; the updatedb command on these segments, I get the following exception.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; ----------------------------------------------------------------------------------------
&lt;br&gt;&amp;gt; 2009-08-17 20:09:33,679 WARN &amp;nbsp;fs.FSInputChecker - Problem reading checksum file: java.io.EOFException. Ignoring.
&lt;br&gt;&amp;gt; 2009-08-17 20:09:33,700 WARN &amp;nbsp;mapred.LocalJobRunner - job_fmwtmv
&lt;br&gt;&amp;gt; java.lang.RuntimeException: Summer buffer overflow b.len=4096, off=0, summed=3584, read=4096, bytesPerSum=1, inSum=512
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:201)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at java.io.DataInputStream.readFully(DataInputStream.java:178)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1525)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1436)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1482)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:73)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
&lt;br&gt;&amp;gt; Caused by: java.lang.ArrayIndexOutOfBoundsException
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at java.util.zip.CRC32.update(CRC32.java:43)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:199)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;... 16 more
&lt;br&gt;&amp;gt; 2009-08-17 20:09:33,749 FATAL crawl.CrawlDb - CrawlDb update: java.io.IOException: Job failed!
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:97)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:199)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:152)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; -----------------------------------------------------------------------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2. When I run the &amp;quot;index&amp;quot; command on the segments, crawldb and linkdb, I get a
&lt;br&gt;&amp;gt; &amp;quot;Java heap space&amp;quot; error. With the single big segment and the same
&lt;br&gt;&amp;gt; Java heap configuration, we were able to index the segment. Are we
&lt;br&gt;&amp;gt; doing something wrong? We would be thankful if somebody could give us some
&lt;br&gt;&amp;gt; pointers on these problems.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --------------------------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp;java.lang.OutOfMemoryError: Java heap space
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at java.util.Arrays.copyOf(Arrays.java:2786)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at java.io.DataOutputStream.write(DataOutputStream.java:90)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.io.Text.writeString(Text.java:399)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.nutch.metadata.Metadata.write(Metadata.java:225)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.nutch.parse.ParseData.write(ParseData.java:165)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:154)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:65)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:315)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.nutch.indexer.Indexer.map(Indexer.java:362)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
&lt;br&gt;&amp;gt; 2009-08-23 23:19:28,569 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed!
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.nutch.indexer.Indexer.index(Indexer.java:329)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.nutch.indexer.Indexer.run(Indexer.java:351)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.nutch.indexer.Indexer.main(Indexer.java:334)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; ----------------------------------------------------------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; Thanks and Regards,
&lt;br&gt;&amp;gt; Vishal Vachhani
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;&lt;a href=&quot;http://cheyuta.wordpress.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://cheyuta.wordpress.com&lt;/a&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Exception-while-slicing-and-parsing-old-segments-without-fetching-tp25112345p26524482.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26522811</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T16:42:20Z</published>
	<updated>2009-11-25T16:42:20Z</updated>
	<author>
		<name>Dennis Kubes-2</name>
	</author>
	<content type="html">One interesting thing we were seeing a while back on large crawls where 
&lt;br&gt;we were fetching the best scoring pages first, then next best, and so 
&lt;br&gt;on, is that lower scoring pages typically had slower response times 
&lt;br&gt;and higher timeout rates.
&lt;br&gt;&lt;br&gt;So while the best scoring pages would respond very quickly and would 
&lt;br&gt;have a &amp;lt; 1% timeout rate, the worst scoring pages would take several times as 
&lt;br&gt;long (I don't remember the exact ratio, but it was a multiple) and could 
&lt;br&gt;have as high as a 50% timeout rate. &amp;nbsp;Just something to think about.
&lt;br&gt;&lt;br&gt;Dennis Kubes
&lt;br&gt;&lt;br&gt;Andrzej Bialecki wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt;&amp;gt; I have to say that I'm still puzzled. Here is the latest. I just 
&lt;br&gt;&amp;gt;&amp;gt; restarted a
&lt;br&gt;&amp;gt;&amp;gt; run and then guess what :
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; got ultra-high speed : 8Mbits/s sustained for 1 hour where I could 
&lt;br&gt;&amp;gt;&amp;gt; only get
&lt;br&gt;&amp;gt;&amp;gt; 3Mbit/s max before (note: these are bits, not bytes as I said before).
&lt;br&gt;&amp;gt;&amp;gt; A few samples show that I was running at 50 Fetches/sec ... not bad. 
&lt;br&gt;&amp;gt;&amp;gt; But why
&lt;br&gt;&amp;gt;&amp;gt; this high-speed on this run I haven't got the faintest idea.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Then it drops and I get this kind of log:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 23:28:28,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 23:28:29,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 23:28:29,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 23:28:30,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 23:28:30,585 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I don't fully understand why it is oscillating between two queue sizes, but never
&lt;br&gt;&amp;gt;&amp;gt; mind... it is likely the end of the run, since Hadoop shows 99.99%
&lt;br&gt;&amp;gt;&amp;gt; complete for the 2 map tasks it generated.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Would that be explained by a better URL mix ????
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I suspect that you have a bunch of hosts that slowly trickle the 
&lt;br&gt;&amp;gt; content, i.e. requests don't time out, crawl-delay is low, but the 
&lt;br&gt;&amp;gt; download speed is very very low due to the limits at their end (either 
&lt;br&gt;&amp;gt; physical or artificial).
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; The solution in that case would be to track a minimum avg. speed per 
&lt;br&gt;&amp;gt; FetchQueue, and lock-out the queue if this number crosses the threshold 
&lt;br&gt;&amp;gt; (similarly to what we do when we discover a crawl-delay that is too high).
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; In the meantime, you could add the number of FetchQueue-s to that 
&lt;br&gt;&amp;gt; diagnostic output, to see how many unique hosts are in the current 
&lt;br&gt;&amp;gt; working set.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26522811.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26521968</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T15:13:07Z</published>
	<updated>2009-11-25T15:13:07Z</updated>
	<author>
		<name>Andrzej Bialecki</name>
	</author>
	<content type="html">MilleBii wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; I have to say that I'm still puzzled. Here is the latest. I just restarted a
&lt;br&gt;&amp;gt; run and then guess what :
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; got ultra-high speed : 8Mbits/s sustained for 1 hour where I could only get
&lt;br&gt;&amp;gt; 3Mbit/s max before (note: these are bits, not bytes as I said before).
&lt;br&gt;&amp;gt; A few samples show that I was running at 50 Fetches/sec ... not bad. But why
&lt;br&gt;&amp;gt; this high-speed on this run I haven't got the faintest idea.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Then it drops and I get this kind of log:
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 2009-11-25 23:28:28,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt; 2009-11-25 23:28:29,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt; 2009-11-25 23:28:29,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt; 2009-11-25 23:28:30,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;&amp;gt; 2009-11-25 23:28:30,585 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100,
&lt;br&gt;&amp;gt; spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I don't fully understand why it is oscillating between two queue sizes, but never
&lt;br&gt;&amp;gt; mind... it is likely the end of the run, since Hadoop shows 99.99%
&lt;br&gt;&amp;gt; complete for the 2 map tasks it generated.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Would that be explained by a better URL mix ????
&lt;/div&gt;&lt;br&gt;I suspect that you have a bunch of hosts that slowly trickle the 
&lt;br&gt;content, i.e. requests don't time out, crawl-delay is low, but the 
&lt;br&gt;download speed is very very low due to the limits at their end (either 
&lt;br&gt;physical or artificial).
&lt;br&gt;&lt;br&gt;The solution in that case would be to track a minimum avg. speed per 
&lt;br&gt;FetchQueue, and lock-out the queue if this number crosses the threshold 
&lt;br&gt;(similarly to what we do when we discover a crawl-delay that is too high).
&lt;br&gt;&lt;br&gt;In the meantime, you could add the number of FetchQueue-s to that 
&lt;br&gt;diagnostic output, to see how many unique hosts are in the current 
&lt;br&gt;working set.
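&lt;br&gt;&lt;br&gt;To make the idea concrete, here is a minimal sketch of such a per-queue
&lt;br&gt;speed check. It is plain Python rather than Fetcher code, and every name in
&lt;br&gt;it (QueueSpeedTracker, min_bytes_per_sec, min_samples) is hypothetical; it
&lt;br&gt;only illustrates the proposal and is not an existing Nutch API.
&lt;pre&gt;
```python
class QueueSpeedTracker:
    """Tracks average download speed per fetch queue (all names hypothetical)."""

    def __init__(self, min_bytes_per_sec, min_samples=5):
        self.min_bytes_per_sec = min_bytes_per_sec
        self.min_samples = min_samples
        self.stats = {}      # queue id -> [total bytes, total seconds, fetch count]
        self.locked = set()  # queue ids that have been locked out

    def record(self, queue_id, num_bytes, seconds):
        total = self.stats.setdefault(queue_id, [0, 0.0, 0])
        total[0] += num_bytes
        total[1] += seconds
        total[2] += 1
        # Judge a queue only after a few fetches, so one slow page is not enough.
        enough = total[2] >= self.min_samples
        if enough and total[1] > 0 and self.min_bytes_per_sec > total[0] / total[1]:
            self.locked.add(queue_id)

    def is_locked(self, queue_id):
        return queue_id in self.locked

    def queue_count(self):
        # Number of distinct queues seen so far, for the diagnostic output.
        return len(self.stats)


tracker = QueueSpeedTracker(min_bytes_per_sec=1024, min_samples=3)
for _ in range(3):
    tracker.record("slow.example.com", 2000, 10.0)   # 200 bytes/s on average
    tracker.record("fast.example.com", 200000, 1.0)  # 200000 bytes/s on average

# prints: True False 2
print(tracker.is_locked("slow.example.com"),
      tracker.is_locked("fast.example.com"),
      tracker.queue_count())
```
&lt;/pre&gt;
&lt;br&gt;The min_samples guard is a design choice of this sketch, not something from
&lt;br&gt;the Fetcher: it delays the decision until the average is based on several fetches.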
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;Best regards,
&lt;br&gt;Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;nbsp; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;[__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26521968.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26521599</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T14:45:35Z</published>
	<updated>2009-11-25T14:45:35Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">I have to say that I'm still puzzled. Here is the latest. I just restarted a
&lt;br&gt;run, and guess what:
&lt;br&gt;&lt;br&gt;I got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could only get
&lt;br&gt;3 Mbit/s max before (note: these are bits, not bytes as I said before).
&lt;br&gt;A few samples show that I was running at 50 fetches/sec ... not bad. But why
&lt;br&gt;this run was so fast, I haven't got the faintest idea.
&lt;br&gt;&lt;br&gt;&lt;br&gt;Then it drops and I get this kind of log:
&lt;br&gt;&lt;br&gt;2009-11-25 23:28:28,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;2009-11-25 23:28:29,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;2009-11-25 23:28:29,584 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;2009-11-25 23:28:30,227 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120
&lt;br&gt;2009-11-25 23:28:30,585 INFO &amp;nbsp;fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
&lt;br&gt;&lt;br&gt;I don't fully understand why it is oscillating between two queue sizes, but never
&lt;br&gt;mind... it is likely the end of the run, since Hadoop shows 99.99%
&lt;br&gt;complete for the 2 map tasks it generated.
&lt;br&gt;&lt;br&gt;Would that be explained by a better URL mix?
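&lt;br&gt;&lt;br&gt;For anyone who wants to chart these numbers, a minimal sketch of a parser
&lt;br&gt;for the status lines above follows. It is plain Python, not part of Nutch; the
&lt;br&gt;names STATUS and parse_status are hypothetical, and it assumes each status
&lt;br&gt;entry sits on a single line.
&lt;pre&gt;
```python
import re

# Field names are taken from the log lines above; everything else is hypothetical.
STATUS = re.compile(
    r"activeThreads=(\d+),\s*spinWaiting=(\d+),\s*fetchQueues\.totalSize=(\d+)"
)


def parse_status(line):
    """Return (active, spin_waiting, queue_size), or None if the line does not match."""
    m = STATUS.search(line)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


sample = [
    "2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100, "
    "spinWaiting=100, fetchQueues.totalSize=516",
    "2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100, "
    "spinWaiting=100, fetchQueues.totalSize=120",
]

for line in sample:
    active, waiting, queued = parse_status(line)
    # Threads that are not spin-waiting are, roughly, the ones doing work.
    print(queued, "queued,", active - waiting, "threads busy")
```
&lt;/pre&gt;
&lt;br&gt;On the two sample lines this reports 516 and 120 queued URLs with 0 busy threads in both cases.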
&lt;br&gt;&lt;br&gt;2009/11/25 Mark Kerzner &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26521599&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;markkerzner@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Judging by how this discussion goes, there may be a need for URL mix
&lt;br&gt;&amp;gt; optimizer and for a fast crawler based on that. Is this something worth
&lt;br&gt;&amp;gt; pursuing? MilleBii, what do you think?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Mark
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; On Wed, Nov 25, 2009 at 3:44 PM, MilleBii &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26521599&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;millebii@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; The logs show that my fetch queue is full and my 100 threads are mostly
&lt;br&gt;&amp;gt; &amp;gt; spin
&lt;br&gt;&amp;gt; &amp;gt; waiting towards the end.
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Now the very last run (150kURLs) I can clearly see 4 phases:
&lt;br&gt;&amp;gt; &amp;gt; + very high speed : 3MB/s &amp;nbsp;for a few minutes
&lt;br&gt;&amp;gt; &amp;gt; + sudden speed drop around 1MB/s and flat for several hours
&lt;br&gt;&amp;gt; &amp;gt; + another speed drop to around 400kB/s for several hours
&lt;br&gt;&amp;gt; &amp;gt; + another speed drop to around &amp;nbsp;200kB/s for a few hours too.
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; So probably it is just a consequence of the url mix which isn't that good
&lt;br&gt;&amp;gt; &amp;gt; nota: I have limited to 1000 URLS per host, and there are about 20-30
&lt;br&gt;&amp;gt; hosts
&lt;br&gt;&amp;gt; &amp;gt; in the mix which get limited that way.
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; May be there is better mix of URLs possible ?
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; 2009/11/25 Julien Nioche &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26521599&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;lists.digitalpebble@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; or it is stuck on a couple of hosts which time out? The logs should
&lt;br&gt;&amp;gt; have
&lt;br&gt;&amp;gt; &amp;gt; a
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; trace with the number of active threads, which should give some
&lt;br&gt;&amp;gt; &amp;gt; indication
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; of what's happening.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; Julien
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; 2009/11/25 Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26521599&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; If it is waiting and the box is idle, my first thought is not DNS. &amp;nbsp;I
&lt;br&gt;&amp;gt; &amp;gt; just
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; put that up as one of the things people will run into. &amp;nbsp;Most likely
&lt;br&gt;&amp;gt; it
&lt;br&gt;&amp;gt; &amp;gt; is
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; uneven distribution of urls or something like that.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Dennis
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; Get your point... Although I thought high number of threads would do
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; exactly the same. Maybe I miss something.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; During my fetcher runs used bandwidth gets low pretty quickly, disk
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; I/O is low, the CPU is low... So it must be waiting for something
&lt;br&gt;&amp;gt; but
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; what ?
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; Could be the DNS cache which is full and any new request gets
&lt;br&gt;&amp;gt; forwarded
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; to the master DNS of my ISP,
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; Any idea how to check that ? I'm not familiar with Bind myself...
&lt;br&gt;&amp;gt; What
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; is the typical rate you can get how many dns request/s ?
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; 2009/11/25, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26521599&amp;i=4&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; It is not about the local DNS caching as much as having local DNS
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; servers. &amp;nbsp;Too many fetchers hitting a centralized DNS server can
&lt;br&gt;&amp;gt; act
&lt;br&gt;&amp;gt; &amp;gt; as
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; a DOS attack and slow down the entire fetching system.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; For example say I have a single centralized DNS server for my
&lt;br&gt;&amp;gt; &amp;gt; network.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; And say I have 2 map task per machine, 50 machines, 20 threads per
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; task.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; &amp;nbsp;That would be 50 * 2 * 20 = 2000 fetchers. &amp;nbsp;Meaning a possibility
&lt;br&gt;&amp;gt; of
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; 2000 &amp;nbsp;DNS requests / sec. &amp;nbsp;Most local DNS servers for smaller
&lt;br&gt;&amp;gt; &amp;gt; networks
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; can't handle that. &amp;nbsp;If everything is hitting a centralized DNS and
&lt;br&gt;&amp;gt; &amp;gt; that
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; DNS takes 1-3 sec per request because of too many requests. &amp;nbsp;The
&lt;br&gt;&amp;gt; &amp;gt; entire
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; fetching system stalls.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; Hitting a secondary larger cache, such as OpenDNS, can have an
&lt;br&gt;&amp;gt; effect
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; because you are making one hop to get the name versus multiple hops
&lt;br&gt;&amp;gt; &amp;gt; to
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; root servers then domain servers.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; Working off of a single server these issues don't show up as much
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; because there aren't enough fetchers.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; Dennis Kubes
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; Why would DNS local caching work... It only is working if you are
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; going to crawl often the same site ... In which case you are hit
&lt;br&gt;&amp;gt; by
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; the politeness.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; if you have segments with only/mainly different sites it is
&lt;br&gt;&amp;gt; &amp;gt; not/really
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; going to help.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; So far I have not seen my quad core + 100mb/s + pseudo distributed
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; hadoop &amp;nbsp;going faster than 10 fetch / s... Let me check the DNS and
&lt;br&gt;&amp;gt; I
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; will tell you.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; I vote for 100 Fetch/s not sure how to get it though
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; 2009/11/24, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26521599&amp;i=5&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi Mark,
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; I just put this up on the wiki. &amp;nbsp;Hope it helps:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://wiki.apache.org/nutch/OptimizingCrawls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/OptimizingCrawls&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark Kerzner wrote:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi, guys,
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; my goal is to do my crawls at 100 fetches per second, observing,
&lt;br&gt;&amp;gt; &amp;gt; of
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; course,
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; polite crawling. But, when URLs are all different domains, what
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; theoretically would stop some software from downloading from 100
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; domains
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; at
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; once, achieving the desired speed?
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; But, whatever I do, I can't make Nutch crawl at that speed. Even
&lt;br&gt;&amp;gt; &amp;gt; if
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; it
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; starts at a few dozen URLs/second, it slows down at the end (as
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; discussed
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; by
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; many and by Krugler).
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Should I write something of my own, or are their fast crawlers?
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Thanks!
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; --
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; DigitalPebble Ltd
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; &lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; --
&lt;br&gt;&amp;gt; &amp;gt; -MilleBii-
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26521599.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26520902</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T13:50:31Z</published>
	<updated>2009-11-25T13:50:31Z</updated>
	<author>
		<name>Mark Kerzner-2</name>
	</author>
	<content type="html">Judging by how this discussion is going, there may be a need for a URL mix
&lt;br&gt;optimizer and for a fast crawler based on it. Is this something worth
&lt;br&gt;pursuing? MilleBii, what do you think?
&lt;br&gt;&lt;br&gt;Mark
&lt;br&gt;&lt;br&gt;On Wed, Nov 25, 2009 at 3:44 PM, MilleBii &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26520902&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;millebii@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; The logs show that my fetch queue is full and my 100 threads are mostly
&lt;br&gt;&amp;gt; spin
&lt;br&gt;&amp;gt; waiting towards the end.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Now the very last run (150kURLs) I can clearly see 4 phases:
&lt;br&gt;&amp;gt; + very high speed : 3MB/s &amp;nbsp;for a few minutes
&lt;br&gt;&amp;gt; + sudden speed drop around 1MB/s and flat for several hours
&lt;br&gt;&amp;gt; + another speed drop to around 400kB/s for several hours
&lt;br&gt;&amp;gt; + another speed drop to around &amp;nbsp;200kB/s for a few hours two.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; So probably it is just a consequence of the url mix which isn't that good
&lt;br&gt;&amp;gt; nota: I have limited to 1000 URLS per host, and there are about 20-30 hosts
&lt;br&gt;&amp;gt; in the mix which get limited that way.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; May be there is better mix of URLs possible ?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009/11/25 Julien Nioche &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26520902&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;lists.digitalpebble@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; or it is stuck on a couple of hosts which time out? The logs should have
&lt;br&gt;&amp;gt; a
&lt;br&gt;&amp;gt; &amp;gt; trace with the number of active threads, which should give some
&lt;br&gt;&amp;gt; indication
&lt;br&gt;&amp;gt; &amp;gt; of what's happening.
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Julien
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; 2009/11/25 Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26520902&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; If it is waiting and the box is idle, my first though is not dns. &amp;nbsp;I
&lt;br&gt;&amp;gt; just
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; put that up as one of the things people will run into. &amp;nbsp;Most likely it
&lt;br&gt;&amp;gt; is
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; uneven distribution of urls or something like that.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; Dennis
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt; Get your point... Although I thought high number of threads would do
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt; exactly the same. Maybe I miss something.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt; During my fetcher runs used bandwidth gets low pretty quickly, disk
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt; I/O is low, the CPU is low... So it must be waiting for something but
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt; what ?
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt; Could be the DNS cache wich is full and any new request gets forwarded
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt; to the master DNS of my ISP,
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt; Any idea how to check that ? I'm not familiar with Bind myself... What
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt; is the typical rate you can get how many dns request/s ?
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt; 2009/11/25, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26520902&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; It is not about the local DNS caching as much as having local DNS
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; servers. &amp;nbsp;Too many fetchers hitting a centralized DNS server can act
&lt;br&gt;&amp;gt; as
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; a DOS attack and slow down the entire fetching system.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; For example say I have a single centralized DNS server for my
&lt;br&gt;&amp;gt; network.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; And say I have 2 map task per machine, 50 machines, 20 threads per
&lt;br&gt;&amp;gt; &amp;gt; task.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; &amp;nbsp;That would be 50 * 2 * 20 = 2000 fetchers. &amp;nbsp;Meaning a possibility of
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; 2000 &amp;nbsp;DNS requests / sec. &amp;nbsp;Most local DNS servers for smaller
&lt;br&gt;&amp;gt; networks
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; can't handle that. &amp;nbsp;If everything is hitting a centralized DNS and
&lt;br&gt;&amp;gt; that
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; DNS takes 1-3 sec per request because of too many requests. &amp;nbsp;The
&lt;br&gt;&amp;gt; entire
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; fetching system stalls.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; Hitting a secondary larger cache, such as OpenDNS, can have an effect
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; because you are making one hop to get the name versus multiple hops
&lt;br&gt;&amp;gt; to
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; root servers then domain servers.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; Working off of a single server these issues don't show up as much
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; because there aren't enough fetchers.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; Dennis Kubes
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; Why would DNS local caching work... It only is working if you are
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; going to crawl often the same site ... In which case you are hit by
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; the politeness.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; if you have segments with only/mainly different sites it is
&lt;br&gt;&amp;gt; not/really
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; going to help.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; So far I have not seen my quad core + 100mb/s + pseudo distributed
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; hadoop &amp;nbsp;going faster than 10 fetch / s... Let me check the DNS and I
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; will tell you.
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; I vote for 100 Fetch/s not sure how to get it though
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; 2009/11/24, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26520902&amp;i=4&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi Mark,
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; I just put this up on the wiki. &amp;nbsp;Hope it helps:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://wiki.apache.org/nutch/OptimizingCrawls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/OptimizingCrawls&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark Kerzner wrote:
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi, guys,
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; my goal is to do by crawls at 100 fetches per second, observing,
&lt;br&gt;&amp;gt; of
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; course,
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; polite crawling. But, when URLs are all different domains, what
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; theoretically would stop some software from downloading from 100
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; domains
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; at
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; once, achieving the desired speed?
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; But, whatever I do, I can't make Nutch crawl at that speed. Even
&lt;br&gt;&amp;gt; if
&lt;br&gt;&amp;gt; &amp;gt; it
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; starts at a few dozen URLs/second, it slows down at the end (as
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; discussed
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; by
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; many and by Krugler).
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Should I write something of my own, or are their fast crawlers?
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Thanks!
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; --
&lt;br&gt;&amp;gt; &amp;gt; DigitalPebble Ltd
&lt;br&gt;&amp;gt; &amp;gt; &lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; -MilleBii-
&lt;br&gt;&amp;gt;
&lt;br&gt;&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26520902.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26520818</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T13:44:19Z</published>
	<updated>2009-11-25T13:44:19Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">The logs show that my fetch queue is full and that my 100 threads are mostly
&lt;br&gt;spin-waiting towards the end.
&lt;br&gt;&lt;br&gt;In the very last run (150k URLs) I can clearly see 4 phases:
&lt;br&gt;+ very high speed: 3MB/s for a few minutes
&lt;br&gt;+ sudden speed drop to around 1MB/s, then flat for several hours
&lt;br&gt;+ another speed drop to around 400kB/s for several hours
&lt;br&gt;+ another speed drop to around 200kB/s for a few hours too.
&lt;br&gt;&lt;br&gt;So it is probably just a consequence of the URL mix, which isn't that good.
&lt;br&gt;Note: I have set a limit of 1000 URLs per host, and there are about 20-30 hosts
&lt;br&gt;in the mix which get limited that way.
&lt;br&gt;&lt;br&gt;Maybe a better mix of URLs is possible?
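&lt;br&gt;&lt;br&gt;A rough sketch of why the URL mix could produce this kind of staircase, assuming each host is fetched serially with a fixed politeness delay. The 5-second delay, the 25 capped hosts, the 125,000 single-URL hosts and the function names are illustrative assumptions, not values measured in this crawl:

```python
def politeness_floor_seconds(urls_per_host, delay_s):
    """Lower bound on run time: the busiest host is fetched one URL per delay."""
    return max(urls_per_host.values()) * delay_s


def tail_rate(active_hosts, delay_s):
    """Upper bound on URLs/s once only `active_hosts` hosts still have URLs."""
    return active_hosts / delay_s


# Assumed mix: 25 hosts capped at 1000 URLs each, plus 125,000 one-URL hosts.
hosts = {f"big{i}": 1000 for i in range(25)}
hosts.update({f"small{i}": 1 for i in range(125000)})

delay_s = 5.0  # assumed per-host politeness delay in seconds
total_urls = sum(hosts.values())
floor_s = politeness_floor_seconds(hosts, delay_s)

print("total URLs:", total_urls)
print("minimum run time (s):", floor_s)
print("best possible average rate (URLs/s):", total_urls / floor_s)
print("rate once only the capped hosts remain (URLs/s):", tail_rate(25, delay_s))
```

&lt;br&gt;With these assumed numbers the capped hosts alone need 5000 s, so the run cannot average more than 30 URLs/s, and once the small hosts are exhausted at most 5 URLs/s remain, whatever the thread count.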
&lt;br&gt;&lt;br&gt;2009/11/25 Julien Nioche &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26520818&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;lists.digitalpebble@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; or it is stuck on a couple of hosts which time out? The logs should have a
&lt;br&gt;&amp;gt; trace with the number of active threads, which should give some indication
&lt;br&gt;&amp;gt; of what's happening.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Julien
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009/11/25 Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26520818&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; If it is waiting and the box is idle, my first though is not dns. &amp;nbsp;I just
&lt;br&gt;&amp;gt; &amp;gt; put that up as one of the things people will run into. &amp;nbsp;Most likely it is
&lt;br&gt;&amp;gt; &amp;gt; uneven distribution of urls or something like that.
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Dennis
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; Get your point... Although I thought high number of threads would do
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; exactly the same. Maybe I miss something.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; During my fetcher runs used bandwidth gets low pretty quickly, disk
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; I/O is low, the CPU is low... So it must be waiting for something but
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; what ?
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; Could be the DNS cache wich is full and any new request gets forwarded
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; to the master DNS of my ISP,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; Any idea how to check that ? I'm not familiar with Bind myself... What
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; is the typical rate you can get how many dns request/s ?
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; 2009/11/25, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26520818&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; It is not about the local DNS caching as much as having local DNS
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; servers. &amp;nbsp;Too many fetchers hitting a centralized DNS server can act as
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; a DOS attack and slow down the entire fetching system.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; For example say I have a single centralized DNS server for my network.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; And say I have 2 map task per machine, 50 machines, 20 threads per
&lt;br&gt;&amp;gt; task.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; &amp;nbsp;That would be 50 * 2 * 20 = 2000 fetchers. &amp;nbsp;Meaning a possibility of
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; 2000 &amp;nbsp;DNS requests / sec. &amp;nbsp;Most local DNS servers for smaller networks
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; can't handle that. &amp;nbsp;If everything is hitting a centralized DNS and that
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; DNS takes 1-3 sec per request because of too many requests. &amp;nbsp;The entire
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; fetching system stalls.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; Hitting a secondary larger cache, such as OpenDNS, can have an effect
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; because you are making one hop to get the name versus multiple hops to
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; root servers then domain servers.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; Working off of a single server these issues don't show up as much
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; because there aren't enough fetchers.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; Dennis Kubes
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; Why would DNS local caching work... It only is working if you are
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; going to crawl often the same site ... In which case you are hit by
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; the politeness.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; if you have segments with only/mainly different sites it is not/really
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; going to help.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; So far I have not seen my quad core + 100mb/s + pseudo distributed
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; hadoop &amp;nbsp;going faster than 10 fetch / s... Let me check the DNS and I
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; will tell you.
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; I vote for 100 Fetch/s not sure how to get it though
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt; 2009/11/24, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26520818&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi Mark,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; I just put this up on the wiki. &amp;nbsp;Hope it helps:
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://wiki.apache.org/nutch/OptimizingCrawls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/OptimizingCrawls&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark Kerzner wrote:
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi, guys,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; my goal is to do by crawls at 100 fetches per second, observing, of
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; course,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; polite crawling. But, when URLs are all different domains, what
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; theoretically would stop some software from downloading from 100
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; domains
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; at
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; once, achieving the desired speed?
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; But, whatever I do, I can't make Nutch crawl at that speed. Even if
&lt;br&gt;&amp;gt; it
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; starts at a few dozen URLs/second, it slows down at the end (as
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; discussed
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; by
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; many and by Krugler).
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Should I write something of my own, or are their fast crawlers?
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Thanks!
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; DigitalPebble Ltd
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;-MilleBii-
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26520818.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26518193</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T10:46:09Z</published>
	<updated>2009-11-25T10:46:09Z</updated>
	<author>
		<name>Julien Nioche-4</name>
	</author>
	<content type="html">Or is it stuck on a couple of hosts which time out? The logs should have a
&lt;br&gt;trace with the number of active threads, which should give some indication
&lt;br&gt;of what's happening.
&lt;br&gt;&lt;br&gt;Julien
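&lt;br&gt;&lt;br&gt;A minimal sketch of how such a trace could be summarised. It assumes the fetcher writes status lines of the form "-activeThreads=100, spinWaiting=90, fetchQueues.totalSize=500"; that exact format, and the helper names, are assumptions to check against your own log file:

```python
import re

# Assumed status-line layout; adjust the pattern to match the real log.
STATUS = re.compile(
    r"activeThreads=(\d+), spinWaiting=(\d+), fetchQueues\.totalSize=(\d+)"
)


def parse_status(line):
    """Return the three counters from one status line, or None if absent."""
    match = STATUS.search(line)
    if match is None:
        return None
    active, spinning, queued = (int(g) for g in match.groups())
    return {"active": active, "spinning": spinning, "queued": queued}


def idle_fraction(lines):
    """Share of thread slots that were spin-waiting over all status lines."""
    samples = [s for s in map(parse_status, lines) if s is not None]
    total_active = sum(s["active"] for s in samples)
    if total_active == 0:
        return 0.0
    return sum(s["spinning"] for s in samples) / total_active


sample_log = [
    "INFO fetcher - -activeThreads=100, spinWaiting=90, fetchQueues.totalSize=500",
    "INFO fetcher - fetching some page",
    "INFO fetcher - -activeThreads=100, spinWaiting=70, fetchQueues.totalSize=400",
]
print("idle fraction:", idle_fraction(sample_log))
```

&lt;br&gt;A high idle fraction together with a non-empty queue would suggest that threads are waiting on per-host politeness or slow hosts rather than on a lack of URLs.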
&lt;br&gt;&lt;br&gt;&lt;br&gt;2009/11/25 Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26518193&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; If it is waiting and the box is idle, my first though is not dns. &amp;nbsp;I just
&lt;br&gt;&amp;gt; put that up as one of the things people will run into. &amp;nbsp;Most likely it is
&lt;br&gt;&amp;gt; uneven distribution of urls or something like that.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Get your point... Although I thought high number of threads would do
&lt;br&gt;&amp;gt;&amp;gt; exactly the same. Maybe I miss something.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; During my fetcher runs used bandwidth gets low pretty quickly, disk
&lt;br&gt;&amp;gt;&amp;gt; I/O is low, the CPU is low... So it must be waiting for something but
&lt;br&gt;&amp;gt;&amp;gt; what ?
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Could be the DNS cache wich is full and any new request gets forwarded
&lt;br&gt;&amp;gt;&amp;gt; to the master DNS of my ISP,
&lt;br&gt;&amp;gt;&amp;gt; Any idea how to check that ? I'm not familiar with Bind myself... What
&lt;br&gt;&amp;gt;&amp;gt; is the typical rate you can get how many dns request/s ?
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009/11/25, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26518193&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; It is not about the local DNS caching as much as having local DNS
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; servers. &amp;nbsp;Too many fetchers hitting a centralized DNS server can act as
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; a DOS attack and slow down the entire fetching system.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; For example say I have a single centralized DNS server for my network.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; And say I have 2 map task per machine, 50 machines, 20 threads per task.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp;That would be 50 * 2 * 20 = 2000 fetchers. &amp;nbsp;Meaning a possibility of
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 2000 &amp;nbsp;DNS requests / sec. &amp;nbsp;Most local DNS servers for smaller networks
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; can't handle that. &amp;nbsp;If everything is hitting a centralized DNS and that
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; DNS takes 1-3 sec per request because of too many requests. &amp;nbsp;The entire
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; fetching system stalls.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Hitting a secondary larger cache, such as OpenDNS, can have an effect
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; because you are making one hop to get the name versus multiple hops to
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; root servers then domain servers.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Working off of a single server these issues don't show up as much
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; because there aren't enough fetchers.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Dennis Kubes
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Why would DNS local caching work... It only is working if you are
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; going to crawl often the same site ... In which case you are hit by
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; the politeness.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; if you have segments with only/mainly different sites it is not/really
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; going to help.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; So far I have not seen my quad core + 100mb/s + pseudo distributed
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; hadoop &amp;nbsp;going faster than 10 fetch / s... Let me check the DNS and I
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; will tell you.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; I vote for 100 Fetch/s not sure how to get it though
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; 2009/11/24, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26518193&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi Mark,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; I just put this up on the wiki. &amp;nbsp;Hope it helps:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://wiki.apache.org/nutch/OptimizingCrawls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/OptimizingCrawls&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark Kerzner wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi, guys,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; my goal is to do by crawls at 100 fetches per second, observing, of
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; course,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; polite crawling. But, when URLs are all different domains, what
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; theoretically would stop some software from downloading from 100
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; domains
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; at
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; once, achieving the desired speed?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; But, whatever I do, I can't make Nutch crawl at that speed. Even if it
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; starts at a few dozen URLs/second, it slows down at the end (as
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; discussed
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; by
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; many and by Krugler).
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Should I write something of my own, or are their fast crawlers?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Thanks!
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;DigitalPebble Ltd
&lt;br&gt;&lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26518193.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26518015</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T10:34:10Z</published>
	<updated>2009-11-25T10:34:10Z</updated>
	<author>
		<name>Dennis Kubes-2</name>
	</author>
	<content type="html">If it is waiting and the box is idle, my first thought is not DNS. &amp;nbsp;I 
&lt;br&gt;just put that up as one of the things people will run into. &amp;nbsp;Most likely 
&lt;br&gt;it is an uneven distribution of URLs or something like that.
&lt;br&gt;&lt;br&gt;Dennis
&lt;br&gt;&lt;br&gt;MilleBii wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Get your point... Although I thought high number of threads would do
&lt;br&gt;&amp;gt; exactly the same. Maybe I miss something.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; During my fetcher runs used bandwidth gets low pretty quickly, disk
&lt;br&gt;&amp;gt; I/O is low, the CPU is low... So it must be waiting for something but
&lt;br&gt;&amp;gt; what ?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Could be the DNS cache wich is full and any new request gets forwarded
&lt;br&gt;&amp;gt; to the master DNS of my ISP,
&lt;br&gt;&amp;gt; Any idea how to check that ? I'm not familiar with Bind myself... What
&lt;br&gt;&amp;gt; is the typical rate you can get how many dns request/s ?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 2009/11/25, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26518015&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt;&amp;gt; It is not about the local DNS caching as much as having local DNS
&lt;br&gt;&amp;gt;&amp;gt; servers. &amp;nbsp;Too many fetchers hitting a centralized DNS server can act as
&lt;br&gt;&amp;gt;&amp;gt; a DOS attack and slow down the entire fetching system.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; For example say I have a single centralized DNS server for my network.
&lt;br&gt;&amp;gt;&amp;gt; And say I have 2 map task per machine, 50 machines, 20 threads per task.
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; That would be 50 * 2 * 20 = 2000 fetchers. &amp;nbsp;Meaning a possibility of
&lt;br&gt;&amp;gt;&amp;gt; 2000 &amp;nbsp;DNS requests / sec. &amp;nbsp;Most local DNS servers for smaller networks
&lt;br&gt;&amp;gt;&amp;gt; can't handle that. &amp;nbsp;If everything is hitting a centralized DNS and that
&lt;br&gt;&amp;gt;&amp;gt; DNS takes 1-3 sec per request because of too many requests. &amp;nbsp;The entire
&lt;br&gt;&amp;gt;&amp;gt; fetching system stalls.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Hitting a secondary larger cache, such as OpenDNS, can have an effect
&lt;br&gt;&amp;gt;&amp;gt; because you are making one hop to get the name versus multiple hops to
&lt;br&gt;&amp;gt;&amp;gt; root servers then domain servers.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Working off of a single server these issues don't show up as much
&lt;br&gt;&amp;gt;&amp;gt; because there aren't enough fetchers.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Dennis Kubes
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Why would DNS local caching work... It only is working if you are
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; going to crawl often the same site ... In which case you are hit by
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; the politeness.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; if you have segments with only/mainly different sites it is not/really
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; going to help.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; So far I have not seen my quad core + 100mb/s + pseudo distributed
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; hadoop &amp;nbsp;going faster than 10 fetch / s... Let me check the DNS and I
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; will tell you.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; I vote for 100 Fetch/s not sure how to get it though
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 2009/11/24, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26518015&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi Mark,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; I just put this up on the wiki. &amp;nbsp;Hope it helps:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://wiki.apache.org/nutch/OptimizingCrawls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/OptimizingCrawls&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark Kerzner wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi, guys,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; my goal is to do by crawls at 100 fetches per second, observing, of
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; course,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; polite crawling. But, when URLs are all different domains, what
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; theoretically would stop some software from downloading from 100 domains
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; at
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; once, achieving the desired speed?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; But, whatever I do, I can't make Nutch crawl at that speed. Even if it
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; starts at a few dozen URLs/second, it slows down at the end (as
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; discussed
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; by
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; many and by Krugler).
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Should I write something of my own, or are their fast crawlers?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Thanks!
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; 
&lt;br&gt;&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26518015.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26517467</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T09:56:46Z</published>
	<updated>2009-11-25T09:56:46Z</updated>
	<author>
		<name>MilleBii</name>
	</author>
	<content type="html">I get your point... although I thought a high number of threads would do
&lt;br&gt;exactly the same. Maybe I am missing something.
&lt;br&gt;&lt;br&gt;During my fetcher runs, the bandwidth used drops pretty quickly, disk
&lt;br&gt;I/O is low, and the CPU is low... So it must be waiting for something, but
&lt;br&gt;what?
&lt;br&gt;&lt;br&gt;It could be that the DNS cache is full and any new request gets forwarded
&lt;br&gt;to the master DNS of my ISP.
&lt;br&gt;Any idea how to check that? I'm not familiar with BIND myself... What
&lt;br&gt;is the typical rate you can get, i.e. how many DNS requests/s?
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;2009/11/25, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26517467&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; It is not about the local DNS caching as much as having local DNS
&lt;br&gt;&amp;gt; servers. &amp;nbsp;Too many fetchers hitting a centralized DNS server can act as
&lt;br&gt;&amp;gt; a DOS attack and slow down the entire fetching system.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; For example say I have a single centralized DNS server for my network.
&lt;br&gt;&amp;gt; And say I have 2 map task per machine, 50 machines, 20 threads per task.
&lt;br&gt;&amp;gt; &amp;nbsp; That would be 50 * 2 * 20 = 2000 fetchers. &amp;nbsp;Meaning a possibility of
&lt;br&gt;&amp;gt; 2000 &amp;nbsp;DNS requests / sec. &amp;nbsp;Most local DNS servers for smaller networks
&lt;br&gt;&amp;gt; can't handle that. &amp;nbsp;If everything is hitting a centralized DNS and that
&lt;br&gt;&amp;gt; DNS takes 1-3 sec per request because of too many requests. &amp;nbsp;The entire
&lt;br&gt;&amp;gt; fetching system stalls.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Hitting a secondary larger cache, such as OpenDNS, can have an effect
&lt;br&gt;&amp;gt; because you are making one hop to get the name versus multiple hops to
&lt;br&gt;&amp;gt; root servers then domain servers.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Working off of a single server these issues don't show up as much
&lt;br&gt;&amp;gt; because there aren't enough fetchers.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Dennis Kubes
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; MilleBii wrote:
&lt;br&gt;&amp;gt;&amp;gt; Why would DNS local caching work... It only is working if you are
&lt;br&gt;&amp;gt;&amp;gt; going to crawl often the same site ... In which case you are hit by
&lt;br&gt;&amp;gt;&amp;gt; the politeness.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; if you have segments with only/mainly different sites it is not/really
&lt;br&gt;&amp;gt;&amp;gt; going to help.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; So far I have not seen my quad core + 100mb/s + pseudo distributed
&lt;br&gt;&amp;gt;&amp;gt; hadoop &amp;nbsp;going faster than 10 fetch / s... Let me check the DNS and I
&lt;br&gt;&amp;gt;&amp;gt; will tell you.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I vote for 100 Fetch/s not sure how to get it though
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009/11/24, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26517467&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Hi Mark,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; I just put this up on the wiki. &amp;nbsp;Hope it helps:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://wiki.apache.org/nutch/OptimizingCrawls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/OptimizingCrawls&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Mark Kerzner wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Hi, guys,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; my goal is to do by crawls at 100 fetches per second, observing, of
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; course,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; polite crawling. But, when URLs are all different domains, what
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; theoretically would stop some software from downloading from 100 domains
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; at
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; once, achieving the desired speed?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; But, whatever I do, I can't make Nutch crawl at that speed. Even if it
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; starts at a few dozen URLs/second, it slows down at the end (as
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; discussed
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; by
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; many and by Krugler).
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Should I write something of my own, or are their fast crawlers?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Thanks!
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; Mark
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;Sent from my mobile
&lt;br&gt;&lt;br&gt;-MilleBii-
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26517467.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26514994</id>
	<title>recrawl.sh stopped at depth 7/10 without error</title>
	<published>2009-11-25T07:43:33Z</published>
	<updated>2009-11-25T07:43:33Z</updated>
	<author>
		<name>miagomiago</name>
	</author>
	<content type="html">&lt;br&gt;&lt;br&gt;hi,
&lt;br&gt;&lt;br&gt;I'm running recrawl.sh and it stops every time at depth 7/10 without any error! But when I run bin/crawl with the same crawl-urlfilter and the same seeds file, it finishes smoothly in 1h50.
&lt;br&gt;&lt;br&gt;I checked hadoop.log and didn't find any error there... I only found the last URL it was parsing.
&lt;br&gt;Does fetching or crawling have a timeout?
&lt;br&gt;My recrawl runs for 2 hours before it stops. I set the fetch interval to 24 hours and I'm running generate with adddays = 1.
&lt;br&gt;&lt;br&gt;best regards</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/recrawl.sh-stopped-at-depth-7-10-without-error-tp26514994p26514994.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26514886</id>
	<title>Re: dedup dont delete duplicates !</title>
	<published>2009-11-25T07:37:24Z</published>
	<updated>2009-11-25T07:37:24Z</updated>
	<author>
		<name>Mischa@Garlik</name>
	</author>
	<content type="html">Ok, &amp;nbsp;my bad.
&lt;br&gt;&lt;br&gt;M
&lt;br&gt;On 25 Nov 2009, at 15:35, BELLINI ADAM wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; plz mischa, if your problem is not about delete duplicate just open another thread ! thx
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Andrzej, thx for all, i will try to run a diff command on the content of the 2 pages.
&lt;br&gt;&amp;gt; i will give you news when done.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; From: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26514886&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mischa.tuffield@...&lt;/a&gt;
&lt;br&gt;&amp;gt;&amp;gt; Subject: Re: dedup dont delete duplicates !
&lt;br&gt;&amp;gt;&amp;gt; Date: Wed, 25 Nov 2009 11:45:21 +0000
&lt;br&gt;&amp;gt;&amp;gt; To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26514886&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; Hello All, 
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; I am getting the following error in my hadoop.log (see below). It seems to happen everytime I run any of the nutch command line tools :(
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; &amp;lt;!--
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 11:42:49,299 INFO &amp;nbsp;crawl.Injector - Injector: done
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 11:42:49,302 DEBUG hdfs.DFSClient - LeaseChecker@DFSClient[clientName=DFSClient_-822770266, ugi=nutch,nutch]: java.lang.Throwable: for testing
&lt;br&gt;&amp;gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:992)
&lt;br&gt;&amp;gt;&amp;gt; 	at java.lang.String.valueOf(String.java:2827)
&lt;br&gt;&amp;gt;&amp;gt; 	at java.lang.StringBuilder.append(StringBuilder.java:115)
&lt;br&gt;&amp;gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:981)
&lt;br&gt;&amp;gt;&amp;gt; 	at java.lang.Thread.run(Thread.java:619)
&lt;br&gt;&amp;gt;&amp;gt; is interrupted.
&lt;br&gt;&amp;gt;&amp;gt; java.lang.InterruptedException: sleep interrupted
&lt;br&gt;&amp;gt;&amp;gt; 	at java.lang.Thread.sleep(Native Method)
&lt;br&gt;&amp;gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:978)
&lt;br&gt;&amp;gt;&amp;gt; 	at java.lang.Thread.run(Thread.java:619)
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; --&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; Does anyone know what problem I am having ?
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; Cheers, 
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; Mischa 
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; On 25 Nov 2009, at 09:15, Andrzej Bialecki wrote:
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; BELLINI ADAM wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; hi,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; my two urls points to the same page !
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Please, no need to shout ...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; If the MD5 signatures are different, then the binary content of these pages is different, period.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Use readseg -dump utility to retrieve the page content from the segment, extract just the two pages from the dump, and run a unix diff utility.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; can you tell m eplz more about TextProfileSignature ? how should i
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt; use it
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Configure this type of signature in your nutch-site.xml - please see the nutch-default.xml for instructions. Please note that you will have to re-parse segments and update the db in order to update the signatures.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; -- 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Best regards,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; [__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; ___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; ___________________________________
&lt;br&gt;&amp;gt;&amp;gt; Mischa Tuffield
&lt;br&gt;&amp;gt;&amp;gt; Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26514886&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mischa.tuffield@...&lt;/a&gt;
&lt;br&gt;&amp;gt;&amp;gt; Homepage - &lt;a href=&quot;http://mmt.me.uk/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://mmt.me.uk/&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt; Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
&lt;br&gt;&amp;gt;&amp;gt; +44(0)20 8973 2465 &amp;nbsp;&lt;a href=&quot;http://www.garlik.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.garlik.com/&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt; Registered in England and Wales 535 7233 VAT # 849 0517 11
&lt;br&gt;&amp;gt;&amp;gt; Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;/div&gt;&lt;br&gt;___________________________________
&lt;br&gt;Mischa Tuffield
&lt;br&gt;Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26514886&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mischa.tuffield@...&lt;/a&gt;
&lt;br&gt;Homepage - &lt;a href=&quot;http://mmt.me.uk/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://mmt.me.uk/&lt;/a&gt;&lt;br&gt;Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
&lt;br&gt;+44(0)20 8973 2465 &amp;nbsp;&lt;a href=&quot;http://www.garlik.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.garlik.com/&lt;/a&gt;&lt;br&gt;Registered in England and Wales 535 7233 VAT # 849 0517 11
&lt;br&gt;Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/dedup-dont-delete-duplicates-%21-tp26503122p26514886.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26514851</id>
	<title>RE: dedup dont delete duplicates !</title>
	<published>2009-11-25T07:35:26Z</published>
	<updated>2009-11-25T07:35:26Z</updated>
	<author>
		<name>miagomiago</name>
	</author>
	<content type="html">&lt;br&gt;Please, Mischa, if your problem is not about deleting duplicates, just open another thread! Thanks.
&lt;br&gt;&lt;br&gt;&lt;br&gt;Andrzej, thanks for everything. I will try running a diff command on the content of the 2 pages.
&lt;br&gt;I will let you know when it is done.
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; From: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26514851&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mischa.tuffield@...&lt;/a&gt;
&lt;br&gt;&amp;gt; Subject: Re: dedup dont delete duplicates !
&lt;br&gt;&amp;gt; Date: Wed, 25 Nov 2009 11:45:21 +0000
&lt;br&gt;&amp;gt; To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26514851&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Hello All, 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I am getting the following error in my hadoop.log (see below). It seems to happen everytime I run any of the nutch command line tools :(
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; &amp;lt;!--
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 2009-11-25 11:42:49,299 INFO &amp;nbsp;crawl.Injector - Injector: done
&lt;br&gt;&amp;gt; 2009-11-25 11:42:49,302 DEBUG hdfs.DFSClient - LeaseChecker@DFSClient[clientName=DFSClient_-822770266, ugi=nutch,nutch]: java.lang.Throwable: for testing
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:992)
&lt;br&gt;&amp;gt; 	at java.lang.String.valueOf(String.java:2827)
&lt;br&gt;&amp;gt; 	at java.lang.StringBuilder.append(StringBuilder.java:115)
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:981)
&lt;br&gt;&amp;gt; 	at java.lang.Thread.run(Thread.java:619)
&lt;br&gt;&amp;gt; &amp;nbsp;is interrupted.
&lt;br&gt;&amp;gt; java.lang.InterruptedException: sleep interrupted
&lt;br&gt;&amp;gt; 	at java.lang.Thread.sleep(Native Method)
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:978)
&lt;br&gt;&amp;gt; 	at java.lang.Thread.run(Thread.java:619)
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; --&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Does anyone know what problem I am having ?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Cheers, 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Mischa 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; On 25 Nov 2009, at 09:15, Andrzej Bialecki wrote:
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; BELLINI ADAM wrote:
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; hi,
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; my two urls points to the same page !
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; Please, no need to shout ...
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; If the MD5 signatures are different, then the binary content of these pages is different, period.
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; Use readseg -dump utility to retrieve the page content from the segment, extract just the two pages from the dump, and run a unix diff utility.
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; can you tell m eplz more about TextProfileSignature ? how should i
&lt;br&gt;&amp;gt; &amp;gt;&amp;gt; use it
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; Configure this type of signature in your nutch-site.xml - please see the nutch-default.xml for instructions. Please note that you will have to re-parse segments and update the db in order to update the signatures.
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; -- 
&lt;br&gt;&amp;gt; &amp;gt; Best regards,
&lt;br&gt;&amp;gt; &amp;gt; Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt; &amp;gt; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;&amp;gt; &amp;gt; [__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt; &amp;gt; ___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&amp;gt; &amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; ___________________________________
&lt;br&gt;&amp;gt; Mischa Tuffield
&lt;br&gt;&amp;gt; Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26514851&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mischa.tuffield@...&lt;/a&gt;
&lt;br&gt;&amp;gt; Homepage - &lt;a href=&quot;http://mmt.me.uk/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://mmt.me.uk/&lt;/a&gt;&lt;br&gt;&amp;gt; Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
&lt;br&gt;&amp;gt; +44(0)20 8973 2465 &amp;nbsp;&lt;a href=&quot;http://www.garlik.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.garlik.com/&lt;/a&gt;&lt;br&gt;&amp;gt; Registered in England and Wales 535 7233 VAT # 849 0517 11
&lt;br&gt;&amp;gt; Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
&lt;br&gt;&amp;gt; 
&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/dedup-dont-delete-duplicates-%21-tp26503122p26514851.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26512445</id>
	<title>Re: Nutch config IOException</title>
	<published>2009-11-25T05:15:03Z</published>
	<updated>2009-11-25T05:15:03Z</updated>
	<author>
		<name>Mischa@Garlik</name>
	</author>
	<content type="html">Hi Andrzej, 
&lt;br&gt;&lt;br&gt;Yeah, I just noticed that this stack trace is for DEBUG purposes only; I found it in the Hadoop source. Thanks for the info. 
&lt;br&gt;&lt;br&gt;Regards, 
&lt;br&gt;&lt;br&gt;Mischa
&lt;br&gt;On 25 Nov 2009, at 13:11, Andrzej Bialecki wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Mischa Tuffield wrote:
&lt;br&gt;&amp;gt;&amp;gt; Hello Again, Following my previous post below, I have noticed that I get the following IOException every time I atttempt to use nutch. &amp;lt;!--
&lt;br&gt;&amp;gt;&amp;gt; 2009-11-25 12:19:18,760 DEBUG conf.Configuration - java.io.IOException: config()
&lt;br&gt;&amp;gt;&amp;gt; 	at org.apache.hadoop.conf.Configuration.&amp;lt;init&amp;gt;(Configuration.java:176)
&lt;br&gt;&amp;gt;&amp;gt; 	at org.apache.hadoop.conf.Configuration.&amp;lt;init&amp;gt;(Configuration.java:164)
&lt;br&gt;&amp;gt;&amp;gt; 	at org.apache.hadoop.hdfs.protocol.FSConstants.&amp;lt;clinit&amp;gt;(FSConstants.java:51)
&lt;br&gt;&amp;gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2757)
&lt;br&gt;&amp;gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
&lt;br&gt;&amp;gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
&lt;br&gt;&amp;gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
&lt;br&gt;&amp;gt;&amp;gt; --&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Any pointers would be great. Is there a way for me to validate my conf options before I deploy nutch?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; This exception is innocuous - it helps to debug at which points in the code the Configuration instances are being created. And you wouldn't have seen this if you didn't turn on the DEBUG logging. ;)
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; -- 
&lt;br&gt;&amp;gt; Best regards,
&lt;br&gt;&amp;gt; Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;&amp;gt; [__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt; ___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&amp;gt; 
&lt;/div&gt;&lt;br&gt;___________________________________
&lt;br&gt;Mischa Tuffield
&lt;br&gt;Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26512445&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mischa.tuffield@...&lt;/a&gt;
&lt;br&gt;Homepage - &lt;a href=&quot;http://mmt.me.uk/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://mmt.me.uk/&lt;/a&gt;&lt;br&gt;Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
&lt;br&gt;+44(0)20 8973 2465 &amp;nbsp;&lt;a href=&quot;http://www.garlik.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.garlik.com/&lt;/a&gt;&lt;br&gt;Registered in England and Wales 535 7233 VAT # 849 0517 11
&lt;br&gt;Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/dedup-dont-delete-duplicates-%21-tp26503122p26512445.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26512399</id>
	<title>Re: Nutch config IOException</title>
	<published>2009-11-25T05:11:37Z</published>
	<updated>2009-11-25T05:11:37Z</updated>
	<author>
		<name>Andrzej Bialecki</name>
	</author>
	<content type="html">Mischa Tuffield wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hello Again, 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Following my previous post below, I have noticed that I get the following IOException every time I attempt to use nutch. 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; &amp;lt;!--
&lt;br&gt;&amp;gt; 2009-11-25 12:19:18,760 DEBUG conf.Configuration - java.io.IOException: config()
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.conf.Configuration.&amp;lt;init&amp;gt;(Configuration.java:176)
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.conf.Configuration.&amp;lt;init&amp;gt;(Configuration.java:164)
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.protocol.FSConstants.&amp;lt;clinit&amp;gt;(FSConstants.java:51)
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2757)
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; --&amp;gt;
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Any pointers would be great. Is there a way for me to validate my conf options before I deploy nutch?
&lt;/div&gt;&lt;br&gt;This exception is innocuous - it helps to debug at which points in the 
&lt;br&gt;code the Configuration instances are being created. And you wouldn't 
&lt;br&gt;have seen this if you didn't turn on the DEBUG logging. ;)
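&lt;br&gt;&lt;br&gt;A rough Python analogue of that pattern, as a sketch only: the class and logger names below are made up for illustration, and this is not Hadoop's actual code. The constructor records where it was called from, but only when DEBUG logging is enabled, so the trace is a breadcrumb and not a failure.

```python
import logging
import traceback

log = logging.getLogger("conf.Configuration")  # hypothetical logger name


class Configuration:
    """Toy stand-in for a config object that traces its own creation."""

    def __init__(self):
        # Capture the call stack only if DEBUG is on, so normal runs
        # neither pay for the capture nor see the trace in their logs.
        if log.isEnabledFor(logging.DEBUG):
            stack = "".join(traceback.format_stack())
            log.debug("config() created at:\n%s", stack)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    Configuration()              # silent at INFO
    log.setLevel(logging.DEBUG)
    Configuration()              # logs a stack trace, yet nothing has failed
```

&lt;br&gt;Raising the log level back to INFO makes the trace disappear again, which is the quickest way to confirm it is only diagnostic output.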
&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;Best regards,
&lt;br&gt;Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;nbsp; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;[__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/dedup-dont-delete-duplicates-%21-tp26503122p26512399.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26512184</id>
	<title>Re: 100 fetches per second?</title>
	<published>2009-11-25T04:57:05Z</published>
	<updated>2009-11-25T04:57:05Z</updated>
	<author>
		<name>Dennis Kubes-2</name>
	</author>
	<content type="html">It is not about the local DNS caching as much as having local DNS 
&lt;br&gt;servers. &amp;nbsp;Too many fetchers hitting a centralized DNS server can act as 
&lt;br&gt;a DoS attack and slow down the entire fetching system.
&lt;br&gt;&lt;br&gt;For example, say I have a single centralized DNS server for my network, 
&lt;br&gt;and say I have 2 map tasks per machine, 50 machines, and 20 threads per task. 
&lt;br&gt;&amp;nbsp; That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility of 
&lt;br&gt;2000 &amp;nbsp;DNS requests / sec. &amp;nbsp;Most local DNS servers for smaller networks 
&lt;br&gt;can't handle that. &amp;nbsp;If everything is hitting a centralized DNS server and that 
&lt;br&gt;server takes 1-3 sec per request because of too many requests, the entire 
&lt;br&gt;fetching system stalls.
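&lt;br&gt;&lt;br&gt;The arithmetic above can be sketched in a few lines of Python. The function names and the DNS server capacity of 200 requests/sec are assumptions chosen purely for illustration; measure your own resolver before drawing conclusions.

```python
def peak_dns_requests(machines, tasks_per_machine, threads_per_task):
    """Worst case: every fetcher thread issues a DNS lookup at the same time."""
    return machines * tasks_per_machine * threads_per_task


def overload_factor(requests_per_sec, server_capacity_per_sec):
    """How many times the offered load exceeds what the DNS server can answer."""
    return requests_per_sec / server_capacity_per_sec


if __name__ == "__main__":
    fetchers = peak_dns_requests(machines=50, tasks_per_machine=2,
                                 threads_per_task=20)
    print("fetcher threads:", fetchers)
    # 200 requests/sec is a made-up capacity for a small local DNS server.
    print("overload factor:", overload_factor(fetchers, 200))
```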
&lt;br&gt;&lt;br&gt;Hitting a secondary larger cache, such as OpenDNS, can have an effect 
&lt;br&gt;because you are making one hop to get the name versus multiple hops to 
&lt;br&gt;root servers then domain servers.
&lt;br&gt;&lt;br&gt;Working off of a single server these issues don't show up as much 
&lt;br&gt;because there aren't enough fetchers.
&lt;br&gt;&lt;br&gt;Dennis Kubes
&lt;br&gt;&lt;br&gt;MilleBii wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Why would local DNS caching work... It only works if you are
&lt;br&gt;&amp;gt; going to crawl the same site often ... in which case you are hit by
&lt;br&gt;&amp;gt; the politeness.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; If you have segments with only/mainly different sites, it is not really
&lt;br&gt;&amp;gt; going to help.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
&lt;br&gt;&amp;gt; Hadoop going faster than 10 fetches/s... Let me check the DNS and I
&lt;br&gt;&amp;gt; will tell you.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I vote for 100 fetches/s; not sure how to get it, though.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 2009/11/24, Dennis Kubes &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26512184&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kubes@...&lt;/a&gt;&amp;gt;:
&lt;br&gt;&amp;gt;&amp;gt; Hi Mark,
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I just put this up on the wiki. &amp;nbsp;Hope it helps:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://wiki.apache.org/nutch/OptimizingCrawls&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/OptimizingCrawls&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Dennis
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Mark Kerzner wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Hi, guys,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; my goal is to do my crawls at 100 fetches per second, observing, of
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; course,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; polite crawling. But, when URLs are all different domains, what
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; theoretically would stop some software from downloading from 100 domains
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; at
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; once, achieving the desired speed?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; But, whatever I do, I can't make Nutch crawl at that speed. Even if it
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; starts at a few dozen URLs/second, it slows down at the end (as discussed
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; by
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; many and by Krugler).
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Should I write something of my own, or are there fast crawlers?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Thanks!
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Mark
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt; 
&lt;br&gt;&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/100-fetches-per-second--tp26490858p26512184.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26511735</id>
	<title>Nutch config IOException</title>
	<published>2009-11-25T04:21:25Z</published>
	<updated>2009-11-25T04:21:25Z</updated>
	<author>
		<name>Mischa@Garlik</name>
	</author>
	<content type="html">Hello Again, 
&lt;br&gt;&lt;br&gt;Following my previous post below, I have noticed that I get the following IOException every time I attempt to use nutch. 
&lt;br&gt;&lt;br&gt;&amp;lt;!--
&lt;br&gt;2009-11-25 12:19:18,760 DEBUG conf.Configuration - java.io.IOException: config()
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.conf.Configuration.&amp;lt;init&amp;gt;(Configuration.java:176)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.conf.Configuration.&amp;lt;init&amp;gt;(Configuration.java:164)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.protocol.FSConstants.&amp;lt;clinit&amp;gt;(FSConstants.java:51)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2757)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
&lt;br&gt;&lt;br&gt;--&amp;gt;
&lt;br&gt;&lt;br&gt;Any pointers would be great. Is there a way for me to validate my conf options before I deploy nutch?
&lt;br&gt;&lt;br&gt;Regards, 
&lt;br&gt;&lt;br&gt;Mischa
&lt;br&gt;On 25 Nov 2009, at 11:45, Mischa Tuffield wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hello All, 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I am getting the following error in my hadoop.log (see below). It seems to happen every time I run any of the nutch command line tools :(
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; &amp;lt;!--
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 2009-11-25 11:42:49,299 INFO &amp;nbsp;crawl.Injector - Injector: done
&lt;br&gt;&amp;gt; 2009-11-25 11:42:49,302 DEBUG hdfs.DFSClient - LeaseChecker@DFSClient[clientName=DFSClient_-822770266, ugi=nutch,nutch]: java.lang.Throwable: for testing
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:992)
&lt;br&gt;&amp;gt; 	at java.lang.String.valueOf(String.java:2827)
&lt;br&gt;&amp;gt; 	at java.lang.StringBuilder.append(StringBuilder.java:115)
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:981)
&lt;br&gt;&amp;gt; 	at java.lang.Thread.run(Thread.java:619)
&lt;br&gt;&amp;gt; is interrupted.
&lt;br&gt;&amp;gt; java.lang.InterruptedException: sleep interrupted
&lt;br&gt;&amp;gt; 	at java.lang.Thread.sleep(Native Method)
&lt;br&gt;&amp;gt; 	at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:978)
&lt;br&gt;&amp;gt; 	at java.lang.Thread.run(Thread.java:619)
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; --&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Does anyone know what problem I am having?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Cheers, 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Mischa 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; On 25 Nov 2009, at 09:15, Andrzej Bialecki wrote:
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; BELLINI ADAM wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; hi,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; my two URLs point to the same page!
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; Please, no need to shout ...
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; If the MD5 signatures are different, then the binary content of these pages is different, period.
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; Use readseg -dump utility to retrieve the page content from the segment, extract just the two pages from the dump, and run a unix diff utility.
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Can you please tell me more about TextProfileSignature? How should I
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; use it?
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; Configure this type of signature in your nutch-site.xml - please see the nutch-default.xml for instructions. Please note that you will have to re-parse segments and update the db in order to update the signatures.
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; -- 
&lt;br&gt;&amp;gt;&amp;gt; Best regards,
&lt;br&gt;&amp;gt;&amp;gt; Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt;&amp;gt; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;&amp;gt;&amp;gt; [__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt;&amp;gt; ___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&amp;gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; ___________________________________
&lt;br&gt;&amp;gt; Mischa Tuffield
&lt;br&gt;&amp;gt; Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26511735&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mischa.tuffield@...&lt;/a&gt;
&lt;br&gt;&amp;gt; Homepage - &lt;a href=&quot;http://mmt.me.uk/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://mmt.me.uk/&lt;/a&gt;&lt;br&gt;&amp;gt; Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
&lt;br&gt;&amp;gt; +44(0)20 8973 2465 &amp;nbsp;&lt;a href=&quot;http://www.garlik.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.garlik.com/&lt;/a&gt;&lt;br&gt;&amp;gt; Registered in England and Wales 535 7233 VAT # 849 0517 11
&lt;br&gt;&amp;gt; Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
&lt;br&gt;&amp;gt; 
&lt;/div&gt;&lt;br&gt;___________________________________
&lt;br&gt;Mischa Tuffield
&lt;br&gt;Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26511735&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mischa.tuffield@...&lt;/a&gt;
&lt;br&gt;Homepage - &lt;a href=&quot;http://mmt.me.uk/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://mmt.me.uk/&lt;/a&gt;&lt;br&gt;Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
&lt;br&gt;+44(0)20 8973 2465 &amp;nbsp;&lt;a href=&quot;http://www.garlik.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.garlik.com/&lt;/a&gt;&lt;br&gt;Registered in England and Wales 535 7233 VAT # 849 0517 11
&lt;br&gt;Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/dedup-dont-delete-duplicates-%21-tp26503122p26511735.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26511343</id>
	<title>Re: dedup dont delete duplicates !</title>
	<published>2009-11-25T03:45:21Z</published>
	<updated>2009-11-25T03:45:21Z</updated>
	<author>
		<name>Mischa@Garlik</name>
	</author>
	<content type="html">Hello All, 
&lt;br&gt;&lt;br&gt;I am getting the following error in my hadoop.log (see below). It seems to happen every time I run any of the nutch command line tools :(
&lt;br&gt;&lt;br&gt;&amp;lt;!--
&lt;br&gt;&lt;br&gt;2009-11-25 11:42:49,299 INFO &amp;nbsp;crawl.Injector - Injector: done
&lt;br&gt;2009-11-25 11:42:49,302 DEBUG hdfs.DFSClient - LeaseChecker@DFSClient[clientName=DFSClient_-822770266, ugi=nutch,nutch]: java.lang.Throwable: for testing
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:992)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.String.valueOf(String.java:2827)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.StringBuilder.append(StringBuilder.java:115)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:981)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:619)
&lt;br&gt;&amp;nbsp;is interrupted.
&lt;br&gt;java.lang.InterruptedException: sleep interrupted
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.Thread.sleep(Native Method)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:978)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:619)
&lt;br&gt;&lt;br&gt;--&amp;gt; 
&lt;br&gt;&lt;br&gt;Does anyone know what problem I am having?
&lt;br&gt;&lt;br&gt;Cheers, 
&lt;br&gt;&lt;br&gt;Mischa 
&lt;br&gt;&lt;br&gt;On 25 Nov 2009, at 09:15, Andrzej Bialecki wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; BELLINI ADAM wrote:
&lt;br&gt;&amp;gt;&amp;gt; hi,
&lt;br&gt;&amp;gt;&amp;gt; my two URLs point to the same page!
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Please, no need to shout ...
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; If the MD5 signatures are different, then the binary content of these pages is different, period.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Use readseg -dump utility to retrieve the page content from the segment, extract just the two pages from the dump, and run a unix diff utility.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt;&amp;gt; Can you please tell me more about TextProfileSignature? How should I
&lt;br&gt;&amp;gt;&amp;gt; use it?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Configure this type of signature in your nutch-site.xml - please see the nutch-default.xml for instructions. Please note that you will have to re-parse segments and update the db in order to update the signatures.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; -- 
&lt;br&gt;&amp;gt; Best regards,
&lt;br&gt;&amp;gt; Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;&amp;gt; [__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt; ___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&amp;gt; 
&lt;/div&gt;&lt;br&gt;___________________________________
&lt;br&gt;Mischa Tuffield
&lt;br&gt;Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26511343&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mischa.tuffield@...&lt;/a&gt;
&lt;br&gt;Homepage - &lt;a href=&quot;http://mmt.me.uk/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://mmt.me.uk/&lt;/a&gt;&lt;br&gt;Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
&lt;br&gt;+44(0)20 8973 2465 &amp;nbsp;&lt;a href=&quot;http://www.garlik.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.garlik.com/&lt;/a&gt;&lt;br&gt;Registered in England and Wales 535 7233 VAT # 849 0517 11
&lt;br&gt;Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/dedup-dont-delete-duplicates-%21-tp26503122p26511343.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26510746</id>
	<title>Re: dedup dont delete duplicates !</title>
	<published>2009-11-25T02:52:18Z</published>
	<updated>2009-11-25T02:52:18Z</updated>
	<author>
		<name>reinhard schwab</name>
	</author>
	<content type="html">Andrzej Bialecki wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; BELLINI ADAM wrote:
&lt;br&gt;&amp;gt;&amp;gt; hi,
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; my two URLs point to the same page!
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Please, no need to shout ...
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; If the MD5 signatures are different, then the binary content of these
&lt;br&gt;&amp;gt; pages is different, period.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Use readseg -dump utility to retrieve the page content from the
&lt;br&gt;&amp;gt; segment, extract just the two pages from the dump, and run a unix diff
&lt;br&gt;&amp;gt; utility.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Can you please tell me more about TextProfileSignature? How should I
&lt;br&gt;&amp;gt;&amp;gt; use it?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Configure this type of signature in your nutch-site.xml - please see
&lt;br&gt;&amp;gt; the nutch-default.xml for instructions. Please note that you will have
&lt;br&gt;&amp;gt; to re-parse segments and update the db in order to update the signatures.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;a href=&quot;http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/TextProfileSignature.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/TextProfileSignature.html&lt;/a&gt;&lt;br&gt;&lt;br&gt;The class documentation describes how the algorithm works.
&lt;br&gt;It is based on the frequencies of tokens.
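&lt;br&gt;&lt;br&gt;To make the idea concrete, here is a simplified Python sketch of a token-frequency signature. It is loosely modelled on what the class documentation describes and is not the Nutch implementation; the function name, parameter names and default values are assumptions for illustration. Frequencies are rounded down to a coarse step so that small edits, such as a changed date or counter on the page, do not change the hash.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, min_token_len=3, quant_rate=0.01):
    """MD5 hex digest of a quantized token-frequency profile of text."""
    # Tokens are lower-cased runs of letters/digits; shorter runs are skipped.
    pattern = "[a-z0-9]{%d,}" % min_token_len
    counts = Counter(re.findall(pattern, text.lower()))
    if not counts:
        return hashlib.md5(b"").hexdigest()
    # Quantization step: a fraction of the top frequency, but at least 2 once
    # any token repeats, so tokens seen only once drop out of the profile.
    max_freq = max(counts.values())
    quant = max(round(max_freq * quant_rate), min(max_freq, 2))
    # Round each count down to a multiple of quant; zero means "discard".
    profile = [(tok, (n // quant) * quant) for tok, n in counts.items()]
    profile = [(tok, q) for tok, q in profile if q]
    # Highest frequency first, ties broken alphabetically for determinism.
    profile.sort(key=lambda item: (-item[1], item[0]))
    joined = "\n".join("%s %d" % item for item in profile)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    page_a = "nutch crawl nutch crawl fetch"
    page_b = "nutch crawl nutch crawl index"   # differs in one rare token
    print(profile_signature(page_a) == profile_signature(page_b))
```

&lt;br&gt;An exact MD5 of the raw bytes would treat these two pages as different, which is why near-identical pages survive dedup with the default signature.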
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/dedup-dont-delete-duplicates-%21-tp26503122p26510746.html" />
</entry>

</feed>
