<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<id>tag:old.nabble.com,2006:forum-375</id>
	<title>Nabble - Nutch - User</title>
	<updated>2009-11-10T20:00:52Z</updated>
	<link rel="self" type="application/atom+xml" href="http://old.nabble.com/Nutch---User-f375.xml" />
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch---User-f375.html" />
	<subtitle type="html"></subtitle>
	
<entry>
	<id>tag:old.nabble.com,2006:post-26295712</id>
	<title>nutch search yields 0 results</title>
	<published>2009-11-10T20:00:52Z</published>
	<updated>2009-11-10T20:00:52Z</updated>
	<author>
		<name>kvorion</name>
	</author>
	<content type="html">Hi all...
&lt;br&gt;&lt;br&gt;I was finally able to set up a multinode nutch cluster that seemed to work fine. When I set it up to do the example crawl of &lt;a href=&quot;http://lucene.apache.org&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org&lt;/a&gt;&amp;nbsp;then the crawl seemed to finish successfully as indicated by the output on the console. When I copied the index files on to the local filesystem from the dfs and viewed the web application of nutch.war using tomcat on localhost:8080, all searches yield 0 results. The log files do not indicate any exceptions. I don't understand what is wrong. Please help. Thanks in advance.
&lt;br&gt;&lt;br&gt;nutch@athena:/nutch/search$ bin/nutch crawl urls -dir crawled -depth 2
&lt;br&gt;crawl started in: crawled
&lt;br&gt;rootUrlDir = urls
&lt;br&gt;threads = 10
&lt;br&gt;depth = 2
&lt;br&gt;Injector: starting
&lt;br&gt;Injector: crawlDb: crawled/crawldb
&lt;br&gt;Injector: urlDir: urls
&lt;br&gt;Injector: Converting injected urls to crawl db entries.
&lt;br&gt;Injector: Merging injected urls into crawl db.
&lt;br&gt;Injector: done
&lt;br&gt;Generator: Selecting best-scoring urls due for fetch.
&lt;br&gt;Generator: starting
&lt;br&gt;Generator: segment: crawled/segments/20091110223659
&lt;br&gt;Generator: filtering: true
&lt;br&gt;Generator: Partitioning selected urls by host, for politeness.
&lt;br&gt;Generator: done.
&lt;br&gt;Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
&lt;br&gt;Fetcher: starting
&lt;br&gt;Fetcher: segment: crawled/segments/20091110223659
&lt;br&gt;Fetcher: done
&lt;br&gt;CrawlDb update: starting
&lt;br&gt;CrawlDb update: db: crawled/crawldb
&lt;br&gt;CrawlDb update: segments: [crawled/segments/20091110223659]
&lt;br&gt;CrawlDb update: additions allowed: true
&lt;br&gt;CrawlDb update: URL normalizing: true
&lt;br&gt;CrawlDb update: URL filtering: true
&lt;br&gt;CrawlDb update: Merging segment data into db.
&lt;br&gt;CrawlDb update: done
&lt;br&gt;Generator: Selecting best-scoring urls due for fetch.
&lt;br&gt;Generator: starting
&lt;br&gt;Generator: segment: crawled/segments/20091110223831
&lt;br&gt;Generator: filtering: true
&lt;br&gt;Generator: Partitioning selected urls by host, for politeness.
&lt;br&gt;Generator: done.
&lt;br&gt;Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
&lt;br&gt;Fetcher: starting
&lt;br&gt;Fetcher: segment: crawled/segments/20091110223831
&lt;br&gt;Fetcher: done
&lt;br&gt;CrawlDb update: starting
&lt;br&gt;CrawlDb update: db: crawled/crawldb
&lt;br&gt;CrawlDb update: segments: [crawled/segments/20091110223831]
&lt;br&gt;CrawlDb update: additions allowed: true
&lt;br&gt;CrawlDb update: URL normalizing: true
&lt;br&gt;CrawlDb update: URL filtering: true
&lt;br&gt;CrawlDb update: Merging segment data into db.
&lt;br&gt;CrawlDb update: done
&lt;br&gt;LinkDb: starting
&lt;br&gt;LinkDb: linkdb: crawled/linkdb
&lt;br&gt;LinkDb: URL normalize: true
&lt;br&gt;LinkDb: URL filter: true
&lt;br&gt;LinkDb: adding segment: hdfs://athena:9000/user/nutch/crawled/segments/20091110223659
&lt;br&gt;LinkDb: adding segment: hdfs://athena:9000/user/nutch/crawled/segments/20091110223831
&lt;br&gt;LinkDb: done
&lt;br&gt;Indexer: starting
&lt;br&gt;Indexer: done
&lt;br&gt;Dedup: starting
&lt;br&gt;Dedup: adding indexes in: crawled/indexes
&lt;br&gt;Dedup: done
&lt;br&gt;merging indexes to: crawled/index
&lt;br&gt;Adding hdfs://athena:9000/user/nutch/crawled/indexes/part-00000
&lt;br&gt;done merging
&lt;br&gt;crawl finished: crawled
&lt;br&gt;&lt;br&gt;In addition, my crawl-urlfilter.txt has the following:
&lt;br&gt;&lt;br&gt;# accept hosts in MY.DOMAIN.NAME
&lt;br&gt;+^http://([a-z0-9]*\.)*apache.org/
&lt;br&gt;&lt;br&gt;Also, the urllist.txt file has 
&lt;br&gt;&lt;a href=&quot;http://lucene.apache.org&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org&lt;/a&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/nutch-search-yields-0-results-tp26295712p26295712.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26295630</id>
	<title>Nutch 0.20</title>
	<published>2009-11-10T19:47:36Z</published>
	<updated>2009-11-10T19:47:36Z</updated>
	<author>
		<name>John Martyniak-4</name>
	</author>
	<content type="html">Hi everyone,
&lt;br&gt;&lt;br&gt;Does anybody know of a good way or is it possible to run nutch on &amp;nbsp;
&lt;br&gt;Hadoop 0.20.x?
&lt;br&gt;&lt;br&gt;thank you,
&lt;br&gt;&lt;br&gt;-John
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch-0.20-tp26295630p26295630.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26294062</id>
	<title>Apache Hadoop Get Together Berlin - December 2009</title>
	<published>2009-11-10T16:35:28Z</published>
	<updated>2009-11-10T16:35:28Z</updated>
	<author>
		<name>Isabel Drost-4</name>
	</author>
	<content type="html">&lt;br&gt;As announced at ApacheCon US, the next Apache Hadoop Get Together Berlin is 
&lt;br&gt;scheduled for December 2009.
&lt;br&gt;&lt;br&gt;When: Wednesday December 16, 2009 &amp;nbsp;at 5:00pm 
&lt;br&gt;Where: newthinking store, Tucholskystr. 48, Berlin
&lt;br&gt;&lt;br&gt;As always there will be slots of 20min each for talks on your Hadoop topic. 
&lt;br&gt;After each talk there will be plenty of time to discuss. You can order drinks 
&lt;br&gt;directly at the bar in the newthinking store. If you like, you can order 
&lt;br&gt;pizza. We will go to Cafe Aufsturz after the event for some beer and 
&lt;br&gt;something to eat.
&lt;br&gt;&lt;br&gt;Talks scheduled so far:
&lt;br&gt;&lt;br&gt;Richard Hutton (nugg.ad): &amp;quot;Moving from five days to one hour.&amp;quot; - This talk 
&lt;br&gt;explains how we made data processing scalable at nugg.ad. The company's core 
&lt;br&gt;business is online advertisement targeting. Our servers receive 10,000 
&lt;br&gt;requests per second resulting in data of 100GB per day.
&lt;br&gt;&lt;br&gt;As the classical data warehouse solution reached its limit, we moved to a 
&lt;br&gt;framework built on top of Hadoop to make analytics speedy, data mining 
&lt;br&gt;detailed and all of our lives easier. We will give an overview of our 
&lt;br&gt;solution involving file system structures, scheduling, messaging and 
&lt;br&gt;programming languages from the future.
&lt;br&gt;&lt;br&gt;Jörg Möllenkamp (Sun): &amp;quot;Hadoop on Sun&amp;quot;
&lt;br&gt;Abstract: Hadoop is a well known technology inside of Sun. This talk aims to 
&lt;br&gt;show some interesting use cases of Hadoop in conjunction with Sun 
&lt;br&gt;technologies. The first use case demonstrates how Hadoop can be used to 
&lt;br&gt;load a massive multicore system with up to 256 threads in a single system to 
&lt;br&gt;the max. The second use case shows how several mechanisms integrated in 
&lt;br&gt;Solaris can ease the deployment and operation of Hadoop even in non-dedicated 
&lt;br&gt;environments. The last use case will show the combination of the Sun Grid 
&lt;br&gt;Engine and Hadoop. Talk may contain command-line demonstrations ;).
&lt;br&gt;&lt;br&gt;Nikolaus Pohle (nurago): &amp;quot;M/R for MR - Online Market Research powered by 
&lt;br&gt;Apache Hadoop. Enable consultants to analyze online behavior for audience 
&lt;br&gt;segmentation, advertising effects and usage patterns.&amp;quot;
&lt;br&gt;&lt;br&gt;We would like to invite you, the visitor, to also tell your Hadoop story. If 
&lt;br&gt;you like, you can bring slides - there will be a beamer.
&lt;br&gt;&lt;br&gt;A big thanks goes to the newthinking store for providing a room in the center 
&lt;br&gt;of Berlin for us. Another big thanks goes to StudiVZ for sponsoring videos of 
&lt;br&gt;the talks. Links to the videos will be posted here as well as on the StudiVZ 
&lt;br&gt;blog.
&lt;br&gt;&lt;br&gt;Please do indicate on the following upcoming event if you are planning to 
&lt;br&gt;attend to make planning (and booking tables at Aufsturz) easier:
&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://upcoming.yahoo.com/event/4842528/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://upcoming.yahoo.com/event/4842528/&lt;/a&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;Looking forward to seeing you in Berlin,
&lt;br&gt;Isabel
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;&amp;nbsp; |\ &amp;nbsp; &amp;nbsp; &amp;nbsp;_,,,---,,_ &amp;nbsp; &amp;nbsp; &amp;nbsp; Web: &amp;nbsp; &amp;lt;&lt;a href=&quot;http://www.isabel-drost.de&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.isabel-drost.de&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;nbsp; /,`.-'`' &amp;nbsp; &amp;nbsp;-. &amp;nbsp;;-;;,_ &amp;nbsp;
&lt;br&gt;&amp;nbsp;|,4- &amp;nbsp;) )-,_..;\ ( &amp;nbsp;`'-' 
&lt;br&gt;'---''(_/--' &amp;nbsp;`-'\_) (fL) &amp;nbsp;IM: &amp;nbsp;&amp;lt;xmpp://&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26294062&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;MaineC.@...&lt;/a&gt;&amp;gt;
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Apache-Hadoop-Get-Together-Berlin---December-2009-tp26294062p26294062.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26294013</id>
	<title>Re: How to make a Lucene-built index work with Nutch?</title>
	<published>2009-11-10T16:30:30Z</published>
	<updated>2009-11-10T16:30:30Z</updated>
	<author>
		<name>Fadzi Ushewokunze-2</name>
	</author>
	<content type="html">&lt;br&gt;hi,
&lt;br&gt;&lt;br&gt;not sure if this will work in your case;
&lt;br&gt;&lt;br&gt;but in a nutshell -
&lt;br&gt;&lt;br&gt;first create a nutch index by crawling some urls.
&lt;br&gt;&lt;br&gt;open both indexes, i.e.
&lt;br&gt;IndexReader r = IndexReader.open(nutch_index);
&lt;br&gt;IndexReader r2 = IndexReader.open(your_index);
&lt;br&gt;&lt;br&gt;then write a new index:
&lt;br&gt;&lt;br&gt;IndexWriter writer = new IndexWriter(new_index,...);
&lt;br&gt;writer.addIndexes(new IndexReader[]{r,r2});
&lt;br&gt;writer.close();
&lt;br&gt;&lt;br&gt;replace your nutch index with new_index.
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I have built a custom index using Lucene, as the data source is not web
&lt;br&gt;&amp;gt; pages
&lt;br&gt;&amp;gt; but some custom text files. I stored several indexed fields in my index.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Now my colleague requires me to make sure this index works with Nutch. So
&lt;br&gt;&amp;gt; I
&lt;br&gt;&amp;gt; set up Nutch on the server, followed the steps, crawled some pages and
&lt;br&gt;&amp;gt; build
&lt;br&gt;&amp;gt; an index, and use that index for search, everything is fine.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Then I tried to replace the Nutch-built index to my Lucene-built one, by
&lt;br&gt;&amp;gt; changing the &amp;lt;searcher.dir&amp;gt; value in conf/nutch-site.xml. It does not
&lt;br&gt;&amp;gt; return
&lt;br&gt;&amp;gt; anything. I know some extra work needs to be done to get it work, but
&lt;br&gt;&amp;gt; don't
&lt;br&gt;&amp;gt; know how.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I've read some examples for building a plugin, but my problem seems a
&lt;br&gt;&amp;gt; little
&lt;br&gt;&amp;gt; different. I don't think I need to build a IndexFilter extension, but
&lt;br&gt;&amp;gt; maybe
&lt;br&gt;&amp;gt; only a QueryFilter or QueryParser, but not sure about any details.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Newbie to Nutch, and thanks a lot for your help!
&lt;/div&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-to-make-a-Lucene-built-index-work-with-Nutch--tp26286011p26294013.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26292704</id>
	<title>Re: How do I block/ban a specific domain name or a tld?</title>
	<published>2009-11-10T14:43:36Z</published>
	<updated>2009-11-10T14:43:36Z</updated>
	<author>
		<name>reinhard schwab</name>
	</author>
	<content type="html">opsec schrieb:
&lt;br&gt;&amp;gt; I've added this to my conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt
&lt;br&gt;&amp;gt; yet when I start a crawl this domain is heavily spidered. I would like to
&lt;br&gt;&amp;gt; remove it from my search results entirely and prevent it from being crawled
&lt;br&gt;&amp;gt; in the future and possibly all *.int tlds, how can i accomplish this?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; -^http://([a-z0-9]*\.)*who.int/
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;br&gt;why not
&lt;br&gt;&lt;br&gt;-^http://[^/]*\.int/
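&lt;br&gt;&lt;br&gt;A minimal sketch in Python (not Nutch code) of how such a rule would behave, assuming the filter tries the rules from top to bottom, lets the first pattern found in the URL decide (+ accepts, - rejects), and drops a URL that matches no rule. The rule list and the decide() helper are made up for illustration:
&lt;br&gt;&lt;pre&gt;
```python
import re

# Hypothetical rule list, in the order the lines would appear in the filter file.
RULES = [
    ("-", r"^http://[^/]*\.int/"),   # reject every host under the .int tld
    ("+", r"."),                     # accept anything else
]

def decide(url, rules=RULES):
    """Return True if the url is accepted, False if rejected or unmatched."""
    for sign, pattern in rules:
        # re.search looks for the pattern anywhere in the url; ^ anchors it at the start.
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the url as rejected (assumption stated above).
    return False

for url in ("http://www.who.int/", "http://who.int/about", "http://lucene.apache.org/"):
    print(url, decide(url))
```
&lt;/pre&gt;
&lt;br&gt;As written, the pattern only covers plain http:// URLs without a port; https:// or host:port URLs would need their own rule.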
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&amp;gt; Thanks for your time and any assistance, 
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; -Warren
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-do-I-block-ban-a-specific-domain-name-or-a-tld--tp26289091p26292704.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26289091</id>
	<title>How do I block/ban a specific domain name or a tld?</title>
	<published>2009-11-10T12:26:52Z</published>
	<updated>2009-11-10T12:26:52Z</updated>
	<author>
		<name>opsec</name>
	</author>
	<content type="html">I've added this to my conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt, yet when I start a crawl this domain is still heavily spidered. I would like to remove it from my search results entirely and prevent it from being crawled in the future, and possibly do the same for all *.int TLDs. How can I accomplish this?
&lt;br&gt;&lt;br&gt;-^http://([a-z0-9]*\.)*who.int/
&lt;br&gt;&lt;br&gt;Thanks for your time and any assistance, 
&lt;br&gt;&lt;br&gt;-Warren</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-do-I-block-ban-a-specific-domain-name-or-a-tld--tp26289091p26289091.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26286011</id>
	<title>How to make a Lucene-built index work with Nutch?</title>
	<published>2009-11-10T08:11:47Z</published>
	<updated>2009-11-10T08:11:47Z</updated>
	<author>
		<name>Wang Muyuan</name>
	</author>
	<content type="html">I have built a custom index using Lucene, as the data source is not web pages but some custom text files. I stored several indexed fields in my index.
&lt;br&gt;&lt;br&gt;Now my colleague requires me to make sure this index works with Nutch. So I set up Nutch on the server, followed the steps, crawled some pages, built an index, and used that index for search. Everything is fine.
&lt;br&gt;&lt;br&gt;Then I tried to replace the Nutch-built index with my Lucene-built one, by changing the &amp;lt;searcher.dir&amp;gt; value in conf/nutch-site.xml. It does not return anything. I know some extra work needs to be done to get it to work, but I don't know how. 
&lt;br&gt;&lt;br&gt;I've read some examples for building a plugin, but my problem seems a little different. I don't think I need to build an IndexFilter extension, but maybe only a QueryFilter or QueryParser; I'm not sure about any details.
&lt;br&gt;&lt;br&gt;Newbie to Nutch, and thanks a lot for your help!</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-to-make-a-Lucene-built-index-work-with-Nutch--tp26286011p26286011.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26278066</id>
	<title>Re: changing/adding field in existing index</title>
	<published>2009-11-09T20:15:06Z</published>
	<updated>2009-11-09T20:15:06Z</updated>
	<author>
		<name>Fadzi Ushewokunze-2</name>
	</author>
	<content type="html">that seems to work. thanks for that. it was a bit more fiddly than i
&lt;br&gt;expected but got the index sorted.
&lt;br&gt;&lt;br&gt;found an issue with sorting, though: most fields cannot be sorted by, and
&lt;br&gt;trying to sort on them throws a 
&lt;br&gt;&lt;br&gt;java.lang.RuntimeException: Unknown sort value type!
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:159)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:98)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.searcher.LuceneSearchBean.search(LuceneSearchBean.java:84)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:231)
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;On Mon, 2009-11-09 at 17:34 +0100, Andrzej Bialecki wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26278066&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;fadzi@...&lt;/a&gt; wrote:
&lt;br&gt;&amp;gt; &amp;gt; hi all,
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; i have an existing index - we have a custom field that needs to be added
&lt;br&gt;&amp;gt; &amp;gt; or changed in every currently indexed document ;
&lt;br&gt;&amp;gt; &amp;gt; 
&lt;br&gt;&amp;gt; &amp;gt; whats the best way to go about this without recreating the index again?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; There are ways to do it directly on the index, but this is complicated 
&lt;br&gt;&amp;gt; and involves hacking the low-level Lucene format. Alternatively, you 
&lt;br&gt;&amp;gt; could build a parallel index with just these fields, but synchronized 
&lt;br&gt;&amp;gt; internal docId-s, open both indexes with ParallelReader, and then create 
&lt;br&gt;&amp;gt; a new index using IndexWriter.addIndexes().
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I suggest recreating the index.
&lt;br&gt;&amp;gt; 
&lt;/div&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/changing-addding-field-in-existing-index-tp26260926p26278066.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26273548</id>
	<title>Cannot get slave nodes to run</title>
	<published>2009-11-09T12:57:11Z</published>
	<updated>2009-11-09T12:57:11Z</updated>
	<author>
		<name>kvorion</name>
	</author>
	<content type="html">Hi All...
&lt;br&gt;&lt;br&gt;I have been trying to set up nutch on a cluster of 3 machines. I could get the crawling and searching process to run independently on all 3 machines but when I try to integrate them as a single cluster, then none of the slaves are shown in the listing of nodes on the Hadoop Machine List page (&lt;a href=&quot;http://master:50030/machines.jsp&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://master:50030/machines.jsp&lt;/a&gt;).
&lt;br&gt;&lt;br&gt;There are no exceptions in any of the log files, and several log files are even empty. I am out of ideas. I would appreciate any help. Thanks in advance.
&lt;br&gt;&lt;br&gt;Here is my hadoop-site.xml (identical on all nodes)
&lt;br&gt;&lt;br&gt;&amp;lt;?xml-stylesheet type=&amp;quot;text/xsl&amp;quot; href=&amp;quot;configuration.xsl&amp;quot;?&amp;gt;
&lt;br&gt;&lt;br&gt;&amp;lt;!-- Put site-specific property overrides in this file. --&amp;gt;
&lt;br&gt;&lt;br&gt;&amp;lt;configuration&amp;gt;
&lt;br&gt;&lt;br&gt;&amp;lt;property&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;name&amp;gt;hadoop.tmp.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/tmp/&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;lt;/property&amp;gt;
&lt;br&gt;&lt;br&gt;&lt;br&gt;&amp;lt;property&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;value&amp;gt;hdfs://athena:9000&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;lt;/property&amp;gt;
&lt;br&gt;&lt;br&gt;&amp;lt;property&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;name&amp;gt;mapred.job.tracker&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;value&amp;gt;athena:9001&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;lt;/property&amp;gt;
&lt;br&gt;&lt;br&gt;&amp;lt;property&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;name&amp;gt;dfs.replication&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;lt;/property&amp;gt;
&lt;br&gt;&lt;br&gt;&amp;lt;property&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;name&amp;gt;dfs.name.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/name&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;lt;/property&amp;gt;
&lt;br&gt;&lt;br&gt;&amp;lt;property&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;name&amp;gt;dfs.data.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/data&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;lt;/property&amp;gt;
&lt;br&gt;&lt;br&gt;&amp;lt;property&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;name&amp;gt;mapred.system.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/mapreduce/system&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;lt;/property&amp;gt;
&lt;br&gt;&lt;br&gt;&amp;lt;property&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;name&amp;gt;mapred.local.dir&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;lt;value&amp;gt;/nutch/filesystem/mapreduce/local&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;lt;/property&amp;gt;
&lt;br&gt;&lt;br&gt;&amp;lt;/configuration&amp;gt;
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Cannot-get-slave-nodes-to-run-tp26273548p26273548.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26269370</id>
	<title>Re: changing/adding field in existing index</title>
	<published>2009-11-09T08:34:50Z</published>
	<updated>2009-11-09T08:34:50Z</updated>
	<author>
		<name>Andrzej Bialecki</name>
	</author>
	<content type="html">&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26269370&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;fadzi@...&lt;/a&gt; wrote:
&lt;br&gt;&amp;gt; hi all,
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; i have an existing index - we have a custom field that needs to be added
&lt;br&gt;&amp;gt; or changed in every currently indexed document ;
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; whats the best way to go about this without recreating the index again?
&lt;br&gt;&lt;br&gt;There are ways to do it directly on the index, but this is complicated 
&lt;br&gt;and involves hacking the low-level Lucene format. Alternatively, you 
&lt;br&gt;could build a parallel index with just these fields, but synchronized 
&lt;br&gt;internal docId-s, open both indexes with ParallelReader, and then create 
&lt;br&gt;a new index using IndexWriter.addIndexes().
&lt;br&gt;&lt;br&gt;I suggest recreating the index.
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;Best regards,
&lt;br&gt;Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;nbsp; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;[__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/changing-addding-field-in-existing-index-tp26260926p26269370.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26269228</id>
	<title>Nutch near future - strategic directions</title>
	<published>2009-11-09T08:24:06Z</published>
	<updated>2009-11-09T08:24:06Z</updated>
	<author>
		<name>Andrzej Bialecki</name>
	</author>
	<content type="html">Hi all,
&lt;br&gt;&lt;br&gt;The ApacheCon is over, our release 1.0 has been out already for some 
&lt;br&gt;time, so I think it's a good moment to discuss the next steps 
&lt;br&gt;in Nutch development.
&lt;br&gt;&lt;br&gt;Let me share with you the topics I identified and presented in the 
&lt;br&gt;ApacheCon slides, and some topics that are worth discussing based on 
&lt;br&gt;various conversations I had there, and the discussions we had on our 
&lt;br&gt;mailing list:
&lt;br&gt;&lt;br&gt;1. Avoid duplication of effort
&lt;br&gt;------------------------------
&lt;br&gt;Currently we spend significant effort on implementing functionality that 
&lt;br&gt;other projects are dedicated to. Instead of doing the same work, and 
&lt;br&gt;sometimes poorly, we should concentrate on delegating and reusing:
&lt;br&gt;&lt;br&gt;* Use Tika for content parsing: this will require some effort and 
&lt;br&gt;collaboration with the Tika project, to improve Tika's ability to handle 
&lt;br&gt;more complex formats well (e.g. hierarchical compound documents such as 
&lt;br&gt;archives, mailboxes, RSS), and to contribute any missing parsers (e.g. 
&lt;br&gt;parse-swf).
&lt;br&gt;&lt;br&gt;* Use Solr for indexing &amp; search: it is hard to justify the effort of 
&lt;br&gt;developing and maintaining our own search server - Solr offers much more 
&lt;br&gt;functionality, configurability, performance and ease of integration than 
&lt;br&gt;our relatively primitive search server. Our integration with Solr needs 
&lt;br&gt;to be improved so that it's easier to setup and operate.
&lt;br&gt;&lt;br&gt;* Use database-like storage abstraction: this may seem like a serious 
&lt;br&gt;departure from the current architecture, but I don't mean that we should 
&lt;br&gt;switch to an SQL DB ... what this means is that we should provide an 
&lt;br&gt;option to use HBase, as well as the current plain MapFile-s (and perhaps 
&lt;br&gt;other types of DBs, such as Berkeley DB or SQL, if it makes sense) as 
&lt;br&gt;our storage. There is a very promising initial port of Nutch to HBase, 
&lt;br&gt;which is currently closely integrated with HBase API (which is both good 
&lt;br&gt;and bad) - it provides several improvements over our current storage, so 
&lt;br&gt;I think it's worth using as the new default, but let's see if we can 
&lt;br&gt;make it more abstract.
&lt;br&gt;&lt;br&gt;* Plugins: the initial OSGI port looks good, but I'm not sure yet at 
&lt;br&gt;this moment if the benefits of OSGI outweigh the cost of this change ...
&lt;br&gt;&lt;br&gt;* Shard management: this is currently an Achilles' heel of Nutch, where 
&lt;br&gt;users are left on their own ... If we switch to using HBase then at 
&lt;br&gt;least on the crawling side the shard management will become much easier. 
&lt;br&gt;This still leaves the problem of deploying new content to search 
&lt;br&gt;server(s). The candidate framework for this side of the shard management 
&lt;br&gt;is Katta + patches provided by Ted Dunning (see ???). If we switch to 
&lt;br&gt;using Solr we would have to &amp;nbsp;also use the Katta / Solr integration, and 
&lt;br&gt;perhaps Solr/Hadoop integration as well. This is a complex mix of 
&lt;br&gt;half-ready components that needs to be well thought-through ...
&lt;br&gt;&lt;br&gt;* Crawler Commons: during our Crawler MeetUp all representatives agreed 
&lt;br&gt;that we should collect a few components that are nearly the same across 
&lt;br&gt;all projects and collaborate on their development, and use them as an 
&lt;br&gt;external dependency. The candidate components are:
&lt;br&gt;&lt;br&gt;&amp;nbsp; - robots.txt parsing
&lt;br&gt;&amp;nbsp; - URL filtering and normalization
&lt;br&gt;&amp;nbsp; - page signature (fingerprint) implementations
&lt;br&gt;&amp;nbsp; - page template detection &amp; removal (aka. main content extraction)
&lt;br&gt;&amp;nbsp; - possibly others, like URL redirection tracking, PageRank 
&lt;br&gt;calculation, crawler trap detection etc.
&lt;br&gt;&lt;br&gt;2. Make Nutch easier to use
&lt;br&gt;---------------------------
&lt;br&gt;This, as you may remember from our earlier discussions, raises the question: 
&lt;br&gt;who is the target audience of Nutch?
&lt;br&gt;&lt;br&gt;In my opinion, the main users of Nutch are vertical search engines, and 
&lt;br&gt;this is the audience that we should cater to. There are many reasons for 
&lt;br&gt;this:
&lt;br&gt;&lt;br&gt;- Nutch is too complex and too heavy for those that need to crawl up to 
&lt;br&gt;a few thousand pages. Now that the Droids project exists it's probably 
&lt;br&gt;not worth the effort to attempt a complete re-design of Nutch so that it 
&lt;br&gt;fits the need of this group - Nutch is based on map-reduce, and it's not 
&lt;br&gt;likely we would want to change that, so this means there will always be 
&lt;br&gt;a significant overhead for small jobs. I'm not saying we should not make 
&lt;br&gt;Nutch easier to use, but for small crawls Nutch is overkill. Also, in 
&lt;br&gt;many cases these users don't realize that they don't do any frontier 
&lt;br&gt;discovery and expansion, and what they really need is Solr.
&lt;br&gt;&lt;br&gt;- at the other end of the spectrum, there are very few companies 
&lt;br&gt;that want to do wide, web-scale crawling - this is costly, and 
&lt;br&gt;requires a solid business plan and serious funding. These users are 
&lt;br&gt;prepared anyway to spend significant effort on customizations and 
&lt;br&gt;problem-solving, or they may want to use only some parts of Nutch. Often 
&lt;br&gt;they are also not too eager to contribute back to the project - either 
&lt;br&gt;because of their proprietary nature or because their customizations are 
&lt;br&gt;not useful for a general audience.
&lt;br&gt;&lt;br&gt;The remaining group is interested in medium-size, high quality crawling 
&lt;br&gt;(focused, with good spam &amp; junk controls), which means either enterprise 
&lt;br&gt;search or vertical search. We should make Nutch an attractive platform 
&lt;br&gt;for such users, and we should discuss what this entails. Also, if we 
&lt;br&gt;refactor Nutch in the way I described above, it will be easier for such 
&lt;br&gt;users to contribute back to Nutch and other related projects.
&lt;br&gt;&lt;br&gt;3. Provide a platform for solving the really interesting issues
&lt;br&gt;---------------------------------------------------------------
&lt;br&gt;Nutch has many bits and pieces that implement really smart algorithms 
&lt;br&gt;and heuristics to solve difficult issues that occur in crawling. The 
&lt;br&gt;problem is that they are often well hidden and poorly documented, and 
&lt;br&gt;their interaction with the rest of the system is far from obvious. 
&lt;br&gt;Sometimes this is related to premature performance optimizations, in 
&lt;br&gt;other cases this is just a poorly abstracted design. Examples would 
&lt;br&gt;include the OPIC scoring, meta-tags &amp; metadata handling, deduplication, 
&lt;br&gt;redirection handling, etc.
&lt;br&gt;&lt;br&gt;Even though these components are usually implemented as plugins, this 
&lt;br&gt;lack of transparency and poor design makes it difficult to experiment 
&lt;br&gt;with Nutch. I believe that improving this area will result in many more 
&lt;br&gt;users contributing back to the project, both from business and from 
&lt;br&gt;academia.
&lt;br&gt;&lt;br&gt;And there are quite a few interesting challenges to solve:
&lt;br&gt;&lt;br&gt;* crawl scheduling, i.e. determining the order and composition of 
&lt;br&gt;fetchlists to maximize the crawling speed.
&lt;br&gt;&lt;br&gt;* spam &amp; junk detection (I won't go into details on this, there are tons 
&lt;br&gt;of literature on the subject)
&lt;br&gt;&lt;br&gt;* crawler trap handling (e.g. the classic calendar page that generates 
&lt;br&gt;an infinite number of pages).
&lt;br&gt;&lt;br&gt;* enterprise-specific ranking and scoring. This includes users' feedback 
&lt;br&gt;(explicit and implicit, e.g. click-throughs)
&lt;br&gt;&lt;br&gt;* pagelet-level crawling (e.g. portals, RSS feeds, discussion fora)
&lt;br&gt;&lt;br&gt;* near-duplicate detection, and the closely related issue of extracting 
&lt;br&gt;the main content from a templated page.
&lt;br&gt;&lt;br&gt;* URL aliasing (e.g. www.a.com == a.com == a.com/index.html == 
&lt;br&gt;a.com/default.asp), and what happens with inlinks to such aliased pages. 
&lt;br&gt;Also related to this is the problem of temporary/permanent redirects and 
&lt;br&gt;complete mirrors.
&lt;br&gt;&lt;br&gt;Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an 
&lt;br&gt;attractive platform to develop and experiment with such components.
&lt;br&gt;&lt;br&gt;-----------------
&lt;br&gt;Briefly ;) that's what comes to my mind when I think about the future of 
&lt;br&gt;Nutch. I invite you all to share your thoughts and suggestions!
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;Best regards,
&lt;br&gt;Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;nbsp; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;[__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch-near-future---strategic-directions-tp26269228p26269228.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26269033</id>
	<title>RE: Simple vertical search engine question</title>
	<published>2009-11-09T08:11:13Z</published>
	<updated>2009-11-09T08:11:13Z</updated>
	<author>
		<name>Funtick</name>
	</author>
	<content type="html">Premium Google publishers (&amp;gt;20 million pageviews per month) may use more
&lt;br&gt;features of AdSense such as explicit keywords in a query (to Google)
&lt;br&gt;&lt;br&gt;&lt;br&gt;&amp;gt; -----Original Message-----
&lt;br&gt;&amp;gt; From: Carlos Vera [mailto:&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26269033&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;carlodesilva2@...&lt;/a&gt;]
&lt;br&gt;&amp;gt; Sent: November-09-09 10:53 AM
&lt;br&gt;&amp;gt; To: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26269033&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;nutch-user@...&lt;/a&gt;
&lt;br&gt;&amp;gt; Subject: Simple vertical search engine question
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I have looked into a few vertical search engines like indeed.com and
&lt;br&gt;&amp;gt; simplyhired.com. Does anyone know how vertical search engines like indeed.com
&lt;br&gt;&amp;gt; and simplyhired.com display relevant Google ads for the searched keywords on
&lt;br&gt;&amp;gt; their site?
&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Simple-vertical-search-engine-question-tp26268714p26269033.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26268948</id>
	<title>Re: PRUNE : need some help on pruning syntax.</title>
	<published>2009-11-09T08:06:54Z</published>
	<updated>2009-11-09T08:06:54Z</updated>
	<author>
		<name>Fadzi Ushewokunze-2</name>
	</author>
	<content type="html">one option is to extend the html parser and look for these things and
&lt;br&gt;ignore them.
&lt;br&gt;&lt;br&gt;you might also want to look at this forum posting:
&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://www.mail-archive.com/nutch-user@lucene.apache.org/msg13969.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.mail-archive.com/nutch-user@.../msg13969.html&lt;/a&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;On Mon, 2009-11-09 at 07:39 -0800, Annappa wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hi,
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; I am using Nutch-0.9 to crawl a web application which has a
&lt;br&gt;&amp;gt; header part, menu part, left navigation and main content area. 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; When I search for a particular keyword that appears in the main
&lt;br&gt;&amp;gt; menu, the results repeat as many times as there are pages, because the menu
&lt;br&gt;&amp;gt; is included in all the pages. So I need to restrict my search so that it does not
&lt;br&gt;&amp;gt; search the content of a particular div
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; ex : &amp;lt;div class=&amp;quot;menu&amp;quot;&amp;gt; ................ &amp;nbsp; &amp;lt;/div&amp;gt;.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; How do I remove the content of a div from the search?
&lt;br&gt;&amp;gt; 
&lt;/div&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/PRUNE-%3A-need-some-help-on-pruning-syntax.-tp26268447p26268948.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26268714</id>
	<title>Simple vertical search engine question</title>
	<published>2009-11-09T07:52:32Z</published>
	<updated>2009-11-09T07:52:32Z</updated>
	<author>
		<name>Carlos Vera</name>
	</author>
	<content type="html">I have looked into a few vertical search engines like indeed.com and
&lt;br&gt;simplyhired.com. Does anyone know how vertical search engines like indeed.com and
&lt;br&gt;simplyhired.com display relevant Google ads for the searched keywords on
&lt;br&gt;their site?
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Simple-vertical-search-engine-question-tp26268714p26268714.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26268447</id>
	<title>PRUNE : need some help on pruning syntax.</title>
	<published>2009-11-09T07:39:07Z</published>
	<updated>2009-11-09T07:39:07Z</updated>
	<author>
		<name>Annappa</name>
	</author>
	<content type="html">Hi,
&lt;br&gt;&lt;br&gt;I am using Nutch-0.9 to crawl a web application which has a header part, menu part, left navigation and main content area. 
&lt;br&gt;&lt;br&gt;When I search for a particular keyword that appears in the main menu, the results repeat as many times as there are pages, because the menu is included in all the pages. So I need to restrict my search so that it does not search the content of a particular div
&lt;br&gt;&lt;br&gt;ex : &amp;lt;div class=&amp;quot;menu&amp;quot;&amp;gt; ................ &amp;nbsp; &amp;lt;/div&amp;gt;.
&lt;br&gt;&lt;br&gt;&lt;br&gt;How do I remove the content of a div from the search?
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/PRUNE-%3A-need-some-help-on-pruning-syntax.-tp26268447p26268447.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26261035</id>
	<title>Re: Growing the index : Merging vs incremental</title>
	<published>2009-11-08T20:10:01Z</published>
	<updated>2009-11-08T20:10:01Z</updated>
	<author>
		<name>Fadzi Ushewokunze-2</name>
	</author>
	<content type="html">hi,
&lt;br&gt;&lt;br&gt;we are in a sort of similar situation;
&lt;br&gt;&lt;br&gt;So would be really happy to hear any suggestions on this.
&lt;br&gt;&lt;br&gt;incremental crawling doesn't seem to really work for us because it seems
&lt;br&gt;the same urls are being crawled over and over (on a daily basis!);
&lt;br&gt;&lt;br&gt;have you tried these settings or similar?
&lt;br&gt;&lt;br&gt;db.fetch.schedule.class = AdaptiveFetchSchedule
&lt;br&gt;db.update.additions.allowed = true
&lt;br&gt;db.ignore.internal.links = false
&lt;br&gt;db.ignore.external.links = true (because we are intranet only)
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Currently we crawl every two days and create a new index and then merge
&lt;br&gt;&amp;gt; with
&lt;br&gt;&amp;gt; earlier index. For one, it takes too long, as mergesegs seems to take time
&lt;br&gt;&amp;gt; proportional to the size of both indexes combined. An equally problematic
&lt;br&gt;&amp;gt; issue is that mergesegs fails a significant portion of the time. The probability
&lt;br&gt;&amp;gt; becomes higher with size. Problems exist whether the merge is done within
&lt;br&gt;&amp;gt; Hadoop
&lt;br&gt;&amp;gt; or outside.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Two questions:
&lt;br&gt;&amp;gt; (a) Has anybody been able to do a Nutch merge predictably,
&lt;br&gt;&amp;gt; irrespective
&lt;br&gt;&amp;gt; of the size? Any tips? We are trying to merge data for up to 200K URLs at a
&lt;br&gt;&amp;gt; time.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; (b) How can we do incremental indexing, where we add data from latest
&lt;br&gt;&amp;gt; crawl,
&lt;br&gt;&amp;gt; but there is only one index that keeps growing. &amp;nbsp;I saw a lot of older posts
&lt;br&gt;&amp;gt; regarding incremental indexing and no clear answers.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Thanks in advance for your help.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Shreekanth
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; View this message in context:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://old.nabble.com/Growing-the-index-%3A-Merging-vs-incremental-tp26228341p26228341.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://old.nabble.com/Growing-the-index-%3A-Merging-vs-incremental-tp26228341p26228341.html&lt;/a&gt;&lt;br&gt;&amp;gt; Sent from the Nutch - User mailing list archive at Nabble.com.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Growing-the-index-%3A-Merging-vs-incremental-tp26228341p26261035.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26260926</id>
	<title>changing/adding field in existing index</title>
	<published>2009-11-08T19:53:15Z</published>
	<updated>2009-11-08T19:53:15Z</updated>
	<author>
		<name>Fadzi Ushewokunze-2</name>
	</author>
	<content type="html">hi all,
&lt;br&gt;&lt;br&gt;i have an existing index - we have a custom field that needs to be added
&lt;br&gt;or changed in every currently indexed document ;
&lt;br&gt;&lt;br&gt;what's the best way to go about this without recreating the index again?
&lt;br&gt;&lt;br&gt;currently some documents have the field, some don't;
&lt;br&gt;&lt;br&gt;the ones that have it need to be updated - eventually this field will be
&lt;br&gt;used for sorting.
&lt;br&gt;&lt;br&gt;Thanks;
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/changing-addding-field-in-existing-index-tp26260926p26260926.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26252312</id>
	<title>Re: MergeSegments - java.lang.OutOfMemoryError</title>
	<published>2009-11-08T02:19:28Z</published>
	<updated>2009-11-08T02:19:28Z</updated>
	<author>
		<name>Julien Nioche-4</name>
	</author>
	<content type="html">Hi guys,
&lt;br&gt;&lt;br&gt;Could you send a stack trace of the process? Have you tried using a profiler
&lt;br&gt;to check where the memory was used?
&lt;br&gt;Check &lt;a href=&quot;http://hadoop.apache.org/common/docs/current/mapred_tutorial.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://hadoop.apache.org/common/docs/current/mapred_tutorial.html&lt;/a&gt;&amp;nbsp;for
&lt;br&gt;instructions on how to profile with Hadoop in (pseudo) distributed mode.
&lt;br&gt;&lt;br&gt;Julien
&lt;br&gt;-- 
&lt;br&gt;DigitalPebble Ltd
&lt;br&gt;&lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;2009/11/8 Fadzi Ushewokunze &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26252312&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;fadzi@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; i have a similar issue; i haven't been able to get to the bottom of it.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; On Sat, 2009-11-07 at 23:31 -0500, kevin chen wrote:
&lt;br&gt;&amp;gt; &amp;gt; Hi, I have been using a trunk version of nutch since Jul 2007. It's been
&lt;br&gt;&amp;gt; &amp;gt; running fine since.
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Recently I have been experimenting with nutch 1.0. Everything worked great and
&lt;br&gt;&amp;gt; &amp;gt; better until I started to use MergeSegments. &amp;nbsp;I was merging segments with
&lt;br&gt;&amp;gt; &amp;gt; around 20k urls and it gave me OutOfMemoryError. I have tried to
&lt;br&gt;&amp;gt; &amp;gt; increase the java heap max to 3G, but I still got OutOfMemoryError. &amp;nbsp;In
&lt;br&gt;&amp;gt; &amp;gt; contrast, in my older version of nutch, the same merge works with the
&lt;br&gt;&amp;gt; &amp;gt; default java heap max setting of only 1G.
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Does anybody have the same experience? Is there any workaround for this?
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt; &amp;gt; Thanks
&lt;br&gt;&amp;gt; &amp;gt; Kevin Chen
&lt;br&gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/MergeSegments---java.lang.OutOfMemoryError-tp26250966p26252312.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26251990</id>
	<title>Re: MergeSegments - java.lang.OutOfMemoryError</title>
	<published>2009-11-08T01:23:53Z</published>
	<updated>2009-11-08T01:23:53Z</updated>
	<author>
		<name>Fadzi Ushewokunze-2</name>
	</author>
	<content type="html">i have a similar issue; i haven't been able to get to the bottom of it. 
&lt;br&gt;&lt;br&gt;On Sat, 2009-11-07 at 23:31 -0500, kevin chen wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hi, I have been using a trunk version of nutch since Jul 2007. It's been
&lt;br&gt;&amp;gt; running fine since.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Recently I have been experimenting with nutch 1.0. Everything worked great and
&lt;br&gt;&amp;gt; better until I started to use MergeSegments. &amp;nbsp;I was merging segments with
&lt;br&gt;&amp;gt; around 20k urls and it gave me OutOfMemoryError. I have tried to
&lt;br&gt;&amp;gt; increase the java heap max to 3G, but I still got OutOfMemoryError. &amp;nbsp;In
&lt;br&gt;&amp;gt; contrast, in my older version of nutch, the same merge works with the
&lt;br&gt;&amp;gt; default java heap max setting of only 1G.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Does anybody have the same experience? Is there any workaround for this?
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Thanks
&lt;br&gt;&amp;gt; Kevin Chen
&lt;br&gt;&amp;gt; 
&lt;/div&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/MergeSegments---java.lang.OutOfMemoryError-tp26250966p26251990.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26250966</id>
	<title>MergeSegments - java.lang.OutOfMemoryError</title>
	<published>2009-11-07T20:31:21Z</published>
	<updated>2009-11-07T20:31:21Z</updated>
	<author>
		<name>kevin chen-6</name>
	</author>
	<content type="html">Hi, I have been using a trunk version of nutch since Jul 2007. It's been
&lt;br&gt;running fine since.
&lt;br&gt;&lt;br&gt;Recently I have been experimenting with nutch 1.0. Everything worked great and
&lt;br&gt;better until I started to use MergeSegments. &amp;nbsp;I was merging segments with
&lt;br&gt;around 20k urls and it gave me OutOfMemoryError. I have tried to
&lt;br&gt;increase the java heap max to 3G, but I still got OutOfMemoryError. &amp;nbsp;In
&lt;br&gt;contrast, in my older version of nutch, the same merge works with the
&lt;br&gt;default java heap max setting of only 1G.
&lt;br&gt;&lt;br&gt;Does anybody have the same experience? Is there any workaround for this?
&lt;br&gt;&lt;br&gt;Thanks
&lt;br&gt;Kevin Chen
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/MergeSegments---java.lang.OutOfMemoryError-tp26250966p26250966.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26250181</id>
	<title>Re: What are the configuration parameters to fine tune Nutch performance</title>
	<published>2009-11-07T17:18:21Z</published>
	<updated>2009-11-07T17:18:21Z</updated>
	<author>
		<name>John Whelan</name>
	</author>
	<content type="html">The default tuning parameters are specified in nutch/conf/nutch-default.xml, and can be overridden in nutch/conf/nutch-site.xml. (Or in the crawl command line, but I believe that the 'best practice' is to configure settings in nutch-site.xml.)
&lt;br&gt;&lt;br&gt;My personal belief is that the two most valuable parameters for tuning the crawler are 'fetcher.threads.fetch' and 'fetcher.threads.per.host'. However, there are lots of other parameters for tuning, and you might find more value in some of the timeout parameters. (You might also want to look at tuning your JVM heap space, but I've never seen a real need to tweak it.)
&lt;br&gt;&lt;br&gt;As far as resuming a failed crawl, I don't know of any way to do so. I always discard and restart.
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/What-are-the-configuration-parameters-to-fine-tune-Nutch-performance-tp26125943p26250181.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26250010</id>
	<title>Re: can Nutch crawl XLS and XLSX file???</title>
	<published>2009-11-07T16:43:27Z</published>
	<updated>2009-11-07T16:43:27Z</updated>
	<author>
		<name>John Whelan</name>
	</author>
	<content type="html">Nutch can index MS Word, MS Powerpoint, MS Excel, and PDF files. In order for these types to be crawled, you need to have the plugins specified in the plugin.includes value of nutch/conf/nutch-site.xml (values are 'parse-(msexcel|mspowerpoint|msword|pdf)'.)
&lt;br&gt;&lt;br&gt;I was not sure if the new XLSX format was supported, so I looked at nutch/conf/tika-mimetypes.xml. From what I can tell, XLSX files are not supported (as of the 11/6/2009 build of Nutch); only the following extensions are supported:
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;mime-type type=&amp;quot;application/vnd.ms-excel&amp;quot;&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;magic priority=&amp;quot;50&amp;quot;&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;match value=&amp;quot;Microsoft Excel 5.0 Worksheet&amp;quot; type=&amp;quot;string&amp;quot;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; offset=&amp;quot;2080&amp;quot; /&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;/magic&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;glob pattern=&amp;quot;*.xls&amp;quot; /&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;glob pattern=&amp;quot;*.xlc&amp;quot; /&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;glob pattern=&amp;quot;*.xll&amp;quot; /&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;glob pattern=&amp;quot;*.xlm&amp;quot; /&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;glob pattern=&amp;quot;*.xlw&amp;quot; /&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;glob pattern=&amp;quot;*.xla&amp;quot; /&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;glob pattern=&amp;quot;*.xlt&amp;quot; /&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;glob pattern=&amp;quot;*.xld&amp;quot; /&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;alias type=&amp;quot;application/msexcel&amp;quot; /&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;/mime-type&amp;gt;
&lt;br&gt;&lt;br&gt;...of course, I could be wrong.</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/can-Nutch-crawl-XLS-and-XLSX-file----tp26214541p26250010.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26249368</id>
	<title>Re: How to make nutch crawl within a sub category of an URL?</title>
	<published>2009-11-07T14:44:05Z</published>
	<updated>2009-11-07T14:44:05Z</updated>
	<author>
		<name>John Whelan</name>
	</author>
	<content type="html">If it were me, I'd try the following...
&lt;br&gt;&lt;br&gt;Use '&lt;a href=&quot;http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3?link=list&amp;sid=396545660'&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3?link=list&amp;sid=396545660'&lt;/a&gt;&amp;nbsp;as a starting point URL, and set up the following filtering rules (crawl-urlfilter.txt):
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp;+^&lt;a href=&quot;http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3?link=list&amp;sid=396545660&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3?link=list&amp;sid=396545660&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp;+^&lt;a href=&quot;http://answers.yahoo.com/question&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://answers.yahoo.com/question&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp;-.
&lt;br&gt;&lt;br&gt;This should allow the 'Computers &amp; Internet' page to be crawled, and also allow the associated questions to be crawled, but wouldn't traverse beyond that. In order to be sure, you would also want to limit your crawl depth to 2 or 3.</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-to-make-nutch-crawl-within-a-sub-category-of-an-URL--tp26175613p26249368.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26248538</id>
	<title>Re: No search results</title>
	<published>2009-11-07T13:02:00Z</published>
	<updated>2009-11-07T13:02:00Z</updated>
	<author>
		<name>John Whelan</name>
	</author>
	<content type="html">By any chance is your WAR file (used for the web server) from a slightly different version of Nutch than your crawler? I have seen that this results in the Nutch page showing up, but no results are listed. Another possibility is that you are not launching your search engine from the correct directory; I believe that you are supposed to launch it from the parent directory of your crawler results directory.</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/No-search-results-tp26145245p26248538.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26245651</id>
	<title>no results for local file crawls?</title>
	<published>2009-11-07T10:16:41Z</published>
	<updated>2009-11-07T10:16:41Z</updated>
	<author>
		<name>John Whelan</name>
	</author>
	<content type="html">Hello,
&lt;br&gt;&lt;br&gt;I'm trying to crawl the local filesystem. It appears that the crawl is successful, but later searches don't display the content. During the crawl, I see the following:
&lt;br&gt;&lt;br&gt;...
&lt;br&gt;fetching file:///c:/test/test.txt
&lt;br&gt;fetching &lt;a href=&quot;http://www.cnn.com/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.cnn.com/&lt;/a&gt;&lt;br&gt;...
&lt;br&gt;&lt;br&gt;I know from this that it is finding the file (otherwise I would get a 404 error), and I know that the protocol-file plugin is configured (otherwise I would get a protocol not found error). My test file contains &amp;quot;Hello World!&amp;quot;, but when I query on 'world', all I get is the CNN page in the results.
&lt;br&gt;&lt;br&gt;Anyone have any idea as to what I'm doing wrong? (I've tried this with 2 different 1.1 nightly builds; one from last week and a different one from September. Also, I'm running in a CygWin environment.)
&lt;br&gt;&lt;br&gt;Thanks,
&lt;br&gt;John</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/no-results-for-local-file-crawls--tp26245651p26245651.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26244609</id>
	<title>Re: Hadoop wants to do whoami?</title>
	<published>2009-11-07T05:02:34Z</published>
	<updated>2009-11-07T05:02:34Z</updated>
	<author>
		<name>Paul Tomblin</name>
	</author>
	<content type="html">On Fri, Nov 6, 2009 at 11:44 PM, Ken Krugler
&lt;br&gt;&amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26244609&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kkrugler_lists@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&lt;br&gt;&amp;gt; Normally it works fine, but it will fail if you don't have swap space
&lt;br&gt;&amp;gt; allocated because that's factored into the free space calc when the fork
&lt;br&gt;&amp;gt; happens.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; What's the swap space setup for your VPS setup?
&lt;br&gt;&lt;br&gt;There's no swap space.
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;&lt;a href=&quot;http://www.linkedin.com/in/paultomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.linkedin.com/in/paultomblin&lt;/a&gt;&lt;br&gt;&lt;a href=&quot;http://careers.stackoverflow.com/ptomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://careers.stackoverflow.com/ptomblin&lt;/a&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Hadoop-wants-to-do-whoami--tp26241246p26244609.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26243924</id>
	<title>Re: Distributed search, is there a better method?</title>
	<published>2009-11-07T03:29:56Z</published>
	<updated>2009-11-07T03:29:56Z</updated>
	<author>
		<name>Julien Nioche-4</name>
	</author>
	<content type="html">Hi,
&lt;br&gt;&lt;br&gt;Generating from a 100K crawlDB should be quite fast. Have you checked that
&lt;br&gt;the IP resolution is turned off? Do you have any special URL filters that
&lt;br&gt;could take a lot of time to process? Generating and merging tend to take
&lt;br&gt;more and more time as the crawlDB grows but this should not be too much of
&lt;br&gt;an issue at your scale.
&lt;br&gt;&lt;br&gt;Could you dump the stats of your crawlDB and tell us how long the generation
&lt;br&gt;step takes?
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&amp;gt; One problem I've run into so far is the amount of time the generate command
&lt;br&gt;&amp;gt; increases with each iteration. The only item that really seems to grow out
&lt;br&gt;&amp;gt; of control is the unfetched URLs, which is expected with such a small
&lt;br&gt;&amp;gt; sample
&lt;br&gt;&amp;gt; of web pages, but it doesn't make sense to me as to why it would take so
&lt;br&gt;&amp;gt; long to generate a list of 1000 urls to fetch out of a list of 100k. Those
&lt;br&gt;&amp;gt; are small numbers in terms of database and computing in general.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;Julien
&lt;br&gt;-- 
&lt;br&gt;DigitalPebble Ltd
&lt;br&gt;&lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Distributed-search%2C-is-there-a-better-method--tp26241631p26243924.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26243647</id>
	<title>Re: Hadoop wants to do whoami?</title>
	<published>2009-11-07T02:43:09Z</published>
	<updated>2009-11-07T02:43:09Z</updated>
	<author>
		<name>Fadzi Ushewokunze-2</name>
	</author>
	<content type="html">if you are running under windows try to run the crawler under cygwin.
&lt;br&gt;&lt;br&gt;&lt;br&gt;On Fri, 2009-11-06 at 20:21 -0500, Paul Tomblin wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; I'm trying to move my crawler from a shared hosting environment (where
&lt;br&gt;&amp;gt; it kept getting killed off for using too much memory) to a VPS. &amp;nbsp;But
&lt;br&gt;&amp;gt; on the new host, I'm getting the following exception:
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; [ WARN] 01:15:17 (FileSystem.java:&amp;lt;init&amp;gt;:1440)
&lt;br&gt;&amp;gt; uri=file:///
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; javax.security.auth.login.LoginException: Login failed: Cannot run
&lt;br&gt;&amp;gt; program &amp;quot;whoami&amp;quot;: java.io.IOException: error=12, Cannot allocate
&lt;br&gt;&amp;gt; memory
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.fs.FileSystem$Cache$Key.&amp;lt;init&amp;gt;(FileSystem.java:1438)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:319)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:313)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.crawl.Injector.inject(Injector.java:152)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at com.lucidityworks.nutch.crawler.Crawler.crawlIt(Crawler.java:407)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at com.lucidityworks.nutch.crawler.Crawler.crawlSite(Crawler.java:381)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at com.lucidityworks.nutch.crawler.Crawler.crawlCategory(Crawler.java:255)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at com.lucidityworks.nutch.crawler.Crawler.crawl(Crawler.java:166)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at com.lucidityworks.nutch.crawler.Crawler.main(Crawler.java:724)
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; What is going on here?
&lt;br&gt;&amp;gt; 
&lt;/div&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Hadoop-wants-to-do-whoami--tp26241246p26243647.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26242490</id>
	<title>Re: updatedb is talking long long time</title>
	<published>2009-11-06T22:38:55Z</published>
	<updated>2009-11-06T22:38:55Z</updated>
	<author>
		<name>Kalaimathan Mahenthiran</name>
	</author>
	<content type="html">Hi
&lt;br&gt;&lt;br&gt;I have tried your suggestion of lowering db.max.outlinks.per.page to a
&lt;br&gt;smaller number. I could not reparse the segment as the segment was
&lt;br&gt;already parsed... I tried modifying some other variables such as
&lt;br&gt;java_heap memory and mapreduce_child_opts values... modifying these
&lt;br&gt;values triggered some exceptions.
&lt;br&gt;&lt;br&gt;Therefore I have generated a new segment (in case something
&lt;br&gt;is wrong with the previous segment) and am redoing the fetching process.
&lt;br&gt;Once this is complete I will try to do updatedb again and see if
&lt;br&gt;that works...
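&lt;br&gt;&lt;br&gt;For reference, the cap that Julien describes in the quoted message below
&lt;br&gt;(db.fetch.links.max in a modified CrawlDbReducer) can be modelled roughly like this.
&lt;br&gt;This is only a Python sketch of the idea, not the Nutch code: collect_linked, the
&lt;br&gt;(status, payload) pairs and the status strings are made up for illustration.
&lt;br&gt;&lt;pre&gt;
```python
def collect_linked(values, max_links=-1):
    """Collect the 'linked' entries seen for one URL, keeping at most max_links.

    values is a hypothetical list of (status, payload) pairs standing in
    for the entries the reducer receives for a single URL.
    max_links == -1 means no cap, mirroring the default in the quoted snippet.
    """
    linked = []
    for status, payload in values:
        if status != "linked":
            continue  # other statuses are handled elsewhere
        if max_links != -1 and len(linked) == max_links:
            continue  # cap reached: drop further inlinks instead of buffering them
        linked.append(payload)
    return linked


# Ten inlinks plus one unrelated entry for the same URL.
values = [("linked", i) for i in range(10)] + [("fetched", None)]

capped = collect_linked(values, max_links=3)
uncapped = collect_linked(values)
print(capped, len(uncapped))
```
&lt;/pre&gt;
&lt;br&gt;With a cap in place the list held in memory per URL stays bounded no matter how many
&lt;br&gt;inlinks a page has.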
&lt;br&gt;&lt;br&gt;Mathan
&lt;br&gt;On Fri, Nov 6, 2009 at 5:40 AM, Julien Nioche
&lt;br&gt;&amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26242490&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;lists.digitalpebble@...&lt;/a&gt;&amp;gt; wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hello Kalaimathan,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Any luck with your updateDB? I would be curious to know if the tricks I
&lt;br&gt;&amp;gt; suggested worked.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; J.
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; DigitalPebble Ltd
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009/11/3 Julien Nioche &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26242490&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;lists.digitalpebble@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; OK. Try reparsing and set a lower value to *db.max.outlinks.per.page*. I
&lt;br&gt;&amp;gt;&amp;gt; am pretty sure that you are running out of memory because of the inlinks
&lt;br&gt;&amp;gt;&amp;gt; which are stored in RAM.
&lt;br&gt;&amp;gt;&amp;gt; Applying the patch NUTCH-702 would also help.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I have modified the CrawlDBReducer and added another parameter
&lt;br&gt;&amp;gt;&amp;gt; *db.fetch.links.max*:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; *  switch (datum.getStatus()) {                // collect other info
&lt;br&gt;&amp;gt;&amp;gt;       case CrawlDatum.STATUS_LINKED:
&lt;br&gt;&amp;gt;&amp;gt;         if (maxLinks!=-1 &amp;&amp; linked.size()&amp;gt;= maxLinks) break;
&lt;br&gt;&amp;gt;&amp;gt; *
&lt;br&gt;&amp;gt;&amp;gt; where maxLinks is a variable which I initialize from the configure() method
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; *    maxLinks = job.getInt(&amp;quot;db.fetch.links.max&amp;quot;, -1);*
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I have not tried *db.max.outlinks.per.page *at all but am pretty sure that
&lt;br&gt;&amp;gt;&amp;gt; *db.fetch.links.max* works fine.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; There is also a parameter *db.max.inlinks* but it affects only the
&lt;br&gt;&amp;gt;&amp;gt; LinkDBMerger
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Let us know if that fixes the problem
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Julien
&lt;br&gt;&amp;gt;&amp;gt; --
&lt;br&gt;&amp;gt;&amp;gt; DigitalPebble Ltd
&lt;br&gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; 2009/11/3 Kalaimathan Mahenthiran &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26242490&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mathan55@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; I can see that its running out of ram because... before starting
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; updatedb process i have approximately 7.7gb left on the system and as
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; soon as this starts running for some time.. the ram comes to ~48
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; bytes...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; definitely its clogging all the ram space...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; i specified the heap size to be 9 gb.. in the hadoop-site.xml like below
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;lt;property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;  &amp;lt;name&amp;gt;mapred.child.java.opts&amp;lt;/name&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;  &amp;lt;value&amp;gt;-Xmx9096m -XX:-UseGCOverheadLimit&amp;lt;/value&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;lt;/property&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; I have attached a screenshot of the jconsole view of the updatedb
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; process...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; From jconsole i can see that cpu is not getting used at all.. its only
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; being used 0.3~.5%.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; The system i'm using should not be a limitation because its an amd
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; 64bit quad core processor with 8gbs of Ram and 1.5 Terabytes of hard
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; disk space...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; Thanks again for all the help
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; On Tue, Nov 3, 2009 at 4:15 AM, Julien Nioche
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26242490&amp;i=3&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;lists.digitalpebble@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; OK. What heapsize did you specify for this job? Could it be that you are
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; running out of RAM and GCing a lot? Still, it should not take THAT long.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; Can you see some variations in the stacktraces or are they always pointing
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; at the same things?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; The operations on the metadata take an awful lot of time, which is why I did
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; NUTCH-702; however, that does not explain why processing a dataset this size
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; takes 20 days.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; J.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; 2009/11/3 Kalaimathan Mahenthiran &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26242490&amp;i=4&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mathan55@...&lt;/a&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; I have lot of space left on the /tmp . I don't have separate partition
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; for /tmp... i have a folder called /tmp... There is lot of space
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; left.. close to 1.3Terabytes...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;                      1.4T   55G  1.3T   5% /
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; tmpfs                 3.8G     0  3.8G   0% /lib/init/rw
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; varrun                3.8G  120K  3.8G   1% /var/run
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; varlock               3.8G     0  3.8G   0% /var/lock
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; udev                  3.8G  152K  3.8G   1% /dev
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; tmpfs                 3.8G     0  3.8G   0% /dev/shm
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; lrm                   3.8G  2.5M  3.8G   1%
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; /lib/modules/2.6.28-15-server/volatile
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; /dev/sda5             228M   29M  187M  14% /boot
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; /dev/sr0              388K  388K     0 100% /media/cdrom0
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; I also noticed that /tmp/hadoop-root directory is 6.8 Gb...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; I have attached the jstack of the process that is doing the update....
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; below
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 19:11:54
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; mode):
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;Attach Listener&amp;quot; daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; waiting on condition [0x0000000000000000]
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;   java.lang.Thread.State: RUNNABLE
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;Comm thread for attempt_local_0001_r_000000_0&amp;quot; daemon prio=10
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; [0x00007f4000e97000]
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;   java.lang.Thread.State: TIMED_WAITING (sleeping)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.Thread.sleep(Native Method)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.mapred.Task$1.run(Task.java:403)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.Thread.run(Thread.java:619)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;Thread-12&amp;quot; prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; [0x00007f4000f98000]
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;   java.lang.Thread.State: RUNNABLE
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.Byte.hashCode(Byte.java:394)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        - locked &amp;lt;0x00007f47ef4d9310&amp;gt; (a org.apache.hadoop.io.MapWritable)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.io.AbstractMapWritable.&amp;lt;init&amp;gt;(AbstractMapWritable.java:128)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.io.MapWritable.&amp;lt;init&amp;gt;(MapWritable.java:42)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.io.MapWritable.&amp;lt;init&amp;gt;(MapWritable.java:52)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;Low Memory Detector&amp;quot; daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; runnable [0x0000000000000000]
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;   java.lang.Thread.State: RUNNABLE
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;CompilerThread1&amp;quot; daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; waiting on condition [0x0000000000000000]
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;   java.lang.Thread.State: RUNNABLE
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;CompilerThread0&amp;quot; daemon prio=10 tid=0x00000000417be800 nid=0x25ce
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; waiting on condition [0x0000000000000000]
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;   java.lang.Thread.State: RUNNABLE
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;Signal Dispatcher&amp;quot; daemon prio=10 tid=0x00000000417bc800 nid=0x25cd
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; runnable [0x0000000000000000]
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;   java.lang.Thread.State: RUNNABLE
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;Finalizer&amp;quot; daemon prio=10 tid=0x000000004179e000 nid=0x25cc in
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; Object.wait() [0x00007f40016f7000]
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;   java.lang.Thread.State: WAITING (on object monitor)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.Object.wait(Native Method)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        - waiting on &amp;lt;0x00007f400f63e6c0&amp;gt; (a java.lang.ref.ReferenceQueue$Lock)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        - locked &amp;lt;0x00007f400f63e6c0&amp;gt; (a java.lang.ref.ReferenceQueue$Lock)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;Reference Handler&amp;quot; daemon prio=10 tid=0x0000000041797000 nid=0x25cb
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; in Object.wait() [0x00007f40017f8000]
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;   java.lang.Thread.State: WAITING (on object monitor)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.Object.wait(Native Method)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        - waiting on &amp;lt;0x00007f400f63e6f8&amp;gt; (a java.lang.ref.Reference$Lock)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.Object.wait(Object.java:485)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        - locked &amp;lt;0x00007f400f63e6f8&amp;gt; (a java.lang.ref.Reference$Lock)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;main&amp;quot; prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; [0x00007f49d75c2000]
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;   java.lang.Thread.State: TIMED_WAITING (sleeping)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at java.lang.Thread.sleep(Native Method)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;VM Thread&amp;quot; prio=10 tid=0x0000000041790000 nid=0x25ca runnable
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;GC task thread#0 (ParallelGC)&amp;quot; prio=10 tid=0x000000004173e000
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; nid=0x25c6 runnable
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;GC task thread#1 (ParallelGC)&amp;quot; prio=10 tid=0x0000000041740000
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; nid=0x25c7 runnable
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;GC task thread#2 (ParallelGC)&amp;quot; prio=10 tid=0x0000000041742000
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; nid=0x25c8 runnable
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;GC task thread#3 (ParallelGC)&amp;quot; prio=10 tid=0x0000000041744000
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; nid=0x25c9 runnable
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;quot;VM Periodic Task Thread&amp;quot; prio=10 tid=0x00007f3ffc006800 nid=0x25d1
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; waiting on condition
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; JNI global references: 907
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; Any help related to this would be really helpful...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26242490&amp;i=5&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;lists.digitalpebble@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; Hi again
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; I know the process is not stuck and is still running, because I
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; turned on the hadoop logs and I can see logs being written...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; I'm not sure how to check if the task is completely stuck or not...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; Run jps to identify the process id, then run *jstack id* several times to see if
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; it is blocked at the same place.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; How much space do you have left on the partition where /tmp is mounted?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; J.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; Below is a sample of the log as I'm sending this email. It's been on the
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; updatedb process for the last 19 days and it has been generating
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; debug logs similar to this...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; Has anyone else had this same issue before?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; org.apache.hadoop.mapred.Task$Counter with bundle
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; COMBINE_OUTPUT_RECORDS
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; MAP_INPUT_RECORDS
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; MAP_OUTPUT_BYTES
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; MAP_INPUT_BYTES
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; MAP_OUTPUT_RECORDS
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; COMBINE_INPUT_RECORDS
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:21,643 INFO  mapred.JobClient -  map 93% reduce 0%
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:22,121 INFO  mapred.MapTask - Spilling map output:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; record full = true
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:22,121 INFO  mapred.MapTask - bufstart = 10420198;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; bufend = 13893589; bufvoid = 99614720
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:22,121 INFO  mapred.MapTask - kvstart = 131070;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; kvend
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; = 65533; length = 327680
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:22,427 INFO  mapred.MapTask - Finished spill 3
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,301 INFO  mapred.MapTask - Starting flush of map
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; output
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,384 INFO  mapred.MapTask - Finished spill 4
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,390 INFO  mapred.Merger - Merging 5 sorted
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; segments
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,392 INFO  mapred.Merger - Down to the last
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; merge-pass, with 5 segments left of total size: 1192 bytes
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,393 INFO  mapred.MapTask - Index: (0, 354, 358)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,394 INFO  mapred.TaskRunner -
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; Task:attempt_local_0001_m_000003_0 is done. And is in the process of
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; commiting
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner -
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; attempt_local_0001_m_000003_0 Progress/ping thread exiting since it
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; got interrupted
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,395 INFO  mapred.LocalJobRunner -
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,397 INFO  mapred.TaskRunner - Task 'attempt_local_0001_m_000003_0' done.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 0:0
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,397 DEBUG conf.Configuration - java.io.IOException: config(config)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.conf.Configuration.&amp;lt;init&amp;gt;(Configuration.java:192)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.mapred.JobConf.&amp;lt;init&amp;gt;(JobConf.java:139)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to /tmp/hadoop-root/mapred/local/localRunner/split.dta
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - attempt_local_0001_m_000004_0 Progress/ping thread started
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,452 INFO  mapred.MapTask - numReduceTasks: 1
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,453 INFO  mapred.MapTask - io.sort.mb = 100
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; 2009-11-02 13:34:23,644 INFO  mapred.JobClient -  map 100% reduce 0%
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; Mathan
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26242490&amp;i=6&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; Kalaimathan Mahenthiran wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; I forgot to add the detail...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; The segment I'm trying to run updatedb on has 1.3 million urls fetched
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; and 1.08 million urls parsed.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; Any help related to this would be appreciated...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26242490&amp;i=7&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;mathan55@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; hi everyone
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; I'm using nutch 1.0. I have fetched successfully and am currently on the
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; updatedb step, which is taking a very long time, and I don't know why.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; I have a new machine with a quad-core processor and 8 GB of RAM.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; I believe this system is really good in terms of processing power, so I
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; don't think processing power is the problem here. I noticed that nearly all
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; the RAM is being used up (close to 7.7 GB) by the updatedb process, and
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; the computer is becoming really slow.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; The updatedb process has been running continuously for the last 19 days
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; with the message &amp;quot;merging segment data into db&amp;quot;. Does anyone know why
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; it's taking so long? Is there any configuration setting I can change to
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; increase the speed of the updatedb process?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; First, this process normally takes just a few minutes, depending on the
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; hardware, and not several days - so something is wrong.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; * do you run this in &amp;quot;local&amp;quot; or pseudo-distributed mode (i.e. running a real
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; jobtracker and tasktracker)? Try the pseudo-distributed mode, because then
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; you can monitor the progress in the web UI.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; * how many reduce tasks do you have? With large updates it helps if you run
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; more than 1 reducer, to split the final sorting.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; * if the task appears to be completely stuck, please generate a thread dump
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; (kill -SIGQUIT) and see where it's stuck. This could be related to
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; urlfilter-regex or urlnormalizer-regex - you can identify if these are
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; problematic by removing them from the config and re-running the operation.
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; * minor issue - when specifying the path names of segments and crawldb, do
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; NOT append the trailing slash - it's not harmful in this particular case,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; but you could have a nasty surprise when doing e.g. copy / mv operations ...
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; --
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; Best regards,
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; Andrzej Bialecki     &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;  ___. ___ ___ ___ _ _   __________________________________
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; ___|||__||  \|  ||  |  Embedded Unix, System Integration
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; &lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; --
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; DigitalPebble Ltd
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt; &lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; --
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; DigitalPebble Ltd
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt; &lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/updatedb-is-talking-long-long-time-tp26158383p26242490.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26242131</id>
	<title>Re: Hadoop wants to do whoami?</title>
	<published>2009-11-06T20:44:43Z</published>
	<updated>2009-11-06T20:44:43Z</updated>
	<author>
		<name>Ken Krugler</name>
	</author>
	<content type="html">Hi Paul,
&lt;br&gt;&lt;br&gt;On Nov 6, 2009, at 5:39pm, Paul Tomblin wrote:
&lt;br&gt;&lt;br&gt;&amp;gt; Wait a second, doesn't Linux have a fork that does copy on write? &amp;nbsp;I
&lt;br&gt;&amp;gt; mean, Sun OS got that around 1990 or so (at least that's when I was
&lt;br&gt;&amp;gt; told not to bother using vfork because fork now didn't use as much
&lt;br&gt;&amp;gt; memory), surely Linux has caught up to 15 years ago.
&lt;br&gt;&lt;br&gt;Normally it works fine, but it will fail if you don't have swap space
&lt;br&gt;allocated, because that's factored into the free-space calculation when the
&lt;br&gt;fork happens.
&lt;br&gt;&lt;br&gt;What's the swap space setup on your VPS?
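&lt;br&gt;&lt;br&gt;On a Linux VPS, one way to check is to look at SwapTotal in /proc/meminfo and at the overcommit policy in /proc/sys/vm/overcommit_memory. A minimal Python sketch, assuming a Linux /proc filesystem (the helper name swap_total_kb is made up for illustration):
&lt;pre&gt;
```python
import os


def swap_total_kb(meminfo_text):
    """Return the SwapTotal value (in kB) from /proc/meminfo text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like "SwapTotal:       2097148 kB".
            return int(line.split()[1])
    return None


if __name__ == "__main__":
    meminfo = "/proc/meminfo"
    overcommit = "/proc/sys/vm/overcommit_memory"
    # Both paths are Linux-specific, so skip quietly on other systems.
    if os.path.exists(meminfo):
        with open(meminfo) as f:
            print("SwapTotal kB:", swap_total_kb(f.read()))
    if os.path.exists(overcommit):
        with open(overcommit) as f:
            # 0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting
            print("vm.overcommit_memory:", f.read().strip())
```
&lt;/pre&gt;
A SwapTotal of 0 kB would match the no-swap situation described above.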
&lt;br&gt;&lt;br&gt;-- Ken
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; On Fri, Nov 6, 2009 at 8:29 PM, Ken Krugler &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26242131&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kkrugler_lists@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&amp;gt;&amp;gt; Hi Paul,
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; Hadoop uses the whoami command to find out what user it's running as.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; When running a command line tool from Java, the process running the JVM gets
&lt;br&gt;&amp;gt;&amp;gt; forked.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; This in turn can trigger out of memory errors if you're running without
&lt;br&gt;&amp;gt;&amp;gt; much/any swap or your OS doesn't support memory overcommit.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I ran into something similar when running some Java code in a VMWare
&lt;br&gt;&amp;gt;&amp;gt; environment where the instances hadn't been set up with any swap space.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; See &lt;a href=&quot;http://issues.apache.org/jira/browse/HADOOP-5059&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://issues.apache.org/jira/browse/HADOOP-5059&lt;/a&gt;&amp;nbsp;for more details.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; -- Ken
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; On Nov 6, 2009, at 5:21pm, Paul Tomblin wrote:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; I'm trying to move my crawler from a shared hosting environment (where
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; it kept getting killed off for using too much memory) to a VPS. &amp;nbsp;But
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; on the new host, I'm getting the following exception:
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; [ WARN] 01:15:17 (FileSystem.java:&amp;lt;init&amp;gt;:1440)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; uri=file:///
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; javax.security.auth.login.LoginException: Login failed: Cannot run
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; program &amp;quot;whoami&amp;quot;: java.io.IOException: error=12, Cannot allocate
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; memory
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.fs.FileSystem$Cache$Key.&amp;lt;init&amp;gt;(FileSystem.java:1438)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:319)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:313)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.crawl.Injector.inject(Injector.java:152)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at com.lucidityworks.nutch.crawler.Crawler.crawlIt(Crawler.java:407)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at com.lucidityworks.nutch.crawler.Crawler.crawlSite(Crawler.java:381)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at com.lucidityworks.nutch.crawler.Crawler.crawlCategory(Crawler.java:255)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at com.lucidityworks.nutch.crawler.Crawler.crawl(Crawler.java:166)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; at com.lucidityworks.nutch.crawler.Crawler.main(Crawler.java:724)
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; What is going on here?
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; --
&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://www.linkedin.com/in/paultomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.linkedin.com/in/paultomblin&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://careers.stackoverflow.com/ptomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://careers.stackoverflow.com/ptomblin&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt;&amp;gt; Ken Krugler
&lt;br&gt;&amp;gt;&amp;gt; +1 530-210-6378
&lt;br&gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://bixolabs.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://bixolabs.com&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt; e l a s t i c &amp;nbsp; w e b &amp;nbsp; m i n i n g
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; -- 
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.linkedin.com/in/paultomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.linkedin.com/in/paultomblin&lt;/a&gt;&lt;br&gt;&amp;gt; &lt;a href=&quot;http://careers.stackoverflow.com/ptomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://careers.stackoverflow.com/ptomblin&lt;/a&gt;&lt;/div&gt;&lt;br&gt;--------------------------------------------
&lt;br&gt;Ken Krugler
&lt;br&gt;+1 530-210-6378
&lt;br&gt;&lt;a href=&quot;http://bixolabs.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://bixolabs.com&lt;/a&gt;&lt;br&gt;e l a s t i c &amp;nbsp; w e b &amp;nbsp; m i n i n g
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Hadoop-wants-to-do-whoami--tp26241246p26242131.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26241631</id>
	<title>Distributed search, is there a better method?</title>
	<published>2009-11-06T18:37:11Z</published>
	<updated>2009-11-06T18:37:11Z</updated>
	<author>
		<name>Jesse Hires</name>
	</author>
	<content type="html">I was curious if anyone has a better method than what I am doing now for
&lt;br&gt;distributed search.
&lt;br&gt;Using one namenode and two datanodes, I am also using the namenode as the
&lt;br&gt;tomcat server, and using the datanodes as the distributed search nodes.
&lt;br&gt;&lt;br&gt;First I generate a segment (-topN 1000)
&lt;br&gt;Then &amp;nbsp;fetch
&lt;br&gt;Then &amp;nbsp;updatedb
&lt;br&gt;Then &amp;nbsp;merge segments into one.
&lt;br&gt;Then &amp;nbsp;invertlinks
&lt;br&gt;&lt;br&gt;Then I do the following:
&lt;br&gt;I get the stats of the crawldb, in order to get the number of URLs
&lt;br&gt;I run mergesegs -slice (1/2 number of urls)
&lt;br&gt;&lt;br&gt;index segment 1 and copytolocal the new index to datanode 1
&lt;br&gt;index segment 2 and copytolocal this new index to datanode 2
&lt;br&gt;&lt;br&gt;then restart nutch servers on the datanodes.
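&lt;br&gt;&lt;br&gt;The slice arithmetic in the steps above can be sketched in Python as follows. This is only a rough illustration: the function names and the 130001 URL count are hypothetical, and it assumes mergesegs -slice takes the number of URLs per output segment.
&lt;pre&gt;
```python
import math


def slice_size(total_urls, search_nodes):
    """URLs per output segment so that the merge yields one segment per search node."""
    if search_nodes == 0:
        raise ValueError("need at least one search node")
    # Round up so a remainder does not spill into an extra, tiny segment.
    return math.ceil(total_urls / search_nodes)


def expected_segments(total_urls, size):
    """Number of segments produced when every segment holds at most `size` URLs."""
    return math.ceil(total_urls / size)


# Hypothetical URL count taken from the crawldb stats, split across two search nodes.
size = slice_size(130001, 2)
print(size, expected_segments(130001, size))
```
&lt;/pre&gt;
Rounding matters here: with an odd URL count, rounding the slice size down (65000 for 130001 URLs) would leave one URL over and give three segments instead of two, which may be one way to end up with an unexpected number of segments.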
&lt;br&gt;&lt;br&gt;It seems to work fine, though I admit I've not gotten beyond about 30k urls
&lt;br&gt;fetched with about 100k urls still unfetched.
&lt;br&gt;&lt;br&gt;I tried using the -slice option on the initial merge, but I found on
&lt;br&gt;occasion there was no parse data in one segment, or I got an unexpected
&lt;br&gt;number of segments. I'm guessing this is because updatedb needs to be run
&lt;br&gt;before I can get an accurate number of URLs to do the math to get the same
&lt;br&gt;number of segments as search servers.
&lt;br&gt;&lt;br&gt;One problem I've run into so far is that the time the generate command
&lt;br&gt;takes increases with each iteration. The only item that really seems to grow out
&lt;br&gt;of control is the unfetched URLs, which is expected with such a small sample
&lt;br&gt;of web pages, but it doesn't make sense to me why it would take so
&lt;br&gt;long to generate a list of 1000 urls to fetch out of a list of 100k. Those
&lt;br&gt;are small numbers in terms of database and computing in general.
&lt;br&gt;&lt;br&gt;The next hangup I run into is mergesegs and mergesegs -slice. Both of
&lt;br&gt;these steps take dramatically longer once the crawl reaches
&lt;br&gt;about 100k URLs.
&lt;br&gt;&lt;br&gt;&lt;br&gt;Is this expected or common?
&lt;br&gt;Has anyone come up with a better way to go through the steps to get multiple
&lt;br&gt;unique indexes to reside on the individual search server nodes?
&lt;br&gt;&lt;br&gt;This is purely academic for me, so there really is no time lost on my part
&lt;br&gt;to change up my approach. I am also purposely using low power hardware.
&lt;br&gt;&lt;br&gt;Jesse
&lt;br&gt;&lt;br&gt;int GetRandomNumber()
&lt;br&gt;{
&lt;br&gt;&amp;nbsp; &amp;nbsp;return 4; // Chosen by fair roll of dice
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; // Guaranteed to be random
&lt;br&gt;} // xkcd.com
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Distributed-search%2C-is-there-a-better-method--tp26241631p26241631.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26241314</id>
	<title>Re: Hadoop wants to do whoami?</title>
	<published>2009-11-06T17:39:07Z</published>
	<updated>2009-11-06T17:39:07Z</updated>
	<author>
		<name>Paul Tomblin</name>
	</author>
	<content type="html">Wait a second, doesn't Linux have a fork that does copy on write? &amp;nbsp;I
&lt;br&gt;mean, Sun OS got that around 1990 or so (at least that's when I was
&lt;br&gt;told not to bother using vfork because fork now didn't use as much
&lt;br&gt;memory), surely Linux has caught up to 15 years ago.
&lt;br&gt;&lt;br&gt;On Fri, Nov 6, 2009 at 8:29 PM, Ken Krugler &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26241314&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;kkrugler_lists@...&lt;/a&gt;&amp;gt; wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hi Paul,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Hadoop uses the whoami command to find out what user it's running as.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; When running a command line tool from Java, the process running the JVM gets
&lt;br&gt;&amp;gt; forked.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; This in turn can trigger out of memory errors if you're running without
&lt;br&gt;&amp;gt; much/any swap or your OS doesn't support memory overcommit.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I ran into something similar when running some Java code in a VMWare
&lt;br&gt;&amp;gt; environment where the instances hadn't been set up with any swap space.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; See &lt;a href=&quot;http://issues.apache.org/jira/browse/HADOOP-5059&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://issues.apache.org/jira/browse/HADOOP-5059&lt;/a&gt;&amp;nbsp;for more details.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; -- Ken
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; On Nov 6, 2009, at 5:21pm, Paul Tomblin wrote:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; I'm trying to move my crawler from a shared hosting environment (where
&lt;br&gt;&amp;gt;&amp;gt; it kept getting killed off for using too much memory) to a VPS.  But
&lt;br&gt;&amp;gt;&amp;gt; on the new host, I'm getting the following exception:
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; [ WARN] 01:15:17 (FileSystem.java:&amp;lt;init&amp;gt;:1440)
&lt;br&gt;&amp;gt;&amp;gt; uri=file:///
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; javax.security.auth.login.LoginException: Login failed: Cannot run
&lt;br&gt;&amp;gt;&amp;gt; program &amp;quot;whoami&amp;quot;: java.io.IOException: error=12, Cannot allocate
&lt;br&gt;&amp;gt;&amp;gt; memory
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.hadoop.fs.FileSystem$Cache$Key.&amp;lt;init&amp;gt;(FileSystem.java:1438)
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:319)
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:313)
&lt;br&gt;&amp;gt;&amp;gt;       at org.apache.nutch.crawl.Injector.inject(Injector.java:152)
&lt;br&gt;&amp;gt;&amp;gt;       at com.lucidityworks.nutch.crawler.Crawler.crawlIt(Crawler.java:407)
&lt;br&gt;&amp;gt;&amp;gt;       at com.lucidityworks.nutch.crawler.Crawler.crawlSite(Crawler.java:381)
&lt;br&gt;&amp;gt;&amp;gt;       at com.lucidityworks.nutch.crawler.Crawler.crawlCategory(Crawler.java:255)
&lt;br&gt;&amp;gt;&amp;gt;       at com.lucidityworks.nutch.crawler.Crawler.crawl(Crawler.java:166)
&lt;br&gt;&amp;gt;&amp;gt;       at com.lucidityworks.nutch.crawler.Crawler.main(Crawler.java:724)
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; What is going on here?
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; --
&lt;br&gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://www.linkedin.com/in/paultomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.linkedin.com/in/paultomblin&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://careers.stackoverflow.com/ptomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://careers.stackoverflow.com/ptomblin&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt; Ken Krugler
&lt;br&gt;&amp;gt; +1 530-210-6378
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://bixolabs.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://bixolabs.com&lt;/a&gt;&lt;br&gt;&amp;gt; e l a s t i c   w e b   m i n i n g
&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;&lt;a href=&quot;http://www.linkedin.com/in/paultomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.linkedin.com/in/paultomblin&lt;/a&gt;&lt;br&gt;&lt;a href=&quot;http://careers.stackoverflow.com/ptomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://careers.stackoverflow.com/ptomblin&lt;/a&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Hadoop-wants-to-do-whoami--tp26241246p26241314.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26241289</id>
	<title>Re: Hadoop wants to do whoami?</title>
	<published>2009-11-06T17:31:56Z</published>
	<updated>2009-11-06T17:31:56Z</updated>
	<author>
		<name>Neera</name>
	</author>
	<content type="html">Most likely the memory allocated to your Java process is set too high.
&lt;br&gt;&lt;br&gt;You usually get this error when
&lt;br&gt;2 * (memory configured for the process) &amp;gt; (total physical memory
&lt;br&gt;available + swap space)
&lt;br&gt;&lt;br&gt;&lt;br&gt;You can find a fuller explanation of this error at
&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://issues.apache.org/jira/browse/HADOOP-5059&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://issues.apache.org/jira/browse/HADOOP-5059&lt;/a&gt;&lt;br&gt;&lt;br&gt;HTH.
&lt;br&gt;&lt;br&gt;Neera
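&lt;br&gt;&lt;br&gt;As a rough sketch of the rule of thumb above (the function name and the sample numbers are made up for illustration, and the real limit also depends on the kernel's overcommit settings):

```python
def fork_may_fail(heap_mb, ram_mb, swap_mb):
    """Rule of thumb from the post: True when twice the memory
    configured for the process exceeds physical memory plus swap."""
    return 2 * heap_mb > ram_mb + swap_mb


# Hypothetical numbers: a 512 MB VPS with no swap.
print(fork_may_fail(heap_mb=384, ram_mb=512, swap_mb=0))  # 768 > 512
print(fork_may_fail(heap_mb=200, ram_mb=512, swap_mb=0))  # 400 does not exceed 512
```

&lt;br&gt;&lt;br&gt;If the check comes out true, lowering the Java heap setting or adding swap are the two obvious things to try.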
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;On Fri, Nov 6, 2009 at 5:21 PM, Paul Tomblin &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26241289&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ptomblin@...&lt;/a&gt;&amp;gt; wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; I'm trying to move my crawler from a shared hosting environment (where
&lt;br&gt;&amp;gt; it kept getting killed off for using too much memory) to a VPS.  But
&lt;br&gt;&amp;gt; on the new host, I'm getting the following exception:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; [ WARN] 01:15:17 (FileSystem.java:&amp;lt;init&amp;gt;:1440)
&lt;br&gt;&amp;gt; uri=file:///
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; javax.security.auth.login.LoginException: Login failed: Cannot run
&lt;br&gt;&amp;gt; program &amp;quot;whoami&amp;quot;: java.io.IOException: error=12, Cannot allocate
&lt;br&gt;&amp;gt; memory
&lt;br&gt;&amp;gt;        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
&lt;br&gt;&amp;gt;        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
&lt;br&gt;&amp;gt;        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
&lt;br&gt;&amp;gt;        at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
&lt;br&gt;&amp;gt;        at org.apache.hadoop.fs.FileSystem$Cache$Key.&amp;lt;init&amp;gt;(FileSystem.java:1438)
&lt;br&gt;&amp;gt;        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
&lt;br&gt;&amp;gt;        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
&lt;br&gt;&amp;gt;        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
&lt;br&gt;&amp;gt;        at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:319)
&lt;br&gt;&amp;gt;        at org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:313)
&lt;br&gt;&amp;gt;        at org.apache.nutch.crawl.Injector.inject(Injector.java:152)
&lt;br&gt;&amp;gt;        at com.lucidityworks.nutch.crawler.Crawler.crawlIt(Crawler.java:407)
&lt;br&gt;&amp;gt;        at com.lucidityworks.nutch.crawler.Crawler.crawlSite(Crawler.java:381)
&lt;br&gt;&amp;gt;        at com.lucidityworks.nutch.crawler.Crawler.crawlCategory(Crawler.java:255)
&lt;br&gt;&amp;gt;        at com.lucidityworks.nutch.crawler.Crawler.crawl(Crawler.java:166)
&lt;br&gt;&amp;gt;        at com.lucidityworks.nutch.crawler.Crawler.main(Crawler.java:724)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; What is going on here?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.linkedin.com/in/paultomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.linkedin.com/in/paultomblin&lt;/a&gt;&lt;br&gt;&amp;gt; &lt;a href=&quot;http://careers.stackoverflow.com/ptomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://careers.stackoverflow.com/ptomblin&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&lt;/div&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Hadoop-wants-to-do-whoami--tp26241246p26241289.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26241282</id>
	<title>Re: Hadoop wants to do whoami?</title>
	<published>2009-11-06T17:29:59Z</published>
	<updated>2009-11-06T17:29:59Z</updated>
	<author>
		<name>Ken Krugler</name>
	</author>
	<content type="html">Hi Paul,
&lt;br&gt;&lt;br&gt;Hadoop uses the whoami command to find out what user it's running as.
&lt;br&gt;&lt;br&gt;When Java runs a command-line tool, the process running the JVM
&lt;br&gt;gets forked.
&lt;br&gt;&lt;br&gt;The fork can fail with an out-of-memory error if you're running with
&lt;br&gt;little or no swap, or if your OS doesn't allow memory overcommit.
&lt;br&gt;&lt;br&gt;I ran into something similar when running some Java code in a VMware
&lt;br&gt;environment where the instances hadn't been set up with any swap space.
&lt;br&gt;&lt;br&gt;See &lt;a href=&quot;http://issues.apache.org/jira/browse/HADOOP-5059&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://issues.apache.org/jira/browse/HADOOP-5059&lt;/a&gt;&amp;nbsp;for more details.
&lt;br&gt;&lt;br&gt;-- Ken
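&lt;br&gt;&lt;br&gt;A minimal sketch for checking whether a Linux host has any swap configured, assuming /proc/meminfo-style text as input (the helper name and the sample text are hypothetical, not anything Hadoop provides):

```python
def swap_total_kb(meminfo_text):
    """Return the SwapTotal value in kB from /proc/meminfo-style text,
    or None if the field is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return None


# Sample text for illustration; on Linux you could instead pass
# open("/proc/meminfo").read().
sample = "MemTotal:  524288 kB\nSwapTotal:      0 kB\n"
if swap_total_kb(sample) == 0:
    print("no swap configured; forking a large JVM may fail with error=12")
```

&lt;br&gt;&lt;br&gt;A result of 0 would match the no-swap situation described above.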
&lt;br&gt;&lt;br&gt;&lt;br&gt;On Nov 6, 2009, at 5:21pm, Paul Tomblin wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; I'm trying to move my crawler from a shared hosting environment (where
&lt;br&gt;&amp;gt; it kept getting killed off for using too much memory) to a VPS. &amp;nbsp;But
&lt;br&gt;&amp;gt; on the new host, I'm getting the following exception:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; [ WARN] 01:15:17 (FileSystem.java:&amp;lt;init&amp;gt;:1440)
&lt;br&gt;&amp;gt; uri=file:///
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; javax.security.auth.login.LoginException: Login failed: Cannot run
&lt;br&gt;&amp;gt; program &amp;quot;whoami&amp;quot;: java.io.IOException: error=12, Cannot allocate
&lt;br&gt;&amp;gt; memory
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.fs.FileSystem$Cache$Key.&amp;lt;init&amp;gt;(FileSystem.java:1438)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:319)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:313)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.nutch.crawl.Injector.inject(Injector.java:152)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at com.lucidityworks.nutch.crawler.Crawler.crawlIt(Crawler.java:407)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at com.lucidityworks.nutch.crawler.Crawler.crawlSite(Crawler.java:381)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at com.lucidityworks.nutch.crawler.Crawler.crawlCategory(Crawler.java:255)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at com.lucidityworks.nutch.crawler.Crawler.crawl(Crawler.java:166)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at com.lucidityworks.nutch.crawler.Crawler.main(Crawler.java:724)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; What is going on here?
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; -- 
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.linkedin.com/in/paultomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.linkedin.com/in/paultomblin&lt;/a&gt;&lt;br&gt;&amp;gt; &lt;a href=&quot;http://careers.stackoverflow.com/ptomblin&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://careers.stackoverflow.com/ptomblin&lt;/a&gt;&lt;/div&gt;&lt;br&gt;--------------------------------------------
&lt;br&gt;Ken Krugler
&lt;br&gt;+1 530-210-6378
&lt;br&gt;&lt;a href=&quot;http://bixolabs.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://bixolabs.com&lt;/a&gt;&lt;br&gt;e l a s t i c &amp;nbsp; w e b &amp;nbsp; m i n i n g
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Hadoop-wants-to-do-whoami--tp26241246p26241282.html" />
</entry>

</feed>
