<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<id>tag:old.nabble.com,2006:forum-373</id>
	<title>Nabble - Nutch - Dev</title>
	<updated>2009-11-07T21:41:30Z</updated>
	<link rel="self" type="application/atom+xml" href="http://old.nabble.com/Nutch---Dev-f373.xml" />
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Nutch---Dev-f373.html" />
	<subtitle type="html"></subtitle>
	
<entry>
	<id>tag:old.nabble.com,2006:post-26251198</id>
	<title>Hudson build is back to normal: Nutch-trunk #986</title>
	<published>2009-11-07T21:41:30Z</published>
	<updated>2009-11-07T21:41:30Z</updated>
	<author>
		<name>Apache Hudson Server</name>
	</author>
	<content type="html">See &amp;lt;&lt;a href=&quot;http://hudson.zones.apache.org/hudson/job/Nutch-trunk/986/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://hudson.zones.apache.org/hudson/job/Nutch-trunk/986/&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Build-failed-in-Hudson%3A-Nutch-trunk--985-tp26241956p26251198.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26241956</id>
	<title>Build failed in Hudson: Nutch-trunk #985</title>
	<published>2009-11-06T20:03:16Z</published>
	<updated>2009-11-06T20:03:16Z</updated>
	<author>
		<name>Apache Hudson Server</name>
	</author>
	<content type="html">See &amp;lt;&lt;a href=&quot;http://hudson.zones.apache.org/hudson/job/Nutch-trunk/985/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://hudson.zones.apache.org/hudson/job/Nutch-trunk/985/&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;br&gt;------------------------------------------
&lt;br&gt;A timer trigger started this job
&lt;br&gt;Building remotely on lucene.zones.apache.org (Solaris 10)
&lt;br&gt;Checking out &lt;a href=&quot;http://svn.apache.org/repos/asf/lucene/nutch/trunk&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://svn.apache.org/repos/asf/lucene/nutch/trunk&lt;/a&gt;&lt;br&gt;ERROR: Failed to check out &lt;a href=&quot;http://svn.apache.org/repos/asf/lucene/nutch/trunk&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://svn.apache.org/repos/asf/lucene/nutch/trunk&lt;/a&gt;&lt;br&gt;org.tmatesoft.svn.core.SVNException: svn: timed out waiting for server
&lt;br&gt;svn: OPTIONS request failed on '/repos/asf/lucene/nutch/trunk'
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:103)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:87)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:616)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:273)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:261)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.exchangeCapabilities(DAVConnection.java:516)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.open(DAVConnection.java:98)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.openConnection(DAVRepository.java:1001)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.getLatestRevision(DAVRepository.java:178)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.wc.SVNBasicClient.getRevisionNumber(SVNBasicClient.java:482)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.wc.SVNBasicClient.getLocations(SVNBasicClient.java:851)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.wc.SVNBasicClient.createRepository(SVNBasicClient.java:534)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.wc.SVNUpdateClient.doCheckout(SVNUpdateClient.java:893)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.wc.SVNUpdateClient.doCheckout(SVNUpdateClient.java:791)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:617)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:543)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2052)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at hudson.remoting.UserRequest.perform(UserRequest.java:69)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at hudson.remoting.UserRequest.perform(UserRequest.java:23)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at hudson.remoting.Request$2.run(Request.java:200)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.util.concurrent.FutureTask.run(FutureTask.java:138)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.Thread.run(Thread.java:619)
&lt;br&gt;Caused by: java.net.SocketTimeoutException: connect timed out
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.net.PlainSocketImpl.socketConnect(Native Method)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.net.Socket.connect(Socket.java:519)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.tmatesoft.svn.core.internal.util.SVNSocketConnection.run(SVNSocketConnection.java:57)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ... 1 more
&lt;br&gt;Archiving artifacts
&lt;br&gt;Publishing Javadoc
&lt;br&gt;Recording test results
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Build-failed-in-Hudson%3A-Nutch-trunk--985-tp26241956p26241956.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26236750</id>
	<title>New attachment added to page Presentations on Nutch Wiki</title>
	<published>2009-11-06T10:42:08Z</published>
	<updated>2009-11-06T10:42:08Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page &amp;quot;Presentations&amp;quot; for change notification. An attachment has been added to that page by AndrzejBialecki. Following detailed information is available:
&lt;br&gt;&lt;br&gt;Attachment name: apachecon09.pdf
&lt;br&gt;Attachment size: 990738
&lt;br&gt;Attachment link: &lt;a href=&quot;http://wiki.apache.org/nutch/Presentations?action=AttachFile&amp;do=get&amp;target=apachecon09.pdf&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/Presentations?action=AttachFile&amp;do=get&amp;target=apachecon09.pdf&lt;/a&gt;&lt;br&gt;Page link: &lt;a href=&quot;http://wiki.apache.org/nutch/Presentations&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/Presentations&lt;/a&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/New-attachment-added-to-page-Presentations-on-Nutch-Wiki-tp26236750p26236750.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26235714</id>
	<title>[Nutch Wiki] Update of &quot;Presentations&quot; by AndrzejBialecki</title>
	<published>2009-11-06T09:33:56Z</published>
	<updated>2009-11-06T09:33:56Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&lt;br&gt;The &amp;quot;Presentations&amp;quot; page has been changed by AndrzejBialecki.
&lt;br&gt;&lt;a href=&quot;http://wiki.apache.org/nutch/Presentations?action=diff&amp;rev1=10&amp;rev2=11&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/Presentations?action=diff&amp;rev1=10&amp;rev2=11&lt;/a&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------------
&lt;br&gt;&lt;br&gt;+ Recent presentations:
&lt;br&gt;+ 
&lt;br&gt;+ &amp;nbsp;* [[attachment:apachecon09.pdf|apachecon09.pdf]]: &amp;quot;Nutch, web-scale search engine toolkit&amp;quot;, Andrzej Bialecki, 5 Nov 2009, Apache``Con`` 2009, Oakland.
&lt;br&gt;+ 
&lt;br&gt;+ 
&lt;br&gt;&amp;nbsp; Here are the slides from presentations about Nutch given by Doug Cutting:
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; &amp;nbsp; * [[attachment:serverside1.pdf|serverside1.pdf]]: &amp;quot;Intranet Search with Nutch&amp;quot;, 6 May 2004, The``Server``Side Java Symposium, Las Vegas.
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22Presentations%22-by-AndrzejBialecki-tp26235714p26235714.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26235654</id>
	<title>New attachment added to page Presentations on Nutch Wiki</title>
	<published>2009-11-06T09:29:18Z</published>
	<updated>2009-11-06T09:29:18Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page &amp;quot;Presentations&amp;quot; for change notification. An attachment has been added to that page by AndrzejBialecki. Following detailed information is available:
&lt;br&gt;&lt;br&gt;Attachment name: apachecon09.pdf
&lt;br&gt;Attachment size: 1313737
&lt;br&gt;Attachment link: &lt;a href=&quot;http://wiki.apache.org/nutch/Presentations?action=AttachFile&amp;do=get&amp;target=apachecon09.pdf&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/Presentations?action=AttachFile&amp;do=get&amp;target=apachecon09.pdf&lt;/a&gt;&lt;br&gt;Page link: &lt;a href=&quot;http://wiki.apache.org/nutch/Presentations&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/Presentations&lt;/a&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/New-attachment-added-to-page-Presentations-on-Nutch-Wiki-tp26235654p26235654.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26219999</id>
	<title>[jira] Created: (NUTCH-763) Separate configuration files from resources to be included in the job file</title>
	<published>2009-11-05T10:34:32Z</published>
	<updated>2009-11-05T10:34:32Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">Separate configuration files from resources to be included in the job file
&lt;br&gt;--------------------------------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: NUTCH-763
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-763&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-763&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Nutch
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Wish
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Julien Nioche
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Priority: Minor
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Fix For: 1.1
&lt;br&gt;&lt;br&gt;&lt;br&gt;One of the things I found confusing when I was learning Nutch was that the conf/ directory contains, at the same time:
&lt;br&gt;- configuration files for Hadoop / Nutch which are put in the jar files but not used there
&lt;br&gt;- resource files (e.g. filtering rules) which MUST be up to date in the job file
&lt;br&gt;&lt;br&gt;I would separate the conf/ directory from, say, a resources/ directory which would contain the rule files and other things to put in the job file. Unless I am mistaken, none of the configuration files needs to be in the job file. I know it is a very minor point, but that would probably simplify things and make it easier for beginners to understand what has to be modified where. 
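&lt;br&gt;&lt;br&gt;A minimal sketch of what such a split could look like, written in Python for brevity. The CONFIG_NAMES set and the split_conf helper are hypothetical names introduced for illustration; which files really count as pure configuration is an assumption that would have to be checked against the build:
&lt;br&gt;&lt;pre&gt;
```python
import shutil
from pathlib import Path

# Hypothetical list of files treated as pure configuration (an assumption
# for illustration only; the issue does not enumerate them).
CONFIG_NAMES = {'nutch-default.xml', 'nutch-site.xml', 'hadoop-site.xml'}


def split_conf(conf_dir, resources_dir):
    """Move every file not listed in CONFIG_NAMES from conf_dir to resources_dir.

    Returns the sorted names of the files that were moved.
    """
    conf_dir = Path(conf_dir)
    resources_dir = Path(resources_dir)
    resources_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(conf_dir.iterdir()):
        # Anything that is not pure configuration is treated as a resource
        # (e.g. filtering rules) that belongs in the job file.
        if path.is_file() and path.name not in CONFIG_NAMES:
            shutil.move(str(path), str(resources_dir / path.name))
            moved.append(path.name)
    return moved
```
&lt;/pre&gt;
&lt;br&gt;Under these assumptions, calling split_conf('conf', 'resources') once would leave only the listed configuration files in conf/ and collect the rule files in resources/.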
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-763%29-Separate-configuration-files-from-resources-to-be-included-in-the-job-file-tp26219999p26219999.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26213390</id>
	<title>Re: [Nutch Wiki] Update of &quot;ApacheConUs2009MeetUp&quot; by KenKrugler</title>
	<published>2009-11-05T04:08:40Z</published>
	<updated>2009-11-05T04:08:40Z</updated>
	<author>
		<name>Andrzej Bialecki</name>
	</author>
	<content type="html">Bartosz Gadzimski wrote:
&lt;br&gt;&amp;gt; Hello,
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Will there be any materials after the meeting? Wiki pages, slides, 
&lt;br&gt;&amp;gt; video, podcasts? Would be great!
&lt;br&gt;&lt;br&gt;I will be adding some stuff later - no slides, as Ken said it was very 
&lt;br&gt;informal. Though the slides that I present during my Nutch talk have a 
&lt;br&gt;section about the roadmap, and I will be adding these to the wiki, too.
&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;Best regards,
&lt;br&gt;Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;nbsp; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;[__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22ApacheConUs2009MeetUp%22-by-KenKrugler-tp26205427p26213390.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26212369</id>
	<title>Re: [Nutch Wiki] Update of &quot;ApacheConUs2009MeetUp&quot; by KenKrugler</title>
	<published>2009-11-05T02:41:06Z</published>
	<updated>2009-11-05T02:41:06Z</updated>
	<author>
		<name>Ken Krugler</name>
	</author>
	<content type="html">Hi Bartosz,
&lt;br&gt;&lt;br&gt;I've updated the wiki, and others who attended might add/edit as &amp;nbsp;
&lt;br&gt;necessary.
&lt;br&gt;&lt;br&gt;No video/podcast - it wasn't so high tech as that, just three of us in &amp;nbsp;
&lt;br&gt;a spare room with Thorsten on Skype.
&lt;br&gt;&lt;br&gt;We're still waiting for some input from the Heritrix team, I think, &amp;nbsp;
&lt;br&gt;before moving forward.
&lt;br&gt;&lt;br&gt;-- Ken
&lt;br&gt;&lt;br&gt;&lt;br&gt;On Nov 5, 2009, at 12:01am, Bartosz Gadzimski wrote:
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Hello,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Will there be any materials after the meeting? Wiki pages, &amp;nbsp;
&lt;br&gt;&amp;gt; slides, video, podcasts? Would be great!
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Thanks,
&lt;br&gt;&amp;gt; Bartosz
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Apache Wiki pisze:
&lt;br&gt;&amp;gt;&amp;gt; Dear Wiki user,
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; for change notification.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; The &amp;quot;ApacheConUs2009MeetUp&amp;quot; page has been changed by KenKrugler.
&lt;br&gt;&amp;gt;&amp;gt; &lt;a href=&quot;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=5&amp;rev2=6&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=5&amp;rev2=6&lt;/a&gt;&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; --------------------------------------------------
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt; - We were planning to have a &amp;quot;Web Crawler Developer&amp;quot; !MeetUp at &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; this year's [[&lt;a href=&quot;http://www.us.apachecon.com/c/acus2009/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.us.apachecon.com/c/acus2009/&lt;/a&gt;|ApacheCon &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; US]] in Oakland.
&lt;br&gt;&amp;gt;&amp;gt; + We had a &amp;quot;Web Crawler Developer&amp;quot; !MeetUp at this year's [[&lt;a href=&quot;http://www.us.apachecon.com/c/acus2009/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.us.apachecon.com/c/acus2009/&lt;/a&gt;&amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; |ApacheCon US]] in Oakland.
&lt;br&gt;&amp;gt;&amp;gt; - Unfortunately the only time slot where people would be around &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; was Thursday night, which wound up conflicting with the Hadoop !MeetUp.
&lt;br&gt;&amp;gt;&amp;gt; + It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; November 4th from 11am - 1pm.
&lt;br&gt;&amp;gt;&amp;gt; - So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; November 4th from 11am - 1pm. Location is TBD, hopefully we can get &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; some space at the event but might be a lunch meeting :)
&lt;br&gt;&amp;gt;&amp;gt; + == Attendees ==
&lt;br&gt;&amp;gt;&amp;gt; + + &amp;nbsp;* Andrzej Bialecki - Apache Nutch
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Thorsten xxx - Apache Droids
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Michael Stack - Formerly with Heritrix, now HBase
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Ken Krugler - Bixo
&lt;br&gt;&amp;gt;&amp;gt; + + == Topics ==
&lt;br&gt;&amp;gt;&amp;gt; + + === Roadmaps ===
&lt;br&gt;&amp;gt;&amp;gt; + + Nutch - become more component based.
&lt;br&gt;&amp;gt;&amp;gt; + Droids - get more people involved.
&lt;br&gt;&amp;gt;&amp;gt; + + === Sharable Components ===
&lt;br&gt;&amp;gt;&amp;gt; + + &amp;nbsp;* robots.txt parsing
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* URL normalization
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* URL filtering
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Page cleansing
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp; * General purpose
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp; * Specialized
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Sub-page parsing (portlets)
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* AJAX-ish page interactions
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Document parsing (via Tika)
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* HttpClient (configuration)
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Text similarity
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Mime/charset/language detection
&lt;br&gt;&amp;gt;&amp;gt; + + === Tika ===
&lt;br&gt;&amp;gt;&amp;gt; + + &amp;nbsp;* Needs help to become really usable
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Would benefit from large test corpus
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Could do comparison with Nutch parser
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Needs option for direct DOM querying (screen scraping tasks)
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Handles mime &amp; charset detection now (some issues)
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Could be extended to include language detection (wrap other &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; impl)
&lt;br&gt;&amp;gt;&amp;gt; + + === URL Normalization ===
&lt;br&gt;&amp;gt;&amp;gt; + + &amp;nbsp;* Includes both domain (www.x.com == x.com), path, and query &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; portions of URL
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Often site-specific rules
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp; * Option to derive rules using URLs to similar documents.
&lt;br&gt;&amp;gt;&amp;gt; + + === AJAX-ish Page Interaction ===
&lt;br&gt;&amp;gt;&amp;gt; + + &amp;nbsp;* Not applicable for broad/general crawling
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Can be very important for specific web sites
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Use Selenium or headless Mozilla
&lt;br&gt;&amp;gt;&amp;gt; + + === Component API Issues ===
&lt;br&gt;&amp;gt;&amp;gt; + + &amp;nbsp;* Want to avoid using an API that's tied too closely to any &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; implementation.
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* One option is to have simple (e.g. URL param) API that takes &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; meta-data.
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp; * Similar to Tika passing in of meta-data.
&lt;br&gt;&amp;gt;&amp;gt; + + === Hosting Options ===
&lt;br&gt;&amp;gt;&amp;gt; + + &amp;nbsp;* As part of Nutch - but easy to get lost in Nutch codebase, &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; and can be associated too closely with Nutch.
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* As part of Droids - but Droids is both a framework (queue- 
&lt;br&gt;&amp;gt;&amp;gt; based) and set of components.
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* New sub-project under Lucene TLP - but overhead to set up/ 
&lt;br&gt;&amp;gt;&amp;gt; maintain, and then confusion between it and Droids.
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Google code - seems like a good short-term solution, to judge &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; level of interest and help shake out issues.
&lt;br&gt;&amp;gt;&amp;gt; + + == Next Steps ==
&lt;br&gt;&amp;gt;&amp;gt; + + &amp;nbsp;* Get input from Gordon re Heritrix. Stack to follow up with &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; him. Ideally he'd add his comments to this page.
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Get input from Thorsten on Google code option. If OK as &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; starting point, then Andrzej to set up.
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Make decision about build system (and then move on to code &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; formatting debate :))
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp; * I'm going to propose ant + maven ant tasks for dependency &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; management. I'm using this with Bixo, and so far it's been pretty &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; good.
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp;* Start contributing code
&lt;br&gt;&amp;gt;&amp;gt; + &amp;nbsp; * Ken will put in robots.txt parser.
&lt;br&gt;&amp;gt;&amp;gt; + + == Original Discussion Topic List ==
&lt;br&gt;&amp;gt;&amp;gt; &amp;nbsp; &amp;nbsp;Below are some potential topics for discussion - feel free to &amp;nbsp;
&lt;br&gt;&amp;gt;&amp;gt; add/comment.
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;/div&gt;&lt;br&gt;--------------------------------------------
&lt;br&gt;Ken Krugler
&lt;br&gt;+1 530-210-6378
&lt;br&gt;&lt;a href=&quot;http://bixolabs.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://bixolabs.com&lt;/a&gt;&lt;br&gt;e l a s t i c &amp;nbsp; w e b &amp;nbsp; m i n i n g
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22ApacheConUs2009MeetUp%22-by-KenKrugler-tp26205427p26212369.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26210589</id>
	<title>Re: [Nutch Wiki] Update of &quot;ApacheConUs2009MeetUp&quot; by KenKrugler</title>
	<published>2009-11-05T00:01:40Z</published>
	<updated>2009-11-05T00:01:40Z</updated>
	<author>
		<name>Bartosz Gadzimski</name>
	</author>
	<content type="html">Hello,
&lt;br&gt;&lt;br&gt;Will there be any materials after the meeting? Wiki pages, slides, 
&lt;br&gt;video, podcasts? Would be great!
&lt;br&gt;&lt;br&gt;Thanks,
&lt;br&gt;Bartosz
&lt;br&gt;&lt;br&gt;Apache Wiki pisze:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Dear Wiki user,
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; The &amp;quot;ApacheConUs2009MeetUp&amp;quot; page has been changed by KenKrugler.
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=5&amp;rev2=6&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=5&amp;rev2=6&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; --------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; - We were planning to have a &amp;quot;Web Crawler Developer&amp;quot; !MeetUp at this year's [[&lt;a href=&quot;http://www.us.apachecon.com/c/acus2009/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.us.apachecon.com/c/acus2009/&lt;/a&gt;|ApacheCon US]] in Oakland.
&lt;br&gt;&amp;gt; + We had a &amp;quot;Web Crawler Developer&amp;quot; !MeetUp at this year's [[&lt;a href=&quot;http://www.us.apachecon.com/c/acus2009/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.us.apachecon.com/c/acus2009/&lt;/a&gt;|ApacheCon US]] in Oakland.
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;br&gt;&amp;gt; - Unfortunately the only time slot where people would be around was Thursday night, which wound up conflicting with the Hadoop !MeetUp.
&lt;br&gt;&amp;gt; + It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. 
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;br&gt;&amp;gt; - So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. Location is TBD, hopefully we can get some space at the event but might be a lunch meeting :)
&lt;br&gt;&amp;gt; + == Attendees ==
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + &amp;nbsp;* Andrzej Bialecki - Apache Nutch
&lt;br&gt;&amp;gt; + &amp;nbsp;* Thorsten xxx - Apache Droids
&lt;br&gt;&amp;gt; + &amp;nbsp;* Michael Stack - Formerly with Heritrix, now HBase
&lt;br&gt;&amp;gt; + &amp;nbsp;* Ken Krugler - Bixo
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + == Topics ==
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + === Roadmaps ===
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + Nutch - become more component based.
&lt;br&gt;&amp;gt; + Droids - get more people involved.
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + === Sharable Components ===
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + &amp;nbsp;* robots.txt parsing
&lt;br&gt;&amp;gt; + &amp;nbsp;* URL normalization
&lt;br&gt;&amp;gt; + &amp;nbsp;* URL filtering
&lt;br&gt;&amp;gt; + &amp;nbsp;* Page cleansing
&lt;br&gt;&amp;gt; + &amp;nbsp; * General purpose
&lt;br&gt;&amp;gt; + &amp;nbsp; * Specialized
&lt;br&gt;&amp;gt; + &amp;nbsp;* Sub-page parsing (portlets)
&lt;br&gt;&amp;gt; + &amp;nbsp;* AJAX-ish page interactions
&lt;br&gt;&amp;gt; + &amp;nbsp;* Document parsing (via Tika)
&lt;br&gt;&amp;gt; + &amp;nbsp;* HttpClient (configuration)
&lt;br&gt;&amp;gt; + &amp;nbsp;* Text similarity
&lt;br&gt;&amp;gt; + &amp;nbsp;* Mime/charset/language detection
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + === Tika ===
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + &amp;nbsp;* Needs help to become really usable
&lt;br&gt;&amp;gt; + &amp;nbsp;* Would benefit from large test corpus
&lt;br&gt;&amp;gt; + &amp;nbsp;* Could do comparison with Nutch parser
&lt;br&gt;&amp;gt; + &amp;nbsp;* Needs option for direct DOM querying (screen scraping tasks)
&lt;br&gt;&amp;gt; + &amp;nbsp;* Handles mime &amp; charset detection now (some issues)
&lt;br&gt;&amp;gt; + &amp;nbsp;* Could be extended to include language detection (wrap other impl)
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + === URL Normalization ===
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + &amp;nbsp;* Includes both domain (www.x.com == x.com), path, and query portions of URL
&lt;br&gt;&amp;gt; + &amp;nbsp;* Often site-specific rules
&lt;br&gt;&amp;gt; + &amp;nbsp; * Option to derive rules using URLs to similar documents.
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + === AJAX-ish Page Interaction ===
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + &amp;nbsp;* Not applicable for broad/general crawling
&lt;br&gt;&amp;gt; + &amp;nbsp;* Can be very important for specific web sites
&lt;br&gt;&amp;gt; + &amp;nbsp;* Use Selenium or headless Mozilla
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + === Component API Issues ===
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + &amp;nbsp;* Want to avoid using an API that's tied too closely to any implementation.
&lt;br&gt;&amp;gt; + &amp;nbsp;* One option is to have simple (e.g. URL param) API that takes meta-data.
&lt;br&gt;&amp;gt; + &amp;nbsp; * Similar to Tika passing in of meta-data.
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + === Hosting Options ===
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + &amp;nbsp;* As part of Nutch - but easy to get lost in Nutch codebase, and can be associated too closely with Nutch.
&lt;br&gt;&amp;gt; + &amp;nbsp;* As part of Droids - but Droids is both a framework (queue-based) and set of components.
&lt;br&gt;&amp;gt; + &amp;nbsp;* New sub-project under Lucene TLP - but overhead to set up/maintain, and then confusion between it and Droids.
&lt;br&gt;&amp;gt; + &amp;nbsp;* Google code - seems like a good short-term solution, to judge level of interest and help shake out issues.
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + == Next Steps ==
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + &amp;nbsp;* Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page.
&lt;br&gt;&amp;gt; + &amp;nbsp;* Get input from Thorsten on Google code option. If OK as starting point, then Andrzej to set up.
&lt;br&gt;&amp;gt; + &amp;nbsp;* Make decision about build system (and then move on to code formatting debate :))
&lt;br&gt;&amp;gt; + &amp;nbsp; * I'm going to propose ant + maven ant tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good.
&lt;br&gt;&amp;gt; + &amp;nbsp;* Start contributing code
&lt;br&gt;&amp;gt; + &amp;nbsp; * Ken will put in robots.txt parser.
&lt;br&gt;&amp;gt; + 
&lt;br&gt;&amp;gt; + == Original Discussion Topic List ==
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;br&gt;&amp;gt; &amp;nbsp; Below are some potential topics for discussion - feel free to add/comment.
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; 
&lt;/div&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22ApacheConUs2009MeetUp%22-by-KenKrugler-tp26205427p26210589.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26214590</id>
	<title>MergeSegments - map reduce thread death</title>
	<published>2009-11-04T17:29:14Z</published>
	<updated>2009-11-04T17:29:14Z</updated>
	<author>
		<name>Fadzi Ushewokunze-2</name>
	</author>
	<content type="html">Hi there,
&lt;br&gt;&lt;br&gt;It seems I have some serious problems with Hadoop during map-reduce for
&lt;br&gt;MergeSegments.
&lt;br&gt;&lt;br&gt;I am out of ideas on this. Any suggestions would be very welcome.
&lt;br&gt;&lt;br&gt;Here is my setup:
&lt;br&gt;&lt;br&gt;RAM: 4G
&lt;br&gt;JVM HEAP: 2G
&lt;br&gt;mapred.child.java.opts = 1024M
&lt;br&gt;hadoop-0.19.1-core.jar
&lt;br&gt;nutch-1.0
&lt;br&gt;Xen VPS.
&lt;br&gt;&lt;br&gt;After running a recrawl a few times, I end up with one segment that is
&lt;br&gt;much larger than the ones most recently generated. Here is my
&lt;br&gt;segment structure when things blow up after a (5th) recrawl:
&lt;br&gt;&lt;br&gt;segment1 = 674Megs (after several recrawls)
&lt;br&gt;segment2 = 580k (last recrawl)
&lt;br&gt;segment3 = 568k (last recrawl)
&lt;br&gt;segment4 = 584k (last recrawl)
&lt;br&gt;..
&lt;br&gt;segment8 = 560k (last recrawl)
&lt;br&gt;&lt;br&gt;When I run mergeSegments, everything goes well until we get to 90% of
&lt;br&gt;the map-reduce, and then we get a thread death. Here is a stack trace:
&lt;br&gt;&lt;br&gt;2009-11-05 10:54:16,874 INFO &amp;nbsp;[org.apache.hadoop.mapred.LocalJobRunner] reduce &amp;gt; reduce
&lt;br&gt;2009-11-05 10:54:29,794 INFO &amp;nbsp;[org.apache.hadoop.mapred.LocalJobRunner] reduce &amp;gt; reduce
&lt;br&gt;2009-11-05 10:54:55,194 INFO &amp;nbsp;[org.apache.hadoop.mapred.LocalJobRunner] reduce &amp;gt; reduce
&lt;br&gt;2009-11-05 10:57:25,844 WARN &amp;nbsp;[org.apache.hadoop.mapred.LocalJobRunner] job_local_0001
&lt;br&gt;java.lang.ThreadDeath
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at java.lang.Thread.stop(Thread.java:715)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1239)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:620)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:665)
&lt;br&gt;&lt;br&gt;Any suggestions, please!
&lt;br&gt;&lt;br&gt;Thanks.
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/MergeSegments----map-reduce-thread-death-tp26214590p26214590.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26205563</id>
	<title>[Nutch Wiki] Update of &quot;ApacheConUs2009MeetUp&quot; by AndrzejBialecki</title>
	<published>2009-11-04T14:12:11Z</published>
	<updated>2009-11-04T14:12:11Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&lt;br&gt;The &amp;quot;ApacheConUs2009MeetUp&amp;quot; page has been changed by AndrzejBialecki.
&lt;br&gt;&lt;a href=&quot;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=8&amp;rev2=9&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=8&amp;rev2=9&lt;/a&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; == Attendees ==
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;- &amp;nbsp;* Andrzej Bialeki - Apache Nutch
&lt;br&gt;+ &amp;nbsp;* Andrzej Bialecki - Apache Nutch
&lt;br&gt;&amp;nbsp; &amp;nbsp;* Thorsten Sherler - Apache Droids
&lt;br&gt;&amp;nbsp; &amp;nbsp;* Michael Stack - Formerly with Heritrix, now HBase
&lt;br&gt;&amp;nbsp; &amp;nbsp;* Ken Krugler - Bixo
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22ApacheConUs2009MeetUp%22-by-AndrzejBialecki-tp26205563p26205563.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26205467</id>
	<title>[Nutch Wiki] Update of &quot;ApacheConUs2009MeetUp&quot; by KenKrugler</title>
	<published>2009-11-04T14:04:41Z</published>
	<updated>2009-11-04T14:04:41Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&lt;br&gt;The &amp;quot;ApacheConUs2009MeetUp&amp;quot; page has been changed by KenKrugler.
&lt;br&gt;&lt;a href=&quot;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=7&amp;rev2=8&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=7&amp;rev2=8&lt;/a&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; We had a &amp;quot;Web Crawler Developer&amp;quot; !MeetUp at this year's [[&lt;a href=&quot;http://www.us.apachecon.com/c/acus2009/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.us.apachecon.com/c/acus2009/&lt;/a&gt;|ApacheCon US]] in Oakland.
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. 
&lt;br&gt;+ 
&lt;br&gt;+ -----
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; == Attendees ==
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;@@ -15, +17 @@
&lt;br&gt;&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; === Roadmaps ===
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;- Nutch - become more component based.
&lt;br&gt;+ &amp;nbsp;* Nutch - become more component based.
&lt;br&gt;- Droids - get more people involved.
&lt;br&gt;+ &amp;nbsp;* Droids - get more people involved.
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; === Sharable Components ===
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;@@ -76, +78 @@
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp;* Start contributing code
&lt;br&gt;&amp;nbsp; &amp;nbsp; * Ken will put in robots.txt parser.
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;+ -----
&lt;br&gt;+ 
&lt;br&gt;&amp;nbsp; == Original Discussion Topic List ==
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; Below are some potential topics for discussion - feel free to add/comment.
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22ApacheConUs2009MeetUp%22-by-KenKrugler-tp26205467p26205467.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26205440</id>
	<title>[Nutch Wiki] Update of &quot;ApacheConUs2009MeetUp&quot; by KenKrugler</title>
	<published>2009-11-04T14:03:23Z</published>
	<updated>2009-11-04T14:03:23Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&lt;br&gt;The &amp;quot;ApacheConUs2009MeetUp&amp;quot; page has been changed by KenKrugler.
&lt;br&gt;&lt;a href=&quot;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=6&amp;rev2=7&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=6&amp;rev2=7&lt;/a&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; == Attendees ==
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; &amp;nbsp;* Andrzej Bialeki - Apache Nutch
&lt;br&gt;- &amp;nbsp;* Thorsten xxx - Apache Droids
&lt;br&gt;+ &amp;nbsp;* Thorsten Sherler - Apache Droids
&lt;br&gt;&amp;nbsp; &amp;nbsp;* Michael Stack - Formerly with Heritrix, now HBase
&lt;br&gt;&amp;nbsp; &amp;nbsp;* Ken Krugler - Bixo
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22ApacheConUs2009MeetUp%22-by-KenKrugler-tp26205440p26205440.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26205427</id>
	<title>[Nutch Wiki] Update of &quot;ApacheConUs2009MeetUp&quot; by KenKrugler</title>
	<published>2009-11-04T14:02:35Z</published>
	<updated>2009-11-04T14:02:35Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&lt;br&gt;The &amp;quot;ApacheConUs2009MeetUp&amp;quot; page has been changed by KenKrugler.
&lt;br&gt;&lt;a href=&quot;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=5&amp;rev2=6&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=5&amp;rev2=6&lt;/a&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------------
&lt;br&gt;&lt;br&gt;- We were planning to have a &amp;quot;Web Crawler Developer&amp;quot; !MeetUp at this year's [[&lt;a href=&quot;http://www.us.apachecon.com/c/acus2009/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.us.apachecon.com/c/acus2009/&lt;/a&gt;|ApacheCon US]] in Oakland.
&lt;br&gt;+ We had a &amp;quot;Web Crawler Developer&amp;quot; !MeetUp at this year's [[&lt;a href=&quot;http://www.us.apachecon.com/c/acus2009/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.us.apachecon.com/c/acus2009/&lt;/a&gt;|ApacheCon US]] in Oakland.
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;- Unfortunately the only time slot where people would be around was Thursday night, which wound up conflicting with the Hadoop !MeetUp.
&lt;br&gt;+ It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. 
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;- So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. Location is TBD, hopefully we can get some space at the event but might be a lunch meeting :)
&lt;br&gt;+ == Attendees ==
&lt;br&gt;+ 
&lt;br&gt;+ &amp;nbsp;* Andrzej Bialeki - Apache Nutch
&lt;br&gt;+ &amp;nbsp;* Thorsten xxx - Apache Droids
&lt;br&gt;+ &amp;nbsp;* Michael Stack - Formerly with Heritrix, now HBase
&lt;br&gt;+ &amp;nbsp;* Ken Krugler - Bixo
&lt;br&gt;+ 
&lt;br&gt;+ == Topics ==
&lt;br&gt;+ 
&lt;br&gt;+ === Roadmaps ===
&lt;br&gt;+ 
&lt;br&gt;+ Nutch - become more component based.
&lt;br&gt;+ Droids - get more people involved.
&lt;br&gt;+ 
&lt;br&gt;+ === Sharable Components ===
&lt;br&gt;+ 
&lt;br&gt;+ &amp;nbsp;* robots.txt parsing
&lt;br&gt;+ &amp;nbsp;* URL normalization
&lt;br&gt;+ &amp;nbsp;* URL filtering
&lt;br&gt;+ &amp;nbsp;* Page cleansing
&lt;br&gt;+ &amp;nbsp; * General purpose
&lt;br&gt;+ &amp;nbsp; * Specialized
&lt;br&gt;+ &amp;nbsp;* Sub-page parsing (portlets)
&lt;br&gt;+ &amp;nbsp;* AJAX-ish page interactions
&lt;br&gt;+ &amp;nbsp;* Document parsing (via Tika)
&lt;br&gt;+ &amp;nbsp;* HttpClient (configuration)
&lt;br&gt;+ &amp;nbsp;* Text similarity
&lt;br&gt;+ &amp;nbsp;* Mime/charset/language detection
&lt;br&gt;+ 
&lt;br&gt;+ === Tika ===
&lt;br&gt;+ 
&lt;br&gt;+ &amp;nbsp;* Needs help to become really usable
&lt;br&gt;+ &amp;nbsp;* Would benefit from large test corpus
&lt;br&gt;+ &amp;nbsp;* Could do comparison with Nutch parser
&lt;br&gt;+ &amp;nbsp;* Needs option for direct DOM querying (screen scraping tasks)
&lt;br&gt;+ &amp;nbsp;* Handles mime &amp; charset detection now (some issues)
&lt;br&gt;+ &amp;nbsp;* Could be extended to include language detection (wrap other impl)
&lt;br&gt;+ 
&lt;br&gt;+ === URL Normalization ===
&lt;br&gt;+ 
&lt;br&gt;+ &amp;nbsp;* Includes both domain (www.x.com == x.com), path, and query portions of URL
&lt;br&gt;+ &amp;nbsp;* Often site-specific rules
&lt;br&gt;+ &amp;nbsp; * Option to derive rules using URLs to similar documents.
&lt;br&gt;+ 
&lt;br&gt;+ === AJAX-ish Page Interaction ===
&lt;br&gt;+ 
&lt;br&gt;+ &amp;nbsp;* Not applicable for broad/general crawling
&lt;br&gt;+ &amp;nbsp;* Can be very important for specific web sites
&lt;br&gt;+ &amp;nbsp;* Use Selenium or headless Mozilla
&lt;br&gt;+ 
&lt;br&gt;+ === Component API Issues ===
&lt;br&gt;+ 
&lt;br&gt;+ &amp;nbsp;* Want to avoid using an API that's tied too closely to any implementation.
&lt;br&gt;+ &amp;nbsp;* One option is to have simple (e.g. URL param) API that takes meta-data.
&lt;br&gt;+ &amp;nbsp; * Similar to Tika passing in of meta-data.
&lt;br&gt;+ 
&lt;br&gt;+ === Hosting Options ===
&lt;br&gt;+ 
&lt;br&gt;+ &amp;nbsp;* As part of Nutch - but easy to get lost in Nutch codebase, and can be associated too closely with Nutch.
&lt;br&gt;+ &amp;nbsp;* As part of Droids - but Droids is both a framework (queue-based) and set of components.
&lt;br&gt;+ &amp;nbsp;* New sub-project under Lucene TLP - but overhead to set up/maintain, and then confusion between it and Droids.
&lt;br&gt;+ &amp;nbsp;* Google code - seems like a good short-term solution, to judge level of interest and help shake out issues.
&lt;br&gt;+ 
&lt;br&gt;+ == Next Steps ==
&lt;br&gt;+ 
&lt;br&gt;+ &amp;nbsp;* Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page.
&lt;br&gt;+ &amp;nbsp;* Get input from Thorsten on Google code option. If OK as starting point, then Andrzej to set up.
&lt;br&gt;+ &amp;nbsp;* Make decision about build system (and then move on to code formatting debate :))
&lt;br&gt;+ &amp;nbsp; * I'm going to propose ant + maven ant tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good.
&lt;br&gt;+ &amp;nbsp;* Start contributing code
&lt;br&gt;+ &amp;nbsp; * Ken will put in robots.txt parser.
&lt;br&gt;+ 
&lt;br&gt;+ == Original Discussion Topic List ==
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; Below are some potential topics for discussion - feel free to add/comment.
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22ApacheConUs2009MeetUp%22-by-KenKrugler-tp26205427p26205427.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26206065</id>
	<title>Re: Free live video streaming of ApacheCon US 2009</title>
	<published>2009-11-04T06:55:16Z</published>
	<updated>2009-11-04T06:55:16Z</updated>
	<author>
		<name>Israel Ekpo</name>
	</author>
	<content type="html">Thanks a lot.&lt;br&gt;&lt;br&gt;This will be very helpful to me, as I am not able to attend.&lt;br&gt;&lt;br&gt;&lt;div class=&quot;gmail_quote&quot;&gt;On Wed, Nov 4, 2009 at 8:25 AM, Michael McCandless &lt;span dir=&quot;ltr&quot;&gt;&amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26206065&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;lucene@...&lt;/a&gt;&amp;gt;&lt;/span&gt; wrote:&lt;br&gt;
&lt;blockquote class=&quot;gmail_quote&quot; style=&quot;border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;&quot;&gt;Team,&lt;br&gt;
&lt;br&gt;
For those Lucene fanatics not in Oakland this week for ApacheCon US,&lt;br&gt;
don&amp;#39;t miss the FREE live video streaming, starting today:&lt;br&gt;
&lt;br&gt;
  &lt;a href=&quot;http://streaming.linux-magazin.de/en/program-apachecon-us-2009.htm&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;http://streaming.linux-magazin.de/en/program-apachecon-us-2009.htm&lt;/a&gt;&lt;br&gt;
&lt;br&gt;
Note that there are many talks available, covering Apache Hadoop,&lt;br&gt;
Apache HTTPD, Lucene, as well as the Apache Pioneer&amp;#39;s Panel and&lt;br&gt;
keynote presentations.&lt;br&gt;
&lt;br&gt;
Lucene&amp;#39;s track is this Friday (NOTE these times are UTC -- use&lt;br&gt;
&lt;a href=&quot;http://www.timeanddate.com&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;http://www.timeanddate.com&lt;/a&gt; to map to your time zone):&lt;br&gt;
&lt;br&gt;
 17:00 Implementing an Information Retrieval Framework for an&lt;br&gt;
       Organizational Repository, Sithu D Sudarsan&lt;br&gt;
&lt;br&gt;
 18:00 Apache Mahout - Going from raw data to information&lt;br&gt;
       Isabel Drost&lt;br&gt;
&lt;br&gt;
 19:15 MIME Magic with Apache Tika&lt;br&gt;
       Jukka Zitting&lt;br&gt;
&lt;br&gt;
 20:15 Keynote: How Open Source Developers Can (Still!) Save The World&lt;br&gt;
       Brian Behlendorf&lt;br&gt;
&lt;br&gt;
 22:00 Building Intelligent Search Applications with the Lucene&lt;br&gt;
       Ecosystem, Ted Dunning&lt;br&gt;
&lt;br&gt;
 23:00 Realtime Search&lt;br&gt;
       Jason Rutherglen&lt;br&gt;
&lt;br&gt;
Happy viewing,&lt;br&gt;
&lt;br&gt;
Mike&lt;br&gt;
&lt;/blockquote&gt;&lt;/div&gt;&lt;br&gt;&lt;br clear=&quot;all&quot;&gt;&lt;br&gt;-- &lt;br&gt;&amp;quot;Good Enough&amp;quot; is not good enough.&lt;br&gt;To give anything less than your best is to sacrifice the gift.&lt;br&gt;Quality First. Measure Twice. Cut Once.&lt;br&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Free-live-video-streaming-of-ApacheCon-US-2009-tp26196267p26206065.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26196267</id>
	<title>Free live video streaming of ApacheCon US 2009</title>
	<published>2009-11-04T05:25:25Z</published>
	<updated>2009-11-04T05:25:25Z</updated>
	<author>
		<name>Michael McCandless-2</name>
	</author>
	<content type="html">Team,
&lt;br&gt;&lt;br&gt;For those Lucene fanatics not in Oakland this week for ApacheCon US,
&lt;br&gt;don't miss the FREE live video streaming, starting today:
&lt;br&gt;&lt;br&gt;&amp;nbsp; &lt;a href=&quot;http://streaming.linux-magazin.de/en/program-apachecon-us-2009.htm&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://streaming.linux-magazin.de/en/program-apachecon-us-2009.htm&lt;/a&gt;&lt;br&gt;&lt;br&gt;Note that there are many talks available, covering Apache Hadoop,
&lt;br&gt;Apache HTTPD, Lucene, as well as the Apache Pioneer's Panel and
&lt;br&gt;keynote presentations.
&lt;br&gt;&lt;br&gt;Lucene's track is this Friday (NOTE these times are UTC -- use
&lt;br&gt;&lt;a href=&quot;http://www.timeanddate.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.timeanddate.com&lt;/a&gt;&amp;nbsp;to map to your time zone):
&lt;br&gt;&lt;br&gt;&amp;nbsp;17:00 Implementing an Information Retrieval Framework for an
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Organizational Repository, Sithu D Sudarsan
&lt;br&gt;&lt;br&gt;&amp;nbsp;18:00 Apache Mahout - Going from raw data to information
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Isabel Drost
&lt;br&gt;&lt;br&gt;&amp;nbsp;19:15 MIME Magic with Apache Tika
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Jukka Zitting
&lt;br&gt;&lt;br&gt;&amp;nbsp;20:15 Keynote: How Open Source Developers Can (Still!) Save The World
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Brian Behlendorf
&lt;br&gt;&lt;br&gt;&amp;nbsp;22:00 Building Intelligent Search Applications with the Lucene
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Ecosystem, Ted Dunning
&lt;br&gt;&lt;br&gt;&amp;nbsp;23:00 Realtime Search
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Jason Rutherglen
&lt;br&gt;&lt;br&gt;Happy viewing,
&lt;br&gt;&lt;br&gt;Mike
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Free-live-video-streaming-of-ApacheCon-US-2009-tp26196267p26196267.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26181032</id>
	<title>[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB</title>
	<published>2009-11-03T07:05:59Z</published>
	<updated>2009-11-03T07:05:59Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Julien Nioche updated NUTCH-762:
&lt;br&gt;--------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: NUTCH-762-MultiGenerator.patch
&lt;br&gt;&lt;br&gt;Patch for the MultiGenerator
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Alternative Generator which can generate several segments in one parse of the crawlDB
&lt;br&gt;&amp;gt; -------------------------------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-762
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-762&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-762&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: New Feature
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: generator
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.0.0
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: NUTCH-762-MultiGenerator.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take up most of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB and then updating the crawlDB only once for several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well, as we need to read the whole crawlDB once for every segment we generate.
&lt;br&gt;&amp;gt; The patch attached contains an implementation of a MultiGenerator &amp;nbsp;which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: 
&lt;br&gt;&amp;gt; * can filter the URLs by score
&lt;br&gt;&amp;gt; * normalisation is optional
&lt;br&gt;&amp;gt; * IP resolution is done ONLY on the entries which have been selected for &amp;nbsp;fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale
&lt;br&gt;&amp;gt; * can max the number of URLs per host or domain (but not by IP)
&lt;br&gt;&amp;gt; * can choose to partition by host, domain or IP
&lt;br&gt;&amp;gt; Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. 
&lt;br&gt;&amp;gt; We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers.
&lt;br&gt;&amp;gt; The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ...
&lt;br&gt;&amp;gt; with the following options :
&lt;br&gt;&amp;gt; MultiGenerator &amp;lt;crawldb&amp;gt; &amp;lt;segments_dir&amp;gt; [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
&lt;br&gt;&amp;gt; where most parameters are similar to the default Generator - apart from : 
&lt;br&gt;&amp;gt; -noNorm (explicit)
&lt;br&gt;&amp;gt; -topN : max number of URLs per segment
&lt;br&gt;&amp;gt; -maxNumSegments : the actual number of segments generated could be less than the max value selected, e.g. when not enough URLs are available for fetching and they fit in fewer segments
&lt;br&gt;&amp;gt; Please give it a try and let me know what you think of it
&lt;br&gt;&amp;gt; Julien Nioche
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp;
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-762%29-Alternative-Generator-which-can-generate-several-segments-in-one-parse-of-the-crawlDB-tp26180999p26181032.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26180999</id>
	<title>[jira] Created: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB</title>
	<published>2009-11-03T07:03:59Z</published>
	<updated>2009-11-03T07:03:59Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">Alternative Generator which can generate several segments in one parse of the crawlDB
&lt;br&gt;-------------------------------------------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: NUTCH-762
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-762&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-762&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Nutch
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: New Feature
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Components: generator
&lt;br&gt;&amp;nbsp; &amp;nbsp; Affects Versions: 1.0.0
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Julien Nioche
&lt;br&gt;&lt;br&gt;&lt;br&gt;When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take up most of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB and then updating the crawlDB only once for several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well, as we need to read the whole crawlDB once for every segment we generate.
&lt;br&gt;&lt;br&gt;The patch attached contains an implementation of a MultiGenerator &amp;nbsp;which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: 
&lt;br&gt;* can filter the URLs by score
&lt;br&gt;* normalisation is optional
&lt;br&gt;* IP resolution is done ONLY on the entries which have been selected for &amp;nbsp;fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale
&lt;br&gt;* can max the number of URLs per host or domain (but not by IP)
&lt;br&gt;* can choose to partition by host, domain or IP
&lt;br&gt;&lt;br&gt;Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. 
&lt;br&gt;We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers.
&lt;br&gt;&lt;br&gt;The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ...
&lt;br&gt;with the following options :
&lt;br&gt;MultiGenerator &amp;lt;crawldb&amp;gt; &amp;lt;segments_dir&amp;gt; [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
&lt;br&gt;&lt;br&gt;where most parameters are similar to the default Generator - apart from : 
&lt;br&gt;-noNorm (explicit)
&lt;br&gt;-topN : max number of URLs per segment
&lt;br&gt;-maxNumSegments : the actual number of segments generated could be less than the max value selected, e.g. when not enough URLs are available for fetching and they fit in fewer segments
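&lt;br&gt;&lt;br&gt;As a rough illustration of the idea of filling several fetchlists from a single read of the crawlDB, here is a simplified Python sketch. It is not the code in the attached patch: the function name generate_segments, the min_score and max_per_host parameters, and the choice to apply the per-host cap across all segments are assumptions made for the example, and the in-memory sort merely stands in for the ordering a MapReduce job would provide.
&lt;br&gt;&lt;pre&gt;
```python
from collections import defaultdict
from urllib.parse import urlparse


def generate_segments(entries, top_n, max_num_segments, min_score=0.0, max_per_host=None):
    """Split crawlDB-like (url, score) entries into several fetchlists in one read.

    Hypothetical helper for illustration only; it is not the MultiGenerator API.
    Assumes top_n and max_num_segments are both at least 1.
    """
    segments = [[]]
    per_host = defaultdict(int)
    # Highest-scoring URLs first, then a single loop over the sorted entries.
    for url, score in sorted(entries, key=lambda e: e[1], reverse=True):
        if min_score > score:
            continue  # score filter: low-scoring URLs are dropped early
        host = urlparse(url).netloc
        if max_per_host is not None and per_host[host] >= max_per_host:
            continue  # cap on the number of URLs per host (assumed global, not per segment)
        if len(segments[-1]) == top_n:
            if len(segments) == max_num_segments:
                break  # every allowed segment is already full
            segments.append([])  # start the next segment
        segments[-1].append(url)
        per_host[host] += 1
    # Fewer segments than max_num_segments may come back if URLs run out.
    return [s for s in segments if s]
```
&lt;/pre&gt;
&lt;br&gt;In this sketch top_n plays the role of -topN (max URLs per segment) and max_num_segments the role of -maxNumSegments; filtering on the score before anything else mirrors the point above about reducing the amount of data sent to the reducers.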
&lt;br&gt;&lt;br&gt;Please give it a try and let me know what you think of it
&lt;br&gt;&lt;br&gt;Julien Nioche
&lt;br&gt;&lt;a href=&quot;http://www.digitalpebble.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.digitalpebble.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&amp;nbsp;
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-762%29-Alternative-Generator-which-can-generate-several-segments-in-one-parse-of-the-crawlDB-tp26180999p26180999.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26180641</id>
	<title>[jira] Updated: (NUTCH-761) Avoid cloningCrawlDatum in CrawlDbReducer</title>
	<published>2009-11-03T06:41:59Z</published>
	<updated>2009-11-03T06:41:59Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Julien Nioche updated NUTCH-761:
&lt;br&gt;--------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: optiCrawlReducer.patch
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Avoid cloningCrawlDatum in CrawlDbReducer 
&lt;br&gt;&amp;gt; ------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-761
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-761&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-761&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: optiCrawlReducer.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; In the huge majority of cases the CrawlDbReducer gets unique CrawlData in its reduce phase and these will be the entries coming from the crawlDB and not present in the segments.
&lt;br&gt;&amp;gt; The attached patch optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. This has more impact as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-761%29-Avoid-cloningCrawlDatum-in-CrawlDbReducer-tp26180618p26180641.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26180618</id>
	<title>[jira] Created: (NUTCH-761) Avoid cloningCrawlDatum in CrawlDbReducer</title>
	<published>2009-11-03T06:39:59Z</published>
	<updated>2009-11-03T06:39:59Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">Avoid cloningCrawlDatum in CrawlDbReducer 
&lt;br&gt;------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: NUTCH-761
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-761&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-761&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Nutch
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Improvement
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Julien Nioche
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Priority: Minor
&lt;br&gt;&lt;br&gt;&lt;br&gt;In the huge majority of cases the CrawlDbReducer gets unique CrawlData in its reduce phase and these will be the entries coming from the crawlDB and not present in the segments.
&lt;br&gt;The attached patch optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. This has more impact as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase.
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-761%29-Avoid-cloningCrawlDatum-in-CrawlDbReducer-tp26180618p26180618.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26121040</id>
	<title>[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed</title>
	<published>2009-10-29T14:21:59Z</published>
	<updated>2009-10-29T14:21:59Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12771625#action_12771625&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12771625#action_12771625&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;David Stuart commented on NUTCH-585:
&lt;br&gt;------------------------------------
&lt;br&gt;&lt;br&gt;Hi Andrea,
&lt;br&gt;&lt;br&gt;I hope your week of demos went well. I too would be interested in this code, as I would like to look at extending it to be slightly more generic, allowing for regular expression matches or an XPath-like model (the plan is still forming). From the web crawler&amp;#39;s point of view it would be a hard one to get right, but we have about 26 sites that are well known to us that we wish to crawl, and they have common blocks we wish to remove, which a configurable version of your code may achieve.
&lt;br&gt;&lt;br&gt;Looking forward to seeing your patch.
&lt;br&gt;&lt;br&gt;&lt;br&gt;Regards,
&lt;br&gt;&lt;br&gt;David Stuart
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
&lt;br&gt;&amp;gt; -----------------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-585
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-585&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-585&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.9.0
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: All operating systems
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Andrea Spinelli
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; We are using nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches.
&lt;br&gt;&amp;gt; We have modified the plugin so that it ignores HTML code between certain HTML comments, like
&lt;br&gt;&amp;gt; &amp;lt;!-- START-IGNORE --&amp;gt;
&lt;br&gt;&amp;gt; ... ignored part ...
&lt;br&gt;&amp;gt; &amp;lt;!-- STOP-IGNORE --&amp;gt;
&lt;br&gt;&amp;gt; We feel this might be useful to someone else, maybe factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
&lt;br&gt;&amp;gt; We are almost ready to contribute our code snippet. &amp;nbsp;Looking forward to any expression of interest - or to an explanation of why what we are doing is plain wrong!
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-585%29--PARSE-HTML-plugin--Block-certain-parts-of-HTML-code-from-being-indexed-tp14023775p26121040.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26084822</id>
	<title>[Nutch Wiki] Update of &quot;DownloadingNutch&quot; by SteveKearns</title>
	<published>2009-10-27T13:39:43Z</published>
	<updated>2009-10-27T13:39:43Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&lt;br&gt;The &amp;quot;DownloadingNutch&amp;quot; page has been changed by SteveKearns.
&lt;br&gt;&lt;a href=&quot;http://wiki.apache.org/nutch/DownloadingNutch?action=diff&amp;rev1=5&amp;rev2=6&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/DownloadingNutch?action=diff&amp;rev1=5&amp;rev2=6&lt;/a&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; You have two choices in how to get Nutch:
&lt;br&gt;- &amp;nbsp; 1. You can download a release from &lt;a href=&quot;http://lucene.apache.org/nutch/release/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/nutch/release/&lt;/a&gt;. &amp;nbsp;This will give you a relatively stable release. &amp;nbsp;At the moment the latest release is 0.9.
&lt;br&gt;+ &amp;nbsp; 1. You can download a release from &lt;a href=&quot;http://lucene.apache.org/nutch/release/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/nutch/release/&lt;/a&gt;. &amp;nbsp;This will give you a relatively stable release. &amp;nbsp;At the moment the latest release is 1.0.
&lt;br&gt;- &amp;nbsp; 2. Or, you can check out the latest source code from subversion and build it with Ant. &amp;nbsp;This gets you closer to the bleeding edge of development. &amp;nbsp;The 0.9 should be relatively stable but the trunk (from which the [[&lt;a href=&quot;http://lucene.apache.org/nutch/nightly.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/nutch/nightly.html&lt;/a&gt;|nightly builds]] are build) is under heavy development with bugs showing up and getting squashed fairly frequently. 
&lt;br&gt;+ &amp;nbsp; 2. Or, you can check out the latest source code from subversion and build it with Ant. &amp;nbsp;This gets you closer to the bleeding edge of development. &amp;nbsp;The 1.0 release should be relatively stable but the trunk (from which the [[&lt;a href=&quot;http://lucene.apache.org/nutch/nightly.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/nutch/nightly.html&lt;/a&gt;|nightly builds]] are build) is under heavy development with bugs showing up and getting squashed fairly frequently. 
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; Note: As of 5/29/08 the Subversion trunk seems to be much better than the 0.9 release. If you have trouble with 0.9 your best bet is to try moving to trunk and see if the problems resolve themselves.
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22DownloadingNutch%22-by-SteveKearns-tp26084822p26084822.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26077490</id>
	<title>[Nutch Wiki] Update of &quot;ApacheConUs2009MeetUp&quot; by KenKrugler</title>
	<published>2009-10-27T06:12:54Z</published>
	<updated>2009-10-27T06:12:54Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&lt;br&gt;The &amp;quot;ApacheConUs2009MeetUp&amp;quot; page has been changed by KenKrugler.
&lt;br&gt;&lt;a href=&quot;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=4&amp;rev2=5&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&amp;rev1=4&amp;rev2=5&lt;/a&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------------
&lt;br&gt;&lt;br&gt;- We're planning to have a &amp;quot;Web Crawler Developer&amp;quot; !MeetUp at this year's [[&lt;a href=&quot;http://www.us.apachecon.com/c/acus2009/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.us.apachecon.com/c/acus2009/&lt;/a&gt;|ApacheCon US]] in Oakland.
&lt;br&gt;+ We were planning to have a &amp;quot;Web Crawler Developer&amp;quot; !MeetUp at this year's [[&lt;a href=&quot;http://www.us.apachecon.com/c/acus2009/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.us.apachecon.com/c/acus2009/&lt;/a&gt;|ApacheCon US]] in Oakland.
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;- Tentative plan is for Thursday evening, November 5th. The actual schedule for !MeetUps is [[&lt;a href=&quot;http://wiki.apache.org/apachecon/ApacheMeetupsUs09&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/apachecon/ApacheMeetupsUs09&lt;/a&gt;|here]].
&lt;br&gt;+ Unfortunately the only time slot where people would be around was Thursday night, which wound up conflicting with the Hadoop !MeetUp.
&lt;br&gt;+ 
&lt;br&gt;+ So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. Location is TBD, hopefully we can get some space at the event but might be a lunch meeting :)
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; Below are some potential topics for discussion - feel free to add/comment.
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Update-of-%22ApacheConUs2009MeetUp%22-by-KenKrugler-tp26077490p26077490.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26075939</id>
	<title>[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index</title>
	<published>2009-10-27T04:16:59Z</published>
	<updated>2009-10-27T04:16:59Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12770464#action_12770464&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12770464#action_12770464&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;David Stuart commented on NUTCH-760:
&lt;br&gt;------------------------------------
&lt;br&gt;&lt;br&gt;Hi Andrzej,
&lt;br&gt;&lt;br&gt;I have amended the patch to incorporate your suggestions
&lt;br&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760&lt;/a&gt;&lt;br&gt;&lt;br&gt;Regards,
&lt;br&gt;&lt;br&gt;&lt;br&gt;Dave 
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Allow field mapping from nutch to solr index
&lt;br&gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-760
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: indexer
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: David Stuart
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I am using nutch to crawl sites and have combined it
&lt;br&gt;&amp;gt; with solr pushing the nutch index using the solrindex command. I have
&lt;br&gt;&amp;gt; set it up as specified on the wiki using the copyField url to id in the
&lt;br&gt;&amp;gt; schema. Whilst this works fine, it stuffs up my inputs from other
&lt;br&gt;&amp;gt; sources in solr (e.g. using the solr data import handler) as they have
&lt;br&gt;&amp;gt; both ids and urls. I have a patch that implements a nutch xml schema
&lt;br&gt;&amp;gt; defining what basic nutch fields map to in your solr push.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-760%29-Allow-field-mapping-from-nutch-to-solr-index-tp25906464p26075939.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26075730</id>
	<title>[jira] Updated: (NUTCH-760) Allow field mapping from nutch to solr index</title>
	<published>2009-10-27T03:58:59Z</published>
	<updated>2009-10-27T03:58:59Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;David Stuart updated NUTCH-760:
&lt;br&gt;-------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: solrindex_schema.patch
&lt;br&gt;&lt;br&gt;Have updated patch as per comment below
&lt;br&gt;&amp;nbsp; &amp;nbsp; * &amp;nbsp;the description of the property in nutch-default.xml could be more descriptive
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; * &amp;lt;schema&amp;gt; element has name and version attributes - do we really need these? It's not a Solr schema.xml anyway, so we don't have to pretend that we follow the same format.
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; * SolrSchemaReader uses static instance of NutchConfiguration - this is a big no-no, the whole point of using the property in nutch-default.xml is that you could set different values, and making this field static basically pins down the configuration to the version set on the first instantiation of the class ... Please do as other similar classes do - implement Configurable, or add Configuration to the constructor, and pass the current job configuration where appropriate.
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; * consequently, static references to SolrSchemaReader need to be un-staticized in other places.
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; * minor nits: code formatting should use 2 literal spaces indents. There are some accidental changes in NutchBean and SolrWriter.
&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Allow field mapping from nutch to solr index
&lt;br&gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-760
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: indexer
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: David Stuart
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I am using nutch to crawl sites and have combined it
&lt;br&gt;&amp;gt; with solr pushing the nutch index using the solrindex command. I have
&lt;br&gt;&amp;gt; set it up as specified on the wiki using the copyField url to id in the
&lt;br&gt;&amp;gt; schema. Whilst this works fine, it stuffs up my inputs from other
&lt;br&gt;&amp;gt; sources in solr (e.g. using the solr data import handler) as they have
&lt;br&gt;&amp;gt; both ids and urls. I have a patch that implements a nutch xml schema
&lt;br&gt;&amp;gt; defining what basic nutch fields map to in your solr push.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-760%29-Allow-field-mapping-from-nutch-to-solr-index-tp25906464p26075730.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26056403</id>
	<title>How to index files only with specific type</title>
	<published>2009-10-26T02:11:58Z</published>
	<updated>2009-10-26T02:11:58Z</updated>
	<author>
		<name>funnyduck</name>
	</author>
	<content type="html">Hi, I&amp;#39;ve created a parser and indexer for a specific file type (a geo XML meta file, KML).&lt;br&gt;I am trying to crawl a couple of sites and index only files of this type.&lt;br&gt;I don&amp;#39;t want to index HTML or anything else.&lt;br&gt;How can I achieve this?&lt;br&gt;
Thanks.&lt;br&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/How-to-index-files-only-with-specific-type-tp26056403p26056403.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26055459</id>
	<title>[jira] Commented: (NUTCH-755) DomainURLFilter crashes on malformed URL</title>
	<published>2009-10-26T00:39:59Z</published>
	<updated>2009-10-26T00:39:59Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12769926#action_12769926&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12769926#action_12769926&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Reinhard Schwab commented on NUTCH-755:
&lt;br&gt;---------------------------------------
&lt;br&gt;&lt;br&gt;in the first case, the &amp;quot;url&amp;quot; is rejected, isn&amp;#39;t it?
&lt;br&gt;the filter method will return null.
&lt;br&gt;&lt;br&gt;catch (Exception e) {
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; // if an error happens, allow the url to pass
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; LOG.error(&amp;quot;Could not apply filter on url: &amp;quot; + url + &amp;quot;\n&amp;quot;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; + org.apache.hadoop.util.StringUtils.stringifyException(e));
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; return null;
&lt;br&gt;&amp;nbsp; &amp;nbsp; }
&lt;br&gt;&lt;br&gt;the comment in the code is wrong.
&lt;br&gt;if the method returns null, the url does not pass.
&lt;br&gt;&lt;br&gt;the malformed check is done by java.net.URL constructor.
&lt;br&gt;it accepts http:/comments.php
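&lt;br&gt;&lt;br&gt;a minimal sketch of that last point, assuming nothing about the Nutch code itself (the class name UrlHostCheck and the method classify are hypothetical): the java.net.URL constructor does not throw for http:/comments.php, it only yields a URL with no host, so a filter that wants to reject such input has to check the host itself.
&lt;pre&gt;
```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlHostCheck {

  // Classifies a URL string as "malformed", "no-host" or "ok".
  @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs
  static String classify(String spec) {
    URL url;
    try {
      url = new URL(spec);
    } catch (MalformedURLException e) {
      return "malformed"; // e.g. the string has no protocol at all
    }
    String host = url.getHost();
    if (host == null || host.isEmpty()) {
      return "no-host"; // parsed fine, but there is no host to match a domain against
    }
    return "ok";
  }

  public static void main(String[] args) {
    String[] samples = {
      "http:/comments.php",
      "http://example.com/comments.php",
      "comments.php"
    };
    for (String s : samples) {
      System.out.println(s + " : " + classify(s));
    }
  }
}
```
&lt;/pre&gt;a filter could treat the no-host case the same way as the malformed one and reject the url before doing any domain lookup.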
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; DomainURLFilter crashes on malformed URL
&lt;br&gt;&amp;gt; ----------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-755
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-755&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-755&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Bug
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: fetcher
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 1.0.0
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: Tomcat 6.0.14
&lt;br&gt;&amp;gt; Java 1.6.0_14
&lt;br&gt;&amp;gt; Linux
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Mike Baranczak
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; 2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply filter on url: http:/comments.php
&lt;br&gt;&amp;gt; java.lang.NullPointerException
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
&lt;br&gt;&amp;gt; Expected behavior would be to recognize the URL as malformed, and reject it.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-755%29-DomainURLFilter-crashes-on-malformed-URL-tp25484266p26055459.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26040751</id>
	<title>[Nutch Wiki] Trivial Update of &quot;首页&quot; by yongping8204</title>
	<published>2009-10-24T09:37:39Z</published>
	<updated>2009-10-24T09:37:39Z</updated>
	<author>
		<name>Apache Wiki</name>
	</author>
	<content type="html">Dear Wiki user,
&lt;br&gt;&lt;br&gt;You have subscribed to a wiki page or wiki category on &amp;quot;Nutch Wiki&amp;quot; for change notification.
&lt;br&gt;&lt;br&gt;The &amp;quot;首页&amp;quot; page has been changed by yongping8204.
&lt;br&gt;&lt;a href=&quot;http://wiki.apache.org/nutch/%E9%A6%96%E9%A1%B5?action=diff&amp;rev1=4&amp;rev2=5&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/nutch/%E9%A6%96%E9%A1%B5?action=diff&amp;rev1=4&amp;rev2=5&lt;/a&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; #format wiki
&lt;br&gt;&amp;nbsp; #language zh
&lt;br&gt;&amp;nbsp; #pragma section-numbers off
&lt;br&gt;- 
&lt;br&gt;&amp;nbsp; = 维基链接名 维基 =
&lt;br&gt;&amp;nbsp; 您也许可以从这些连接开始:
&lt;br&gt;+ 
&lt;br&gt;- &amp;nbsp;* [[最新改动]]: 谁最近改动了什么
&lt;br&gt;+ &amp;nbsp;* [[最新改动]]: 谁最近改动了什么 (我在修改)
&lt;br&gt;&amp;nbsp; &amp;nbsp;* [[维基沙盘演练]]: 您可以随意改动编辑，热身演练
&lt;br&gt;&amp;nbsp; &amp;nbsp;* [[查找网页]]: 用多种方法搜索浏览这个站点
&lt;br&gt;- &amp;nbsp;* [[语法参考]]: 维基语法简便参考 
&lt;br&gt;+ &amp;nbsp;* [[语法参考]]: 维基语法简便参考
&lt;br&gt;&amp;nbsp; &amp;nbsp;* [[站点导航]]: 本站点内容概要
&lt;br&gt;+ 
&lt;br&gt;&amp;nbsp; 这个维基是有关什么的?
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;&amp;nbsp; 测试
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;+ == 如何使用这个站点 ==
&lt;br&gt;+ 维基(wiki)是一种协同合作网站，任何人都可以参与网站的建立、编辑和维护并分享网站的内容：
&lt;br&gt;&amp;nbsp; 
&lt;br&gt;- == 如何使用这个站点 ==
&lt;br&gt;- 
&lt;br&gt;- 维基(wiki)是一种协同合作网站，任何人都可以参与网站的建立、编辑和维护并分享网站的内容：
&lt;br&gt;&amp;nbsp; &amp;nbsp;* 点击每个网页页眉或页尾中的'''&amp;lt;&amp;lt;GetText(Edit)&amp;gt;&amp;gt;'''就可以随意编辑改动这个网页。
&lt;br&gt;&amp;nbsp; &amp;nbsp;* 创建一个链接简单的不能再简单了：您可以使用连在一起的，每个单词第一个字母大写，但不用空格分隔的词组(比如WikiSandBox)，也可以用{{{[&amp;quot;quoted words in brackets&amp;quot;]}}}。简体中文的链接可以使用后者，比如{{{[&amp;quot;维基沙盘演练&amp;quot;]}}}。
&lt;br&gt;&amp;nbsp; &amp;nbsp;* 每页的页眉中的搜索框可以用来将进行网页标题搜索或者进行全文检索。
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-Nutch-Wiki--Trivial-Update-of-%22%E9%A6%96%E9%A1%B5%22-by-yongping8204-tp26040751p26040751.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26003897</id>
	<title>Re: datanode.BlockAlreadyExistsException</title>
	<published>2009-10-21T21:29:25Z</published>
	<updated>2009-10-21T21:29:25Z</updated>
	<author>
		<name>Jesse Hires</name>
	</author>
	<content type="html">I am still getting the same errors.&lt;br&gt;&lt;br&gt;I ran fsck (rebooted with a forced fsck at startup) No issues.&lt;br&gt;I increased the ulimit to 8192&lt;br&gt;&lt;br&gt;I was using /etc/hosts for all name lookups (common across all machines and copied from same location). I have since modified hadoop-site.xml and the slaves file to use IP address only.&lt;br&gt;
&lt;br&gt;Using ifconfig, ping, and looking at /etc/sysconfig/network I&amp;#39;ve determined that all the machines are who they think they are.&lt;br&gt;&lt;br&gt;Of note may be that I also get the following WARN in the logs after the BlockAlreadyExistsException. I am seeing the same on both datanodes (just swap the IP addresses)&lt;br&gt;
&lt;pre&gt;2009-10-21 21:13:03,415 WARN  datanode.DataNode - DatanodeRegistration(&lt;a href=&quot;http://192.168.1.7:50010&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;192.168.1.7:50010&lt;/a&gt;, storageID=DS-1226842861-192.168.1.7-50010-1254609174303, infoPort=50075, ipcPort=50020):Failed to transfer blk_-2053461958845826983_3919 to &lt;a href=&quot;http://192.168.1.6:50010&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;192.168.1.6:50010&lt;/a&gt; got java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer&lt;br&gt;
	at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)&lt;br&gt;	at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:456)&lt;br&gt;	at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:557)&lt;br&gt;	at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:199)&lt;br&gt;
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)&lt;br&gt;	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)&lt;br&gt;	at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1108)&lt;br&gt;
	at java.lang.Thread.run(Thread.java:636)&lt;br&gt;Caused by: java.io.IOException: Connection reset by peer&lt;br&gt;&lt;/pre&gt;&lt;br clear=&quot;all&quot;&gt;I am able to generate/fetch/updatedb/etc....  As near as I can tell, things seem to be working, but I really wouldn&amp;#39;t know if I am missing anything anyway. No errors are being displayed on the command line. Every iteration seems to be growing the index, segments, linkdb accordingly.&lt;br&gt;
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;Here is the hadoop-site.xml&lt;br&gt;&amp;lt;?xml version=&amp;quot;1.0&amp;quot;?&amp;gt;&lt;br&gt;&amp;lt;?xml-stylesheet type=&amp;quot;text/xsl&amp;quot; href=&amp;quot;configuration.xsl&amp;quot;?&amp;gt;&lt;br&gt;&lt;br&gt;&amp;lt;!-- Put site-specific property overrides in this file. --&amp;gt;&lt;br&gt;
&lt;br&gt;&amp;lt;configuration&amp;gt;&lt;br&gt;&lt;br&gt;  &amp;lt;property&amp;gt;&lt;br&gt;    &amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;&lt;br&gt;    &amp;lt;value&amp;gt;hdfs://192.168.1.3:9000&amp;lt;/value&amp;gt;&lt;br&gt;
  &amp;lt;/property&amp;gt;&lt;br&gt;&lt;br&gt;  &amp;lt;property&amp;gt;&lt;br&gt;    &amp;lt;name&amp;gt;mapred.job.tracker&amp;lt;/name&amp;gt;&lt;br&gt;    &amp;lt;value&amp;gt;hdfs://192.168.1.3:9001&amp;lt;/value&amp;gt;&lt;br&gt;  &amp;lt;/property&amp;gt;&lt;br&gt;&lt;br&gt;
  &amp;lt;property&amp;gt;&lt;br&gt;    &amp;lt;name&amp;gt;mapred.tasktracker.tasks.maximum&amp;lt;/name&amp;gt;&lt;br&gt;    &amp;lt;value&amp;gt;1&amp;lt;/value&amp;gt;&lt;br&gt;  &amp;lt;/property&amp;gt;&lt;br&gt;&lt;br&gt;  &amp;lt;property&amp;gt;&lt;br&gt;    &amp;lt;name&amp;gt;mapred.child.java.opts&amp;lt;/name&amp;gt;&lt;br&gt;
    &amp;lt;value&amp;gt;-Xmx512m&amp;lt;/value&amp;gt;&lt;br&gt;  &amp;lt;/property&amp;gt;&lt;br&gt;&lt;br&gt;  &amp;lt;property&amp;gt;&lt;br&gt;    &amp;lt;name&amp;gt;dfs.name.dir&amp;lt;/name&amp;gt;&lt;br&gt;    &amp;lt;value&amp;gt;/home/nutch/crawl/filesystem/name&amp;lt;/value&amp;gt;&lt;br&gt;  &amp;lt;/property&amp;gt;&lt;br&gt;
&lt;br&gt;  &amp;lt;property&amp;gt;&lt;br&gt;    &amp;lt;name&amp;gt;dfs.data.dir&amp;lt;/name&amp;gt;&lt;br&gt;    &amp;lt;value&amp;gt;/home/nutch/crawl/filesystem/data&amp;lt;/value&amp;gt;&lt;br&gt;  &amp;lt;/property&amp;gt;&lt;br&gt;&lt;br&gt;  &amp;lt;property&amp;gt;&lt;br&gt;    &amp;lt;name&amp;gt;mapred.system.dir&amp;lt;/name&amp;gt;&lt;br&gt;
    &amp;lt;value&amp;gt;/home/nutch/crawl/filesystem/mapreduce/system&amp;lt;/value&amp;gt;&lt;br&gt;  &amp;lt;/property&amp;gt;&lt;br&gt;&lt;br&gt;  &amp;lt;property&amp;gt;&lt;br&gt;    &amp;lt;name&amp;gt;mapred.local.dir&amp;lt;/name&amp;gt;&lt;br&gt;    &amp;lt;value&amp;gt;/home/nutch/crawl/filesystem/mapreduce/local&amp;lt;/value&amp;gt;&lt;br&gt;
  &amp;lt;/property&amp;gt;&lt;br&gt;&lt;br&gt;  &amp;lt;property&amp;gt;&lt;br&gt;    &amp;lt;name&amp;gt;dfs.replication&amp;lt;/name&amp;gt;&lt;br&gt;    &amp;lt;value&amp;gt;1&amp;lt;/value&amp;gt;&lt;br&gt;  &amp;lt;/property&amp;gt;&lt;br&gt;&lt;br&gt;&amp;lt;/configuration&amp;gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;Jesse&lt;br&gt;
&lt;br&gt;int GetRandomNumber()&lt;br&gt;{&lt;br&gt;    return 4; // Chosen by fair roll of dice&lt;br&gt;                 // Guaranteed to be random&lt;br&gt;} // &lt;a href=&quot;http://xkcd.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;xkcd.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;&lt;div class=&quot;gmail_quote&quot;&gt;On Wed, Oct 21, 2009 at 6:01 PM, Jesse Hires &lt;span dir=&quot;ltr&quot;&gt;&amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26003897&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;jhires@...&lt;/a&gt;&amp;gt;&lt;/span&gt; wrote:&lt;br&gt;&lt;blockquote class=&quot;gmail_quote&quot; style=&quot;border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;&quot;&gt;
Thanks for the pointers!&lt;br&gt;As soon as I have some results, I&amp;#39;ll post them back and let you know if the problem is solved.&lt;div class=&quot;im&quot;&gt;&lt;br&gt;&lt;br clear=&quot;all&quot;&gt;Jesse&lt;br&gt;&lt;br&gt;int GetRandomNumber()&lt;br&gt;{&lt;br&gt;    return 4; // Chosen by fair roll of dice&lt;br&gt;

                 // Guaranteed to be random&lt;br&gt;} // &lt;a href=&quot;http://xkcd.com&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;xkcd.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;/div&gt;&lt;div class=&quot;h5&quot;&gt;&lt;div class=&quot;gmail_quote&quot;&gt;On Wed, Oct 21, 2009 at 4:46 AM, Andrzej Bialecki &lt;span dir=&quot;ltr&quot;&gt;&amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26003897&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt;&lt;/span&gt; wrote:&lt;br&gt;
&lt;blockquote class=&quot;gmail_quote&quot; style=&quot;border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;&quot;&gt;
&lt;div&gt;Jesse Hires wrote:&lt;br&gt;
&lt;blockquote class=&quot;gmail_quote&quot; style=&quot;border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;&quot;&gt;
I tried asking this over at the nutch-user alias, but I am seeing very little traction, so I thought I&amp;#39;d ask the developers. I realize this is most likely a configuration problem on my end, but I am very new to using nutch, so I am having a difficult time understanding where I need to look.&lt;br&gt;


&lt;br&gt;
Does anyone have any insight into the following error I am seeing in the hadoop logs? Is this something I should be concerned with, or is it expected that this shows up in the logs from time to time? If it is not expected, where can I look for more information on what is going on?&lt;br&gt;


&lt;/blockquote&gt;
&lt;br&gt;&lt;/div&gt;
It&amp;#39;s not expected at all - this usually indicates some config error, or FS corruption, or it may be also caused by conflicting DNS (e.g. the same name resolving to different addresses on different nodes), or a problem with permissions (e.g. daemon started remotely uses uid/permissions/env that doesn&amp;#39;t allow it to create/delete files in data dir). This may be also some weird corner case when processes run out of file descriptors - you should check ulimit -n and set it to a value higher than 4096.&lt;br&gt;


&lt;br&gt;
Please also run fsck / and see what it says.&lt;div&gt;&lt;br&gt;
&lt;br&gt;
&lt;blockquote class=&quot;gmail_quote&quot; style=&quot;border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;&quot;&gt;
I can also provide config files if needed.&lt;br&gt;
&lt;/blockquote&gt;
&lt;br&gt;&lt;/div&gt;
We need just the modifications in hadoop-site.xml, that&amp;#39;s where the problem may be located.&lt;br&gt;&lt;font color=&quot;#888888&quot;&gt;
&lt;br&gt;
&lt;br&gt;
--&lt;br&gt;
Best regards,&lt;br&gt;
Andrzej Bialecki     &amp;lt;&amp;gt;&amp;lt;&lt;br&gt;
 ___. ___ ___ ___ _ _   __________________________________&lt;br&gt;
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web&lt;br&gt;
___|||__||  \|  ||  |  Embedded Unix, System Integration&lt;br&gt;
&lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;  Contact: info at sigram dot com&lt;br&gt;
&lt;br&gt;
&lt;/font&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;br&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;br&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/datanode.BlockAlreadyExistsException-tp25984146p26003897.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26002577</id>
	<title>Re: datanode.BlockAlreadyExistsException</title>
	<published>2009-10-21T18:01:03Z</published>
	<updated>2009-10-21T18:01:03Z</updated>
	<author>
		<name>Jesse Hires</name>
	</author>
	<content type="html">Thanks for the pointers!&lt;br&gt;As soon as I have some results, I&amp;#39;ll post them back and let you know if the problem is solved.&lt;br&gt;&lt;br clear=&quot;all&quot;&gt;Jesse&lt;br&gt;&lt;br&gt;int GetRandomNumber()&lt;br&gt;{&lt;br&gt;    return 4; // Chosen by fair roll of dice&lt;br&gt;
                 // Guaranteed to be random&lt;br&gt;} // &lt;a href=&quot;http://xkcd.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;xkcd.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;&lt;div class=&quot;gmail_quote&quot;&gt;On Wed, Oct 21, 2009 at 4:46 AM, Andrzej Bialecki &lt;span dir=&quot;ltr&quot;&gt;&amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26002577&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;ab@...&lt;/a&gt;&amp;gt;&lt;/span&gt; wrote:&lt;br&gt;&lt;blockquote class=&quot;gmail_quote&quot; style=&quot;border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;&quot;&gt;
&lt;div class=&quot;im&quot;&gt;Jesse Hires wrote:&lt;br&gt;
&lt;blockquote class=&quot;gmail_quote&quot; style=&quot;border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;&quot;&gt;
I tried asking this over at the nutch-user alias, but I am seeing very little traction, so I thought I&amp;#39;d ask the developers. I realize this is most likely a configuration problem on my end, but I am very new to using nutch, so I am having a difficult time understanding where I need to look.&lt;br&gt;

&lt;br&gt;
Does anyone have any insight into the following error I am seeing in the hadoop logs? Is this something I should be concerned with, or is it expected that this shows up in the logs from time to time? If it is not expected, where can I look for more information on what is going on?&lt;br&gt;

&lt;/blockquote&gt;
&lt;br&gt;&lt;/div&gt;
It&amp;#39;s not expected at all - this usually indicates some config error, or FS corruption, or it may be also caused by conflicting DNS (e.g. the same name resolving to different addresses on different nodes), or a problem with permissions (e.g. daemon started remotely uses uid/permissions/env that doesn&amp;#39;t allow it to create/delete files in data dir). This may be also some weird corner case when processes run out of file descriptors - you should check ulimit -n and set it to a value higher than 4096.&lt;br&gt;

&lt;br&gt;
Please also run fsck / and see what it says.&lt;div class=&quot;im&quot;&gt;&lt;br&gt;
&lt;br&gt;
&lt;blockquote class=&quot;gmail_quote&quot; style=&quot;border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;&quot;&gt;
I can also provide config files if needed.&lt;br&gt;
&lt;/blockquote&gt;
&lt;br&gt;&lt;/div&gt;
We need just the modifications in hadoop-site.xml, that&amp;#39;s where the problem may be located.&lt;br&gt;&lt;font color=&quot;#888888&quot;&gt;
&lt;br&gt;
&lt;br&gt;
--&lt;br&gt;
Best regards,&lt;br&gt;
Andrzej Bialecki     &amp;lt;&amp;gt;&amp;lt;&lt;br&gt;
 ___. ___ ___ ___ _ _   __________________________________&lt;br&gt;
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web&lt;br&gt;
___|||__||  \|  ||  |  Embedded Unix, System Integration&lt;br&gt;
&lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;  Contact: info at sigram dot com&lt;br&gt;
&lt;br&gt;
&lt;/font&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;br&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/datanode.BlockAlreadyExistsException-tp25984146p26002577.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-25990965</id>
	<title>Re: datanode.BlockAlreadyExistsException</title>
	<published>2009-10-21T04:46:27Z</published>
	<updated>2009-10-21T04:46:27Z</updated>
	<author>
		<name>Andrzej Bialecki</name>
	</author>
	<content type="html">Jesse Hires wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; I tried asking this over at the nutch-user alias, but I am seeing very 
&lt;br&gt;&amp;gt; little traction, so I thought I'd ask the developers. I realize this is 
&lt;br&gt;&amp;gt; most likely a configuration problem on my end, but I am very new to 
&lt;br&gt;&amp;gt; using nutch, so I am having a difficult time understanding where I need 
&lt;br&gt;&amp;gt; to look.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Does anyone have any insight into the following error I am seeing in the 
&lt;br&gt;&amp;gt; hadoop logs? Is this something I should be concerned with, or is it 
&lt;br&gt;&amp;gt; expected that this shows up in the logs from time to time? If it is not 
&lt;br&gt;&amp;gt; expected, where can I look for more information on what is going on?
&lt;/div&gt;&lt;br&gt;It's not expected at all - this usually indicates some config error, or 
&lt;br&gt;FS corruption, or it may be also caused by conflicting DNS (e.g. the 
&lt;br&gt;same name resolving to different addresses on different nodes), or a 
&lt;br&gt;problem with permissions (e.g. daemon started remotely uses 
&lt;br&gt;uid/permissions/env that doesn't allow it to create/delete files in data 
&lt;br&gt;dir). This may be also some weird corner case when processes run out of 
&lt;br&gt;file descriptors - you should check ulimit -n and set it to a value 
&lt;br&gt;higher than 4096.
&lt;br&gt;&lt;br&gt;Please also run fsck / and see what it says.
&lt;br&gt;&lt;br&gt;&amp;gt; I can also provide config files if needed.
&lt;br&gt;&lt;br&gt;We need just the modifications in hadoop-site.xml, that's where the 
&lt;br&gt;problem may be located.
&lt;br&gt;&lt;br&gt;&lt;br&gt;--
&lt;br&gt;Best regards,
&lt;br&gt;Andrzej Bialecki &amp;nbsp; &amp;nbsp; &amp;lt;&amp;gt;&amp;lt;
&lt;br&gt;&amp;nbsp; ___. ___ ___ ___ _ _ &amp;nbsp; __________________________________
&lt;br&gt;[__ || __|__/|__||\/| &amp;nbsp;Information Retrieval, Semantic Web
&lt;br&gt;___|||__|| &amp;nbsp;\| &amp;nbsp;|| &amp;nbsp;| &amp;nbsp;Embedded Unix, System Integration
&lt;br&gt;&lt;a href=&quot;http://www.sigram.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.sigram.com&lt;/a&gt;&amp;nbsp; Contact: info at sigram dot com
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/datanode.BlockAlreadyExistsException-tp25984146p25990965.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-25984146</id>
	<title>datanode.BlockAlreadyExistsException</title>
	<published>2009-10-20T16:22:05Z</published>
	<updated>2009-10-20T16:22:05Z</updated>
	<author>
		<name>Jesse Hires</name>
	</author>
	<content type="html">I tried asking this over at the nutch-user alias, but I am seeing very little traction, so I thought I&amp;#39;d ask the developers. I realize this is most likely a configuration problem on my end, but I am very new to using nutch, so I am having a difficult time understanding where I need to look.&lt;br&gt;
&lt;br&gt;Does anyone have any insight into the following error I am seeing in
the hadoop logs? Is this something I should be concerned with, or is it
expected that this shows up in the logs from time to time? If it is not
expected, where can I look for more information on what is going on?&lt;br&gt;
&lt;br&gt;&lt;pre&gt;2009-10-16 17:02:43,061 ERROR datanode.DataNode - DatanodeRegistration(&lt;a href=&quot;http://192.168.1.7:50010/&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;192.168.1.7:50010&lt;/a&gt;, storageID=DS-1226842861-192.168.1.7-50010-1254609174303, infoPort=50075, ipcPort=50020):DataXceiver&lt;br&gt;
&lt;br&gt;org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_909837363833332565_3277 is valid, and cannot be written to.&lt;br&gt;	at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:975)&lt;br&gt;
&lt;br&gt;	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.&amp;lt;init&amp;gt;(BlockReceiver.java:97)&lt;br&gt;	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:259)&lt;br&gt;	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)&lt;br&gt;
&lt;br&gt;	at java.lang.Thread.run(Thread.java:636)&lt;br&gt;&lt;/pre&gt;&lt;br&gt;&lt;br&gt;I
am able to produce this just by injecting the urls (2 of them), but it
shows up on both datanodes, and happens whenever I run an operation
that uses dfs.&lt;br&gt;&lt;br&gt;I am running the latest sources from the trunk. &lt;br&gt;I&amp;#39;ve verified that only one instance of each of the following is running on the datanodes:&lt;br&gt;&lt;div style=&quot;margin-left: 40px;&quot;&gt;org.apache.hadoop.hdfs.server.datanode.DataNode&lt;br&gt;
org.apache.hadoop.mapred.TaskTracker&lt;br&gt;&lt;/div&gt;&lt;br&gt;I&amp;#39;ve also verified that only one instance of each of the following is running on the name node:&lt;br&gt;&lt;div style=&quot;margin-left: 40px;&quot;&gt;org.apache.hadoop.hdfs.server.namenode.NameNode&lt;br&gt;
org.apache.hadoop.mapred.JobTracker&lt;br&gt;&lt;/div&gt;&lt;br&gt;&lt;br&gt;The hardware is as follows:&lt;br&gt;Two data nodes, configured identically: Atom 330 processor, 2 GB RAM, 320 GB SATA 3.0 hard drive, Fedora Core 10.&lt;br&gt;One name node, running some AMD x86 processor, 2 GB RAM, 750 GB SATA drive, Fedora Core 10 (pieced together from spare parts).&lt;br&gt;

All connected over a 100 Mb network.&lt;br&gt;Admittedly this is low-end hardware, but I am doing this specifically as an exercise in using low-power (as in electricity) hardware.&lt;br&gt;&lt;br&gt;I can also provide config files if needed.&lt;br&gt;&lt;br clear=&quot;all&quot;&gt;
Jesse&lt;br&gt;&lt;br&gt;int GetRandomNumber()&lt;br&gt;{&lt;br&gt;    return 4; // Chosen by fair roll of dice&lt;br&gt;                 // Guaranteed to be random&lt;br&gt;} // &lt;a href=&quot;http://xkcd.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;xkcd.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/datanode.BlockAlreadyExistsException-tp25984146p25984146.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-25982331</id>
	<title>[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index</title>
	<published>2009-10-20T13:50:59Z</published>
	<updated>2009-10-20T13:50:59Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12767934#action_12767934&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12767934#action_12767934&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;David Stuart commented on NUTCH-760:
&lt;br&gt;------------------------------------
&lt;br&gt;&lt;br&gt;Thanks,
&lt;br&gt;&lt;br&gt;I will have another go. It's quite a big task getting my head around all of the
&lt;br&gt;ins and outs of nutch, but it's good to help contribute to a great product.
&lt;br&gt;&lt;br&gt;Regards,
&lt;br&gt;&lt;br&gt;Dave
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Allow field mapping from nutch to solr index
&lt;br&gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-760
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: indexer
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: David Stuart
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I am using nutch to crawl sites and have combined it
&lt;br&gt;&amp;gt; with solr pushing the nutch index using the solrindex command. I have
&lt;br&gt;&amp;gt; set it up as specified on the wiki using the copyField url to id in the
&lt;br&gt;&amp;gt; schema. Whilst this works fine, it stuffs up my inputs from other
&lt;br&gt;&amp;gt; sources in solr (e.g. using the solr data import handler) as they have
&lt;br&gt;&amp;gt; both ids and urls. I have a patch that implements a nutch xml schema
&lt;br&gt;&amp;gt; defining what basic nutch fields map to in your solr push.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-760%29-Allow-field-mapping-from-nutch-to-solr-index-tp25906464p25982331.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-25981273</id>
	<title>[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index</title>
	<published>2009-10-20T12:42:59Z</published>
	<updated>2009-10-20T12:42:59Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12767918#action_12767918&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12767918#action_12767918&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Andrzej Bialecki &amp;nbsp;commented on NUTCH-760:
&lt;br&gt;-----------------------------------------
&lt;br&gt;&lt;br&gt;A few comments to the latest patch:
&lt;br&gt;&lt;br&gt;* the description of the property in nutch-default.xml could be more descriptive ;)
&lt;br&gt;&lt;br&gt;* &amp;lt;schema&amp;gt; element has name and version attributes - do we really need these? It's not a Solr schema.xml anyway, so we don't have to pretend that we follow the same format.
&lt;br&gt;&lt;br&gt;* SolrSchemaReader uses static instance of NutchConfiguration - this is a big no-no, the whole point of using the property in nutch-default.xml is that you could set different values, and making this field static basically pins down the configuration to the version set on the first instantiation of the class ... Please do as other similar classes do - implement Configurable, or add Configuration to the constructor, and pass the current job configuration where appropriate.
&lt;br&gt;&lt;br&gt;* consequently, static references to SolrSchemaReader need to be un-staticized in other places.
&lt;br&gt;&lt;br&gt;* minor nits: code formatting should use indents of 2 literal spaces. There are some accidental changes in NutchBean and SolrWriter.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Allow field mapping from nutch to solr index
&lt;br&gt;&amp;gt; --------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: NUTCH-760
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/NUTCH-760&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/NUTCH-760&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Nutch
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: indexer
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: David Stuart
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I am using nutch to crawl sites and have combined it
&lt;br&gt;&amp;gt; with solr pushing the nutch index using the solrindex command. I have
&lt;br&gt;&amp;gt; set it up as specified on the wiki using the copyField url to id in the
&lt;br&gt;&amp;gt; schema. Whilst this works fine, it stuffs up my inputs from other
&lt;br&gt;&amp;gt; sources in solr (e.g. using the solr data import handler) as they have
&lt;br&gt;&amp;gt; both ids and urls. I have a patch that implements a nutch xml schema
&lt;br&gt;&amp;gt; defining what basic nutch fields map to in your solr push.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28NUTCH-760%29-Allow-field-mapping-from-nutch-to-solr-index-tp25906464p25981273.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-25980443</id>
	<title>Re: solr index question</title>
	<published>2009-10-20T11:48:15Z</published>
	<updated>2009-10-20T11:48:15Z</updated>
	<author>
		<name>David Stuart-6</name>
	</author>
	<content type="html">&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;

&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;
  &lt;head&gt;
    &lt;meta content=&quot;text/html; charset=UTF-8&quot; http-equiv=&quot;Content-Type&quot; /&gt;
    &lt;title&gt;&lt;/title&gt;
  &lt;/head&gt;

  &lt;body&gt;
    Hi Andrzej,&lt;br /&gt;
    &lt;br /&gt;
    Updated patch submitted, including SolrSearchBean modifications. Is anything else needed? If not, how do I get this into trunk?&lt;br /&gt;
    https://issues.apache.org/jira/browse/NUTCH-760&lt;br /&gt;
    &lt;br /&gt;
    Regards,&lt;br /&gt;
    &lt;br /&gt;
    &lt;br /&gt;
    Dave
  &lt;/body&gt;
&lt;/html&gt;
</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/solr-index-question-tp25880491p25980443.html" />
</entry>

</feed>
