I am still getting the same errors.
I ran fsck (rebooted with a forced fsck at startup). No issues were reported.
I increased the ulimit to 8192.
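One caveat with the ulimit change: a daemon started remotely may not inherit the limit seen in an interactive shell. A minimal Python sketch of reading the open-file limit from inside a process follows. It assumes a Unix host, uses the 4096 floor suggested in the quoted reply below, and the helper names are made up for illustration.

```python
import resource

def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def soft_limit_ok(soft, minimum=4096):
    """True if the soft limit is unlimited or higher than `minimum`."""
    return soft == resource.RLIM_INFINITY or soft > minimum

if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft=%s hard=%s ok=%s" % (soft, hard, soft_limit_ok(soft)))
```

Running it as the same user, and through the same login path that starts the Hadoop daemons, gives a closer picture than checking from a normal terminal.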
I was using /etc/hosts for all name lookups (identical across all machines and copied from the same location). I have since modified hadoop-site.xml and the slaves file to use IP addresses only.
Using ifconfig and ping, and looking at /etc/sysconfig/network, I've determined that all the machines are who they think they are.
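For anyone wanting to repeat the name-resolution check on each node, here is a rough Python sketch. It resolves a list of names on the local machine and reports any that differ from the expected addresses. The host names in the example are hypothetical (only the IP addresses appear in this thread), and the two helper functions are illustrative only.

```python
import socket

def resolve_all(names, resolver=socket.gethostbyname):
    """Map each name to the address it resolves to here (None on failure)."""
    result = {}
    for name in names:
        try:
            result[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            result[name] = None
    return result

def mismatches(expected, resolved):
    """Return {name: (expected_ip, resolved_ip)} for names that disagree."""
    return {
        name: (ip, resolved.get(name))
        for name, ip in expected.items()
        if resolved.get(name) != ip
    }

if __name__ == "__main__":
    # Hypothetical host names; replace with the names in your /etc/hosts.
    expected = {
        "master": "192.168.1.3",
        "node6": "192.168.1.6",
        "node7": "192.168.1.7",
    }
    print(mismatches(expected, resolve_all(expected)))
```

An empty result on every node would suggest that all machines agree on the name-to-address mapping.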
It may also be worth noting that I get the following WARN in the logs after the BlockAlreadyExistsException. I am seeing the same on both datanodes (just swap the IP addresses):
2009-10-21 21:13:03,415 WARN datanode.DataNode - DatanodeRegistration(192.168.1.7:50010, storageID=DS-1226842861-192.168.1.7-50010-1254609174303, infoPort=50075, ipcPort=50020):Failed to transfer blk_-2053461958845826983_3919 to 192.168.1.6:50010 got java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:456)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:557)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:199)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
    at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1108)
    at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: Connection reset by peer
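To see how often this happens and between which nodes, the WARN lines can be pulled out of the datanode log with a small script. The sketch below is only a guess at a useful filter: the regular expression is written against the single log line shown above and may need adjusting for other Hadoop versions, and the function name is made up.

```python
import re

# Written against the "Failed to transfer" WARN line shown above.
PATTERN = re.compile(
    r"DatanodeRegistration\((?P<src>[\d.]+:\d+),.*?"
    r"Failed to transfer (?P<block>blk_-?\d+_\d+) to (?P<dst>[\d.]+:\d+)"
)

def failed_transfers(lines):
    """Return (block, source, destination) for each matching log line."""
    found = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found.append(
                (match.group("block"), match.group("src"), match.group("dst"))
            )
    return found
```

Feeding it the lines of a datanode log file (for example `failed_transfers(open(path))`, with `path` pointing at wherever your logs live) lists every block whose transfer failed, along with the sending and receiving datanode.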
I am able to run generate, fetch, updatedb, etc. As near as I can tell, things seem to be working, though I wouldn't necessarily know if I were missing something. No errors are displayed on the command line, and every iteration seems to grow the index, segments, and linkdb accordingly.
Here is the hadoop-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.1.3:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://192.168.1.3:9001</value>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>1</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/nutch/crawl/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/nutch/crawl/filesystem/data</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/nutch/crawl/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/nutch/crawl/filesystem/mapreduce/local</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
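Since the same file is supposed to be identical on every node, one way to confirm that is to load the properties on each machine and compare them. The following is a minimal Python sketch under that assumption. The function names are hypothetical, and it only understands the simple `<property><name>/<value>` layout used in the file above.

```python
import xml.etree.ElementTree as ET

def load_properties(xml_text):
    """Return {name: value} from <property> elements, outer whitespace stripped."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props

def diff_properties(a, b):
    """Return {name: (value_in_a, value_in_b)} for every property that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

Calling `load_properties` on the text of each node's hadoop-site.xml and passing two of the results to `diff_properties` should return an empty dict if the copies really match.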
Jesse
int GetRandomNumber()
{
    return 4; // Chosen by fair roll of dice
              // Guaranteed to be random
} // xkcd.com
On Wed, Oct 21, 2009 at 6:01 PM, Jesse Hires <jhires@...> wrote:
Thanks for the pointers!
As soon as I have some results, I'll post them back and let you know if the problem is solved.
Jesse
On Wed, Oct 21, 2009 at 4:46 AM, Andrzej Bialecki <ab@...> wrote:
Jesse Hires wrote:
I tried asking this over at the nutch-user alias, but I am seeing very little traction, so I thought I'd ask the developers. I realize this is most likely a configuration problem on my end, but I am very new to using nutch, so I am having a difficult time understanding where I need to look.
Does anyone have any insight into the following error I am seeing in the hadoop logs? Is this something I should be concerned with, or is it expected that this shows up in the logs from time to time? If it is not expected, where can I look for more information on what is going on?
It's not expected at all. This usually indicates a config error or FS corruption. It may also be caused by conflicting DNS (e.g. the same name resolving to different addresses on different nodes), or by a problem with permissions (e.g. a daemon started remotely uses a uid/permissions/env that doesn't allow it to create/delete files in the data dir). It may also be a weird corner case where processes run out of file descriptors: you should check ulimit -n and set it to a value higher than 4096.
Please also run fsck / and see what it says.
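The permissions possibility mentioned above can be tested directly by trying to create and delete a file in the data directory. Below is a small Python sketch of such a check. The function name is made up for illustration, and the check is only meaningful when run as the same user (and with the same environment) that starts the datanode.

```python
import os
import tempfile

def can_create_and_delete(directory):
    """True if the current user can create and then delete a file in `directory`."""
    try:
        fd, path = tempfile.mkstemp(prefix="perm-check-", dir=directory)
        os.close(fd)
        os.remove(path)
        return True
    except OSError:
        return False
```

For the configuration posted in this thread, the directory to test on each datanode would be the dfs.data.dir value, /home/nutch/crawl/filesystem/data.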
I can also provide config files if needed.
We need just the modifications in hadoop-site.xml, that's where the problem may be located.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com