Nutch crawldb update fails with Java heap space error
Hi everyone, I have run into a problem while testing Nutch on an index
of 2 million pages.
When the program reaches the reduce stage of the crawldb update, it fails
with the error messages shown below.
Before this test I crawled and indexed 1 million pages, and Nutch ran
without problems.
I changed HADOOP_HEAPSIZE in hadoop-env.sh to 1000 MB, 2000 MB, and even
up to my full memory size of 6 GB, but it made no difference. I also changed
the value of the "mapred.child.java.opts" property to 200m, 1000m, 2000m,
4000m, and up to about 6 GB, again and again, but that made no difference
either.
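To rule out a simple editing mistake, the value that is actually present in the config file can be read back with a few lines of shell. This is only a rough sketch: it writes a sample property to a temporary file so that it runs on its own, and the helper name child_xmx is made up for illustration. On the cluster it would be pointed at the real hadoop-site.xml instead. It assumes the <value> line directly follows the <name> line, as in the attached file.

```shell
#!/bin/sh
# Hypothetical helper: print the -Xmx flag found in the
# mapred.child.java.opts property of a Hadoop config file.
# Assumes <value> is on the line right after <name>.
child_xmx() {
    sed -n '/<name>mapred\.child\.java\.opts<\/name>/{
        n
        s/.*<value>.*\(-Xmx[0-9][0-9]*[kKmMgG]\).*<\/value>.*/\1/p
    }' "$1"
}

# Sample file so the sketch is self-contained; replace with the real path.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx1000m</value>
</property>
EOF

child_xmx "$sample"
rm -f "$sample"
```

Running the same helper on every slave would show whether all nodes carry the same setting.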
I have 9 machines (each with a Core 2 CPU at 2.8 GHz and 6 GB RAM): 1 master
and 8 slaves (tasktrackers).
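A rough budget of how much heap the task JVMs on one slave could claim under the settings I tried is sketched below. The numbers come from this post (2 tasks per tasktracker, 6 GB RAM per machine). The function name heap_budget_mb is made up, and the sketch assumes the tasktracker really runs at most 2 tasks at a time; if the Hadoop version applies separate map and reduce slot limits, the task count would differ. It also ignores the heap of the tasktracker and datanode daemons and any JVM overhead beyond -Xmx, so the totals are a lower bound rather than an exact figure.

```shell
#!/bin/sh
# Hypothetical helper: total heap (MB) that concurrently running
# task JVMs on one slave may claim.
# $1 = concurrent tasks per tasktracker, $2 = -Xmx per task in MB
heap_budget_mb() {
    echo $(( $1 * $2 ))
}

ram_mb=6144   # 6 GB of RAM per slave, as described above
tasks=2       # mapred.tasktracker.tasks.maximum in the attached hadoop-site.xml

# Heap sizes (MB) tried for mapred.child.java.opts
for xmx in 200 1000 2000 4000; do
    total=$(heap_budget_mb "$tasks" "$xmx")
    if [ "$total" -gt "$ram_mb" ]; then
        note="exceeds physical RAM"
    else
        note="fits in physical RAM"
    fi
    echo "-Xmx${xmx}m x ${tasks} tasks = ${total} MB (${note})"
done
```

Under these assumptions, only the 4000m setting asks for more heap than one slave physically has.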
I have attached my hadoop-env.sh and hadoop-site.xml files after the error
messages. I would appreciate your help very much.
============================================================
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.HashMap.<init>(HashMap.java:209)
at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:43)
at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:70)
at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
=====================================================================
hadoop-env.sh
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/home/had/nutch-1.0
export HADOOP_CONF_DIR=/home/had/nutch-1.0/conf
export HADOOP_LOG_DIR=/home/had/nutch-1.0/logs
export NUTCH_HOME=/home/had/nutch-1.0
export NUTCH_CONF_DIR=/home/had/nutch-1.0/conf
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=4000
export NUTCH_HEAPSIZE=4000
==========================================================================
hadoop-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://distributed1:9000/</value>
<description>The name of the default file system. Either the literal
string "local" or a host:port for DFS.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>distributed1:9001</value>
<description>The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and reduce
task.</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>
The maximum number of tasks that will be run simultaneously by
a task tracker. This should be adjusted according to the heap size
per task, the amount of RAM available, and CPU consumption of each task.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>799</value>
<description>
This should be a prime number larger than multiple number of slave
hosts,
e.g. for 3 nodes set this to 17
</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<!-- I also tested with this value set to 4096, but it made no difference. -->
<description>The size of buffer for use in sequence files.
The size of this buffer should probably be a multiple of hardware
page size (4096 on Intel x86), and it determines how much data is
buffered during read and write operations.</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>29</value>
<description>
This should be a prime number close to a low multiple of slave hosts,
e.g. for 3 nodes set this to 7
</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/had/nutch-1.0/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/had/nutch-1.0/filesystem/name</value>
<description>Determines where on the local filesystem the DFS name node
should store the name table. If this is a comma-delimited list of
directories then the name table is replicated in all of the directories, for
redundancy. </description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/had/nutch-1.0/filesystem/data</value>
<description>Determines where on the local filesystem an DFS data node
should store its blocks. If this is a comma-delimited list of directories,
then data will be stored in all named directories, typically on different
devices. Directories that do not exist are ignored.</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications
can be specified when the file is created. The default is used if
replication is not specified in create time.</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx1000m</value>
<description>
You can specify other Java options for each map or reduce task here,
but most likely you will want to adjust the heap size.
</description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/had/nutch-1.0/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/had/nutch-1.0/filesystem/mapreduce/local</value>
</property>
</configuration>
====================================================================