nutch crawldb failed for java heap space


nutch crawldb failed for java heap space

by beyiwork

Hi everyone,

I ran into a problem with Nutch while testing it on indexing 2 million
pages.

When the job reaches the reduce stage of the crawldb update, it fails with
the error messages shown below. Before this test I crawled and indexed
1 million pages, and Nutch ran fine.

I changed HADOOP_HEAPSIZE in hadoop-env.sh to 1000m, 2000m, and even up to
my full memory size of 6 GB, but it made no difference. I also changed the
value of the "mapred.child.java.opts" property to 200m, 1000m, 2000m, 4000m,
and up to about 6 GB, again and again, but that made no difference either.

I have 9 machines (Core 2 CPU at 2.8 GHz, 6 GB RAM): 1 master and 8
slaves (tasktrackers).
The hadoop-env.sh and hadoop-site.xml files are attached after this
message. I would appreciate your help very much.
============================================================

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.HashMap.<init>(HashMap.java:209)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:43)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:70)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)


java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
        at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
        at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)

=====================================================================
hadoop-env.sh

# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/home/had/nutch-1.0
export HADOOP_CONF_DIR=/home/had/nutch-1.0/conf
export HADOOP_LOG_DIR=/home/had/nutch-1.0/logs
export NUTCH_HOME=/home/had/nutch-1.0
export NUTCH_CONF_DIR=/home/had/nutch-1.0/conf
# Extra Java CLASSPATH elements.  Optional.
# export HADOOP_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=4000
export NUTCH_HEAPSIZE=4000

==========================================================================

hadoop-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://distributed1:9000/</value>
  <description>The name of the default file system. Either the literal
string "local" or a host:port for DFS.</description>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>distributed1:9001</value>
  <description>The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and reduce
task.</description>
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>
    The maximum number of tasks that will be run simultaneously by
    a task tracker. This should be adjusted according to the heap size
    per task, the amount of RAM available, and CPU consumption of each task.
  </description>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>799</value>
  <description>
    This should be a prime number larger than multiple number of slave hosts,
    e.g. for 3 nodes set this to 17
  </description>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
  <!-- I also tested this value set to 4096, but it made no difference. -->
  <description>The size of buffer for use in sequence files.
  The size of this buffer should probably be a multiple of hardware
  page size (4096 on Intel x86), and it determines how much data is
  buffered during read and write operations.</description>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>29</value>
  <description>
    This should be a prime number close to a low multiple of slave hosts,
    e.g. for 3 nodes set this to 7
  </description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/had/nutch-1.0/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/home/had/nutch-1.0/filesystem/name</value>
  <description>Determines where on the local filesystem the DFS name node
should store the name table. If this is a comma-delimited list of
directories then the name table is replicated in all of the directories, for
redundancy. </description>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/had/nutch-1.0/filesystem/data</value>
  <description>Determines where on the local filesystem an DFS data node
should store its blocks. If this is a comma-delimited list of directories,
then data will be stored in all named directories, typically on different
devices. Directories that do not exist are ignored.</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications
can be specified when the file is created. The default is used if
replication is not specified in create time.</description>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1000m</value>
  <description>
    You can specify other Java options for each map or reduce task here,
    but most likely you will want to adjust the heap size.
  </description>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/home/had/nutch-1.0/filesystem/mapreduce/system</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/home/had/nutch-1.0/filesystem/mapreduce/local</value>
</property>

</configuration>
====================================================================

Re: nutch crawldb failed for java heap space

by beyiwork

Is there no one who can tell us how to keep a cluster running? This is so disappointing.

On Fri, Jul 3, 2009 at 12:21 AM, lei wang <nutchmaillist@...> wrote:

> [...]

Re: nutch crawldb failed for java heap space

by Julien Nioche-4

See https://issues.apache.org/jira/browse/NUTCH-702 for a patch that reduces
memory consumption.
You are not setting the heap size in the right way. It should be set in
hadoop-site.xml using the mapred.child.java.opts parameter. Changing
hadoop-env.sh modifies the amount of memory used by the services
(JobTracker, DataNodes, etc.) but not by the Hadoop tasks.
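
As a rough sketch of the setting described above (the -Xmx2000m figure is
only an assumed value for illustration, not a tested recommendation), the
property in hadoop-site.xml might look like the fragment below. Note that
the value has to be a complete JVM option such as -Xmx2000m rather than a
bare size, and that the heap per task multiplied by the number of tasks
running concurrently on a slave needs to stay below that slave's physical
RAM, with some headroom left for the DataNode and TaskTracker daemons.

```xml
<!-- Sketch only: -Xmx2000m is an assumed value chosen for illustration. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2000m</value>
  <description>
    JVM options passed to each map or reduce child task. The heap given
    here is per task, so the total per node is this value times the
    number of tasks the TaskTracker runs at the same time.
  </description>
</property>
```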

HTH

Julien



2009/7/2 lei wang <nutchmaillist@...>

> [...]



--
DigitalPebble Ltd
http://www.digitalpebble.com

Re: nutch crawldb failed for java heap space

by beyiwork

Thanks, but I think you did not read my issue carefully.
I have already tested the suggestion in your reply for a long time.

On Sun, Jul 5, 2009 at 9:46 PM, Julien Nioche <lists.digitalpebble@...> wrote:

> See https://issues.apache.org/jira/browse/NUTCH-702 for a patch to reduce
> the memory consumption.
> You are not setting the heapsize in the right way. This should be done in
> the hadoop-site.xml using the parameter mapred.child.java.opts. Changing
> hadoop-env modifies the amount of memory to be used for the services
> (JobTracker, DataNodes, etc...) but not for the hadoop tasks
>
> HTH
>
> Julien
>
>
>
> 2009/7/2 lei wang <nutchmaillist@...>
>
> >  Hi,everyone, these days a  nutch problem occur to me when I test nutch
> to
> > index 2 millions pages.
> >
> > When then program steps into the reduce stage of crawldb update, the
> error
> > messeges gives as below:
> > Before this test, I try to crawl and index 1 millions pages, nutch goes
> > well.
> > I alter  the HADOOP_HEAPSIZE  in the hadoop-env.sh,  1000m,2000m, even
>  to
> > my memory size 6GB, but make no different. And change the value of
> >  properties "mapred.child.java.opts" by 200m, 1000m, 2000m,4000m,~ 6GB
> time
> > and again , but it seems no difference.
> >
> > I have 9 machines(CPU core 2 2.8GHZ, 6GB RAM,) , 1 master ,and 8
> > slaves(tasktrack).
> > I attach the hadoop-env.sh and hadoop-site.xml files after this
> > messege, appreciate your help very much.
> > ============================================================
> >
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >        at java.util.HashMap.<init>(HashMap.java:209)
> >        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:43)
> >        at
> org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
> >        at
> >
> >
> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> >        at
> >
> >
> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> >        at
> > org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
> >        at
> org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
> >        at
> >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
> >        at
> >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
> >        at
> > org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:70)
> >        at
> > org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
> >        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
> >        at org.apache.hadoop.mapred.Child.main(Child.java:158)
> >
> >
> > java.lang.OutOfMemoryError: Java heap space
> >        at
> > java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
> >        at
> >
> >
> java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
> >        at
> > java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
> >        at
> > java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
> >        at
> >
> >
> org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
> >        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
> >        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
> >        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
> >        at
> > org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
> >        at
> > org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
> >        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
> >        at org.apache.hadoop.mapred.Child.main(Child.java:158)
> >
> > =====================================================================
> > hadoop-env.sh
> >
> > # Set Hadoop-specific environment variables here.
> > # The only required environment variable is JAVA_HOME.  All others are
> > # optional.  When running a distributed configuration it is best to
> > # set JAVA_HOME in this file, so that it is correctly defined on
> > # remote nodes.
> > # The java implementation to use.  Required.
> > # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
> > export JAVA_HOME=/usr/lib/jvm/java-6-sun
> > export HADOOP_HOME=/home/had/nutch-1.0
> > export HADOOP_CONF_DIR=/home/had/nutch-1.0/conf
> > export HADOOP_LOG_DIR=/home/had/nutch-1.0/logs
> > export NUTCH_HOME=/home/had/nutch-1.0
> > export NUTCH_CONF_DIR=/home/had/nutch-1.0/conf
> > # Extra Java CLASSPATH elements.  Optional.
> > # export HADOOP_CLASSPATH=
> > # The maximum amount of heap to use, in MB. Default is 1000.
> > export HADOOP_HEAPSIZE=4000
> > export NUTCH_HEAPSIZE=4000
> >
> >
> ==========================================================================
> >
> > hadoop-site.xml
> >
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > <!-- Put site-specific property overrides in this file. -->
> > <configuration>
> > <property>
> >  <name>fs.default.name</name>
> >  <value>hdfs://distributed1:9000/</value>
> >  <description>The name of the default file system. Either the literal
> > string "local" or a host:port for DFS.</description>
> > </property>
> > <property>
> >  <name>mapred.job.tracker</name>
> >  <value>distributed1:9001</value>
> >  <description>The host and port that the MapReduce job tracker runs at.
> If
> > "local", then jobs are run in-process as a single map and reduce
> > task.</description>
> > </property>
> > <property>
> >  <name>mapred.tasktracker.tasks.maximum</name>
> >  <value>2</value>
> >  <description>
> >    The maximum number of tasks that will be run simultaneously by
> >    a task tracker. This should be adjusted according to the heap size
> >    per task, the amount of RAM available, and CPU consumption of each
> task.
> >  </description>
> > </property>
> > <property>
> >  <name>mapred.map.tasks</name>
> >  <value>799</value>
> >  <description>
> >    This should be a prime number larger than multiple number of slave
> > hosts,
> >    e.g. for 3 nodes set this to 17
> >  </description>
> > </property>
> > <property>
> >  <name>io.file.buffer.size</name>
> >  <value>131072</value>(test  set this value to 4096, but make no
> > difference)
> >  <description>The size of buffer for use in sequence files.
> >  The size of this buffer should probably be a multiple of hardware
> >  page size (4096 on Intel x86), and it determines how much data is
> >  buffered during read and write operations.</description>
> > </property>
> > <property>
> >  <name>mapred.reduce.tasks</name>
> >  <value>29</value>
> >  <description>
> >    This should be a prime number close to a low multiple of slave hosts,
> >    e.g. for 3 nodes set this to 7
> >  </description>
> > </property>
> >
> > <property>
> >  <name>hadoop.tmp.dir</name>
> >  <value>/home/had/nutch-1.0/tmp</value>
> >  <description>A base for other temporary directories.</description>
> > </property>
> > <property>
> >  <name>dfs.name.dir</name>
> >  <value>/home/had/nutch-1.0/filesystem/name</value>
> >  <description>Determines where on the local filesystem the DFS name node
> > should store the name table. If this is a comma-delimited list of
> > directories then the name table is replicated in all of the directories,
> > for
> > redundancy. </description>
> > </property>
> > <property>
> >  <name>dfs.data.dir</name>
> >  <value>/home/had/nutch-1.0/filesystem/data</value>
> >  <description>Determines where on the local filesystem a DFS data node
> > should store its blocks. If this is a comma-delimited list of directories,
> > then data will be stored in all named directories, typically on different
> > devices. Directories that do not exist are ignored.</description>
> > </property>
> > <property>
> >  <name>dfs.replication</name>
> >  <value>1</value>
> >  <description>Default block replication. The actual number of replications
> > can be specified when the file is created. The default is used if
> > replication is not specified at create time.</description>
> > </property>
> > <property>
> >  <name>mapred.child.java.opts</name>
> >  <value>-Xmx1000m</value>
> >  <description>
> >    You can specify other Java options for each map or reduce task here,
> >    but most likely you will want to adjust the heap size.
> >  </description>
> > </property>
> > <property>
> >  <name>mapred.system.dir</name>
> >  <value>/home/had/nutch-1.0/filesystem/mapreduce/system</value>
> > </property>
> > <property>
> >  <name>mapred.local.dir</name>
> >  <value>/home/had/nutch-1.0/filesystem/mapreduce/local</value>
> > </property>
> >
> > </configuration>
> > ====================================================================
> >
>
>
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>

Re: nutch crawldb failed for java heap space

by beyiwork

Could you give me the detailed solution? Thanks.

On Sun, Jul 5, 2009 at 10:06 PM, lei wang <nutchmaillist@...> wrote:

>
> Thanks, but I think you did not read my issue carefully.
> I have already tested the suggestion in your reply for a long time.
>
>
> On Sun, Jul 5, 2009 at 9:46 PM, Julien Nioche <lists.digitalpebble@...> wrote:
>
>> See https://issues.apache.org/jira/browse/NUTCH-702 for a patch to reduce
>> the memory consumption.
>> You are not setting the heap size in the right way. This should be done in
>> hadoop-site.xml using the parameter mapred.child.java.opts. Changing
>> hadoop-env.sh modifies the amount of memory used by the services
>> (JobTracker, DataNodes, etc.) but not by the Hadoop tasks.
>>
>> HTH
>>
>> Julien
>
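
For reference, a minimal sketch of the setting Julien describes in the quoted reply, written as a hadoop-site.xml fragment. The 2000m heap value is only an illustrative assumption, not a recommendation from the thread, and the sizing rule of thumb in the comment (concurrent task JVMs times heap per JVM should stay below physical RAM) is likewise an assumption added here.

```xml
<!-- hadoop-site.xml: heap for each map/reduce child JVM.
     Per the quoted reply, HADOOP_HEAPSIZE in hadoop-env.sh sizes the
     services (JobTracker, DataNodes, etc.), not these task JVMs. -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- Illustrative value only. Each concurrently running task gets its own
       JVM with this limit, so (concurrent tasks per node) x (this heap)
       should stay well below the node's 6 GB of RAM. -->
  <value>-Xmx2000m</value>
</property>
```

If a larger heap alone makes no difference, as reported at the start of this thread, the NUTCH-702 patch mentioned in the quoted reply is the other avenue suggested there.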