nutch crawldb failed for java heap space

Hi everyone,

I have run into a problem while testing Nutch on an index of 2 million pages. When the job reaches the reduce stage of the crawldb update, it fails with the error messages shown below. Before this test I crawled and indexed 1 million pages, and Nutch ran fine.

I changed HADOOP_HEAPSIZE in hadoop-env.sh to 1000m, 2000m, and even up to my full memory size of 6 GB, but it made no difference. I also changed the value of the property "mapred.child.java.opts" to 200m, 1000m, 2000m, 4000m, and up to about 6 GB, again and again, but that seemed to make no difference either.

I have 9 machines (Core 2 CPU at 2.8 GHz, 6 GB RAM): 1 master and 8 slaves (tasktrackers). The hadoop-env.sh and hadoop-site.xml files are included after this message. I would appreciate your help very much.

============================================================

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.HashMap.<init>(HashMap.java:209)
    at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:43)
    at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
    at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:70)
    at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

java.lang.OutOfMemoryError: Java heap space
    at java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
    at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
    at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
    at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
    at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
    at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
    at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
    at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
    at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
    at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

=====================================================================
hadoop-env.sh

# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/home/had/nutch-1.0
export HADOOP_CONF_DIR=/home/had/nutch-1.0/conf
export HADOOP_LOG_DIR=/home/had/nutch-1.0/logs
export NUTCH_HOME=/home/had/nutch-1.0
export NUTCH_CONF_DIR=/home/had/nutch-1.0/conf
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=4000
export NUTCH_HEAPSIZE=4000

==========================================================================
hadoop-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://distributed1:9000/</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for DFS.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>distributed1:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>2</value>
    <description>The maximum number of tasks that will be run simultaneously by a task tracker. This should be adjusted according to the heap size per task, the amount of RAM available, and CPU consumption of each task.</description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>799</value>
    <description>This should be a prime number larger than multiple number of slave hosts, e.g. for 3 nodes set this to 17</description>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <!-- I also tested this value at 4096, but it made no difference. -->
    <value>131072</value>
    <description>The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>29</value>
    <description>This should be a prime number close to a low multiple of slave hosts, e.g. for 3 nodes set this to 7</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/had/nutch-1.0/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/had/nutch-1.0/filesystem/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/had/nutch-1.0/filesystem/data</value>
    <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1000m</value>
    <description>You can specify other Java options for each map or reduce task here, but most likely you will want to adjust the heap size.</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/had/nutch-1.0/filesystem/mapreduce/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/had/nutch-1.0/filesystem/mapreduce/local</value>
  </property>
</configuration>
====================================================================
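As a rough illustration of the sizing advice in the description of mapred.tasktracker.tasks.maximum in the posted hadoop-site.xml (heap per task versus available RAM), the Python sketch below adds up the worst-case heap one slave could be asked for. It assumes each slave runs two daemons (a DataNode and a TaskTracker), that each daemon is capped at HADOOP_HEAPSIZE, and that the task limit counts child JVMs running at the same time. These assumptions and the function name are illustrative and are not taken from the thread.

```python
def worst_case_heap_mb(daemon_heap_mb, daemons_per_node, child_heap_mb, max_tasks):
    """Upper bound on the Java heap (in MB) one node could be asked to provide."""
    return daemon_heap_mb * daemons_per_node + child_heap_mb * max_tasks


RAM_MB = 6 * 1024  # 6 GB per machine, as stated in the post

# Posted settings: HADOOP_HEAPSIZE=4000, -Xmx1000m, 2 tasks per tasktracker.
posted = worst_case_heap_mb(4000, 2, 1000, 2)
# One of the larger child heaps mentioned in the post: 4000m.
tried = worst_case_heap_mb(4000, 2, 4000, 2)

for label, total in (("posted", posted), ("tried", tried)):
    status = "fits in" if total <= RAM_MB else "exceeds"
    print(f"{label}: {total} MB worst case, {status} {RAM_MB} MB of RAM")
```

These figures are ceilings rather than actual usage, since a JVM only grows toward its -Xmx limit as needed. A total above physical RAM therefore points to a risk of swapping, not to a certain failure.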
Re: nutch crawldb failed for java heap space

There is no one to tell us how to manage a running cluster. I am so disappointed.

On Fri, Jul 3, 2009 at 12:21 AM, lei wang <nutchmaillist@...> wrote:
> Hi,everyone, these days a nutch problem occur to me when I test nutch to
> index 2 millions pages.
> [...]
Re: nutch crawldb failed for java heap space

See https://issues.apache.org/jira/browse/NUTCH-702 for a patch to reduce the memory consumption.

You are not setting the heap size in the right way. This should be done in hadoop-site.xml using the parameter mapred.child.java.opts. Changing hadoop-env.sh modifies the amount of memory used by the services (JobTracker, DataNodes, etc.) but not by the Hadoop tasks.

HTH

Julien

2009/7/2 lei wang <nutchmaillist@...>
> Hi,everyone, these days a nutch problem occur to me when I test nutch to
> index 2 millions pages.
> [...]

--
DigitalPebble Ltd
http://www.digitalpebble.com
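To confirm which heap limit the task JVMs would pick up from the site configuration, the value can be read back from the file. Below is a minimal Python sketch, assuming the property layout of the hadoop-site.xml posted earlier in the thread. The helper name child_heap_mb and the inline sample are illustrative; on a real cluster the text would come from the configuration file on each slave.

```python
import re
import xml.etree.ElementTree as ET


def child_heap_mb(site_xml_text):
    """Return the -Xmx value (in MB) from mapred.child.java.opts, or None if unset."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "mapred.child.java.opts":
            opts = prop.findtext("value", "")
            match = re.search(r"-Xmx(\d+)([kKmMgG]?)", opts)
            if match is None:
                return None
            size, unit = int(match.group(1)), match.group(2).lower()
            # Convert the JVM size suffix to megabytes (no suffix means bytes).
            factor = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}[unit]
            return size * factor
    return None


# Inline sample mirroring the property as posted in the thread.
SAMPLE = """<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1000m</value>
  </property>
</configuration>"""

print(child_heap_mb(SAMPLE))
```

Running the same check against the file on every slave would show whether an edited value was actually distributed to the tasktrackers, which is one thing worth ruling out when a change appears to have no effect.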
Re: nutch crawldb failed for java heap space

Thanks, but I think you did not read my issue carefully. I have already tested the suggestion in your reply for a long time.

On Sun, Jul 5, 2009 at 9:46 PM, Julien Nioche <lists.digitalpebble@...> wrote:
> See https://issues.apache.org/jira/browse/NUTCH-702 for a patch to reduce
> the memory consumption.
> You are not setting the heapsize in the right way. This should be done in
> the hadoop-site.xml using the parameter mapred.child.java.opts. Changing
> hadoop-env modifies the amount of memory to be used for the services
> (JobTracker, DataNodes, etc...) but not for the hadoop tasks
> [...]
Re: nutch crawldb failed for java heap space

Would you give me the detailed solution? Thanks.
On Sun, Jul 5, 2009 at 10:06 PM, lei wang <nutchmaillist@...> wrote:
> Thanks, but I think you did not read my issue carefully.
> I have already tested the suggestion in your reply for a long time.
>
> On Sun, Jul 5, 2009 at 9:46 PM, Julien Nioche <lists.digitalpebble@...> wrote:
>> See https://issues.apache.org/jira/browse/NUTCH-702 for a patch to reduce
>> the memory consumption.
>> You are not setting the heap size in the right way. This should be done in
>> hadoop-site.xml using the parameter mapred.child.java.opts. Changing
>> hadoop-env.sh modifies the amount of memory used by the services
>> (JobTracker, DataNodes, etc.) but not by the Hadoop tasks.
>>
>> HTH
>>
>> Julien
>>
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
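
For reference, Julien's quoted advice about setting the task heap in hadoop-site.xml could be sketched roughly as follows. Both property names come from the configuration posted in this thread; the 2000m value is only an illustrative assumption (one of the sizes already tried above), and nothing here is confirmed to resolve the OutOfMemoryError.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the values are illustrative assumptions, not tested settings. -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <!-- Heap for each map or reduce child JVM. HADOOP_HEAPSIZE in
         hadoop-env.sh does not apply to these processes. -->
    <value>-Xmx2000m</value>
  </property>
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <!-- Assumed budget on a 6 GB slave: 2 tasks x 2000 MB = 4000 MB for
         tasks, leaving roughly 2 GB for the daemons and the OS. -->
    <value>2</value>
  </property>
</configuration>
```

With HADOOP_HEAPSIZE=4000 as in the posted hadoop-env.sh, the daemons alone may be allowed to claim more than that remainder, so the two settings probably need to be sized together.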