MergeSegments - map reduce thread death

Hi there,

It seems I have some serious problems with Hadoop during map-reduce for MergeSegments. I am out of ideas on this; any suggestions would be very welcome.

Here is my setup:

RAM: 4G
JVM HEAP: 2G
mapred.child.java.opts = 1024M
hadoop-0.19.1-core.jar
nutch-1.0
Xen VPS

After running a recrawl a few times, I end up with one segment that is much larger than the newly generated ones. Here is my segments structure when things blow up after a (5th) recrawl:

segment1 = 674Megs (after several recrawls)
segment2 = 580k (last recrawl)
segment3 = 568k (last recrawl)
segment4 = 584k (last recrawl)
..
segment8 = 560k (last recrawl)

When I run mergeSegments, everything goes well until the map-reduce reaches 90%, and then we get a thread death. Here is the stack trace:

2009-11-05 10:54:16,874 INFO [org.apache.hadoop.mapred.LocalJobRunner] reduce > reduce
2009-11-05 10:54:29,794 INFO [org.apache.hadoop.mapred.LocalJobRunner] reduce > reduce
2009-11-05 10:54:55,194 INFO [org.apache.hadoop.mapred.LocalJobRunner] reduce > reduce
2009-11-05 10:57:25,844 WARN [org.apache.hadoop.mapred.LocalJobRunner] job_local_0001
java.lang.ThreadDeath
    at java.lang.Thread.stop(Thread.java:715)
    at org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310)
    at org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1239)
    at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:620)
    at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:665)

Any suggestions, please!

Thanks.
|
|
Re: MergeSegments - map reduce thread death

fadzi@... wrote:
> when i run mergeSegments everything goes well until we get up to 90% of
> the map-reduce and we get a thread death; here is a stack trace
> [...]
> java.lang.ThreadDeath
> at java.lang.Thread.stop(Thread.java:715)
> [...]
> any suggestions please!!!!

This is a high-level exception that doesn't indicate the nature of the original problem. Is there any other information in hadoop.log or in the task logs (logs/userlogs)?

In my experience this sort of thing happens rarely with a dataset as small as yours, so you are lucky ;) It could be related to a number of issues: running under Xen, which imposes some limits and slowdowns; a low file descriptor limit (ulimit -n); faulty RAM; or an overheated CPU ...

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
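The two checks suggested above (look for earlier warnings or exceptions in hadoop.log, and check the file descriptor limit) could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the line format is assumed from the log excerpt quoted in this thread, and the `resource` module is Unix-only.

```python
import re
import resource  # Unix-only; reports the same limit as `ulimit -n`

# Matches lines shaped like the excerpt quoted in this thread, e.g.
# "2009-11-05 10:57:25,844 WARN [org.apache...LocalJobRunner] job_local_0001"
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (WARN|ERROR|FATAL)\s+(.*)$"
)
# Bare exception lines such as "java.lang.ThreadDeath" or
# "java.lang.OutOfMemoryError: Java heap space"
EXCEPTION_LINE = re.compile(
    r"^(?:Caused by: )?([\w.$]+(?:Exception|Error|ThreadDeath))\b"
)


def find_problems(lines):
    """Return (kind, text) tuples for WARN/ERROR/FATAL entries and exception lines."""
    hits = []
    for raw in lines:
        line = raw.strip()
        m = LOG_LINE.match(line)
        if m:
            hits.append((m.group(2), m.group(3)))
            continue
        m = EXCEPTION_LINE.match(line)
        if m:
            hits.append(("EXCEPTION", m.group(1)))
    return hits


def open_file_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

To use it, pass an open file object for the log (for example `logs/hadoop.log`, an assumed location) to `find_problems` and read the hits in order; anything logged before the ThreadDeath entry is the more likely lead.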
|
|
Re: MergeSegments - map reduce thread death

I suggest turning on debug logging for Hadoop before you do the next crawl. You can do this by editing log4j.properties and changing the rootLogger level from INFO to DEBUG.

On Thu, Nov 5, 2009 at 12:37 AM, Andrzej Bialecki <ab@...> wrote:
> This is a high-level exception that doesn't indicate the nature of the
> original problem. Is there any other information in hadoop.log or in task
> logs (logs/userlogs)?
> [...]
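Setting the root logger to DEBUG turns on debug output for every class, which can make the log grow very quickly. A narrower variant is to leave the root logger at INFO and raise only selected packages. The fragment below is a sketch in log4j 1.x properties syntax; the appender name `DRFA` is an assumption about the existing file (keep whatever appender your log4j.properties already names), and the choice of packages is only a guess at where useful detail might appear.

```properties
# log4j.properties (log4j 1.x syntax)
# Keep the root logger at INFO; DRFA is an assumed appender name.
log4j.rootLogger=INFO,DRFA

# Raise only the packages of interest to DEBUG (assumed choice).
log4j.logger.org.apache.hadoop.mapred=DEBUG
log4j.logger.org.apache.nutch.segment=DEBUG
```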
|
|
Re: MergeSegments - map reduce thread death

I tried this once, but before I knew it my log file was approaching a gig within an hour or so!

> I suggest maybe turning the debug logs on for hadoop before you do the
> next crawl... you can do this by editing log4j.properties
> and change the rootLogger from INFO to DEBUG
> [...]
|
|
Re: MergeSegments - map reduce thread death

Hi there,

We tried a few things around this. One suggestion was to run it on a local machine, so I pulled one of our decent servers and got to work... but surprisingly we got the same error on the local machine! So it seems the hardware (VPS/local) wasn't the culprit; it is probably the data or the code.

So we decided to discard the db and generate a new one. Things seem to be working normally so far, but let's see what happens when the db becomes larger.

Having said that, there were a few things we found out, and we need clarification on whether or not they were a cause of the problems. Here is the scenario, in order of execution:

Step 1 setup.
* first crawl was done using "bin/nutch crawl.."
  - urls = 1500
  - depth = 10
  - topN = 500 (so it should do all by round 3, right? what happens at rounds 4 to 10?)

Steps 2 to 5 setup.
* recrawl (repeat)
  - topN = 10000
  - depth = 10
  - db.default.fetch.interval = 30 (doesn't seem to do anything)
  - generate.update.crawldb = false (the same fetchlist was being generated)
  - injected seed urls again (bad! we didn't realise this was happening, but what is the effect of doing this?)
  - fetch
  - update db
(the steps above were an effort to get an incremental crawl..)

Step 6
* merge segments, invertlinks, indexes...
  - at this stage map reduce just died during MergeSegments, with an out of heap memory exception.

Our assumption was that, with a seed url list of 1500, Nutch would generate more NEW urls from the crawldb based on the outlinks it found. Is this true? It did not seem to be the case.

Also, what is the effect of running a recrawl with a topN larger than what Nutch can generate?

> i tried this once but before i knew it my log file was approaching a gig
> within an hour or so!
> [...]
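As a rough check on the "round 3" arithmetic in step 1: if each round fetches at most topN URLs and no new URLs were ever added to the crawldb, 1500 seeds at topN = 500 would be exhausted after 3 rounds. The toy model below is only that arithmetic, not Nutch's actual generator (which also takes scores and fetch intervals into account); `new_per_round` is a hypothetical parameter standing in for outlinks added to the crawldb by each update.

```python
import math


def rounds_to_exhaust(seed_count, top_n):
    """Rounds needed to fetch every seed if no new URLs are ever discovered."""
    return math.ceil(seed_count / top_n)


def simulate(seed_count, top_n, depth, new_per_round=0):
    """Toy model: each round fetches min(top_n, unfetched) URLs, then
    new_per_round newly discovered URLs join the unfetched pool
    (only if something was fetched in that round).
    Returns the number of URLs fetched in each round."""
    unfetched = seed_count
    fetched_per_round = []
    for _ in range(depth):
        batch = min(top_n, unfetched)
        fetched_per_round.append(batch)
        discovered = new_per_round if batch else 0
        unfetched = unfetched - batch + discovered
    return fetched_per_round
```

In this model, `simulate(1500, 500, 10)` fetches 500 URLs in each of the first three rounds and nothing afterwards, so rounds 4 to 10 only have work if earlier rounds discovered new URLs. Likewise, a topN larger than the pool (for example 10000 against 1500 unfetched URLs) acts only as an upper bound: the round fetches whatever is available.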