|
View:
New views
12 Messages
—
Rating Filter:
Alert me
|
|
|
updatedb is talking long long timehi everyone
I'm using nutch 1.0. I have fetched successfully and currently on the updatedb process. I'm doing updatedb and its taking so long. I don't know why its taking this long. I have a new machine with quad core processor and 8 gb of ram. I believe this system is really good in terms of processing power. I don't think processing power is the problem here. I noticed that all the ram is getting using up. close to 7.7gb by the updatedb process. The computer is becoming is really slow. The updatedb process has been running for the last 19 days continually with the message merging segment data into db.. Does anyone know why its taking so long... Is there any configuration setting i can do to increase the speed of the updatedb process... Thanks in advance for any help... Mathan root@trweb10:/opt/nutch-1.0# bin/nutch updatedb Using configuration below /opt/****/jdk1.6.0_16 Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions] crawldb CrawlDb to update -dir segments parent directory containing all segments to update from seg1 seg2 ... list of segment names to update from -force force update even if CrawlDb appears to be locked (CAUTION advised) -normalize use URLNormalizer on urls in CrawlDb and segment (usually not needed) -filter use URLFilters on urls in CrawlDb and segment -noAdditions only update already existing URLs, don't add any newly discovered URLs root@trweb10:/opt/nutch-1.0# bin/nutch updatedb Crawl/db/ Crawl/segments/200909 20090906232208/ 20090909074026/ 20090909101115/ 20090909124554/ 20090914115913/ 20090915141615/ root@trweb10:/opt/tsweb/nutch-1.0# bin/nutch updatedb Crawl/db/ Crawl/segments/20090915141615/ -force Using configuration below conf_tamilsweb /opt/tsweb/jdk1.6.0_16 CrawlDb update: starting CrawlDb update: db: Crawl/db CrawlDb update: segments: [Crawl/segments/20090915141615] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: false CrawlDb update: URL filtering: false CrawlDb update: Merging segment data into db. |
|
|
Re: updatedb is talking long long timeI forgot to add the detail...
The segment i'm trying to do updatedb on has 1.3 millions urls fetched and 1.08 million urls parsed.. Any help related to this would be appreciated... On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran <mathan55@...> wrote: > hi everyone > > I'm using nutch 1.0. I have fetched successfully and currently on the > updatedb process. I'm doing updatedb and its taking so long. I don't > know why its taking this long. I have a new machine with quad core > processor and 8 gb of ram. > > I believe this system is really good in terms of processing power. I > don't think processing power is the problem here. I noticed that all > the ram is getting using up. close to 7.7gb by the updatedb process. > The computer is becoming is really slow. > > The updatedb process has been running for the last 19 days continually > with the message merging segment data into db.. Does anyone know why > its taking so long... Is there any configuration setting i can do to > increase the speed of the updatedb process... > > Thanks in advance for any help... > Mathan > > root@trweb10:/opt/nutch-1.0# bin/nutch updatedb > Using configuration below > /opt/****/jdk1.6.0_16 > Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) > [-force] [-normalize] [-filter] [-noAdditions] > crawldb CrawlDb to update > -dir segments parent directory containing all segments to update from > seg1 seg2 ... list of segment names to update from > -force force update even if CrawlDb appears to be locked > (CAUTION advised) > -normalize use URLNormalizer on urls in CrawlDb and > segment (usually not needed) > -filter use URLFilters on urls in CrawlDb and segment > -noAdditions only update already existing URLs, don't add > any newly discovered URLs > root@trweb10:/opt/nutch-1.0# bin/nutch updatedb Crawl/db/ Crawl/segments/200909 > 20090906232208/ 20090909074026/ 20090909101115/ 20090909124554/ > 20090914115913/ 20090915141615/ > root@trweb10:/opt/tsweb/nutch-1.0# bin/nutch updatedb Crawl/db/ > Crawl/segments/20090915141615/ -force > Using configuration below > conf_tamilsweb > /opt/tsweb/jdk1.6.0_16 > CrawlDb update: starting > CrawlDb update: db: Crawl/db > CrawlDb update: segments: [Crawl/segments/20090915141615] > CrawlDb update: additions allowed: true > CrawlDb update: URL normalizing: false > CrawlDb update: URL filtering: false > CrawlDb update: Merging segment data into db. > |
|
|
Re: updatedb is talking long long timeCould you dump a stack trace of the process? That would give an idea of
where it is stuck. How large is your crawlDB? Julien -- DigitalPebble Ltd http://www.digitalpebble.com 2009/11/2 Kalaimathan Mahenthiran <mathan55@...> > I forgot to add the detail... > > The segment i'm trying to do updatedb on has 1.3 millions urls fetched > and 1.08 million urls parsed.. > > Any help related to this would be appreciated... > > > On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran > <mathan55@...> wrote: > > hi everyone > > > > I'm using nutch 1.0. I have fetched successfully and currently on the > > updatedb process. I'm doing updatedb and its taking so long. I don't > > know why its taking this long. I have a new machine with quad core > > processor and 8 gb of ram. > > > > I believe this system is really good in terms of processing power. I > > don't think processing power is the problem here. I noticed that all > > the ram is getting using up. close to 7.7gb by the updatedb process. > > The computer is becoming is really slow. > > > > The updatedb process has been running for the last 19 days continually > > with the message merging segment data into db.. Does anyone know why > > its taking so long... Is there any configuration setting i can do to > > increase the speed of the updatedb process... > > > > Thanks in advance for any help... > > Mathan > > > > root@trweb10:/opt/nutch-1.0# bin/nutch updatedb > > Using configuration below > > /opt/****/jdk1.6.0_16 > > Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) > > [-force] [-normalize] [-filter] [-noAdditions] > > crawldb CrawlDb to update > > -dir segments parent directory containing all segments to update > from > > seg1 seg2 ... list of segment names to update from > > -force force update even if CrawlDb appears to be locked > > (CAUTION advised) > > -normalize use URLNormalizer on urls in CrawlDb and > > segment (usually not needed) > > -filter use URLFilters on urls in CrawlDb and segment > > -noAdditions only update already existing URLs, don't add > > any newly discovered URLs > > root@trweb10:/opt/nutch-1.0# bin/nutch updatedb Crawl/db/ > Crawl/segments/200909 > > 20090906232208/ 20090909074026/ 20090909101115/ 20090909124554/ > > 20090914115913/ 20090915141615/ > > root@trweb10:/opt/tsweb/nutch-1.0# bin/nutch updatedb Crawl/db/ > > Crawl/segments/20090915141615/ -force > > Using configuration below > > conf_tamilsweb > > /opt/tsweb/jdk1.6.0_16 > > CrawlDb update: starting > > CrawlDb update: db: Crawl/db > > CrawlDb update: segments: [Crawl/segments/20090915141615] > > CrawlDb update: additions allowed: true > > CrawlDb update: URL normalizing: false > > CrawlDb update: URL filtering: false > > CrawlDb update: Merging segment data into db. > > > |
|
|
Re: updatedb is talking long long timeKalaimathan Mahenthiran wrote:
> I forgot to add the detail... > > The segment i'm trying to do updatedb on has 1.3 millions urls fetched > and 1.08 million urls parsed.. > > Any help related to this would be appreciated... > > > On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran > <mathan55@...> wrote: >> hi everyone >> >> I'm using nutch 1.0. I have fetched successfully and currently on the >> updatedb process. I'm doing updatedb and its taking so long. I don't >> know why its taking this long. I have a new machine with quad core >> processor and 8 gb of ram. >> >> I believe this system is really good in terms of processing power. I >> don't think processing power is the problem here. I noticed that all >> the ram is getting using up. close to 7.7gb by the updatedb process. >> The computer is becoming is really slow. >> >> The updatedb process has been running for the last 19 days continually >> with the message merging segment data into db.. Does anyone know why >> its taking so long... Is there any configuration setting i can do to >> increase the speed of the updatedb process... First, this process normally takes just a few minutes, depending on the hardware, and not several days - so something is wrong. * do you run this in "local" or pseudo-distributed mode (i.e. running a real jobtracker and tasktracker?) Try the pseudo-distributed mode, because then you can monitor the progress in the web UI. * how many reduce tasks do you have? with large updates it helps if you run > 1 reducer, to split the final sorting. * if the task appears to be completely stuck, please generate a thread dump (kill -SIGQUIT) and see where it's stuck. This could be related to urlfilter-regex or urlnormalizer-regex - you can identify if these are problematic by removing them from the config and re-running the operation. * minor issue - when specifying the path names of segments and crawldb, do NOT append the trailing slash - it's not harmful in this particular case, but you could have a nasty surprise when doing e.g. copy / mv operations ... -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com |
|
|
Re: updatedb is talking long long timeThanks for all the replies...
Okay, I think there seems to be some issue too... I'm running nutch out of the box.. using nutch release 1.0... I running this in "local" mode.. The number of reduce tasks.. is the default configured by nutch... The db size is approximately 860 mb.. i know the process is not stuck.. and the process is running because i turned on the hadoop logs and i can see logs being written to it... I'm not sure how to check if the task is completely stuck or not... Below is the sample log as i'm sending this email.... Its been on the updatedb process for the last 19 days and the it has been generating debug logs similar to this........ Has anyone else has this same issue before... 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$FileSystemCounter with bundle 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS 2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0% 2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: record full = true 2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; bufend = 13893589; bufvoid = 99614720 2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; kvend = 65533; length = 327680 2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3 2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map output 2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228) 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246) 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246) 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246) 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246) 2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted segments 2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last merge-pass, with 5 segments left of total size: 1192 bytes 2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358) 2009-11-02 13:34:23,394 INFO mapred.TaskRunner - Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - attempt_local_0001_m_000003_0 Progress/ping thread exiting since it got interrupted 2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$FileSystemCounter with bundle 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS 2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task 'attempt_local_0001_m_000003_0' done. 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 0:0 2009-11-02 13:34:23,397 DEBUG conf.Configuration - java.io.IOException: config(config) at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192) at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to /tmp/hadoop-root/mapred/local/localRunner/split.dta 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - attempt_local_0001_m_000004_0 Progress/ping thread started 2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1 2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100 2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0% Mathan On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <ab@...> wrote: > Kalaimathan Mahenthiran wrote: >> >> I forgot to add the detail... >> >> The segment i'm trying to do updatedb on has 1.3 millions urls fetched >> and 1.08 million urls parsed.. >> >> Any help related to this would be appreciated... >> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran >> <mathan55@...> wrote: >>> >>> hi everyone >>> >>> I'm using nutch 1.0. I have fetched successfully and currently on the >>> updatedb process. I'm doing updatedb and its taking so long. I don't >>> know why its taking this long. I have a new machine with quad core >>> processor and 8 gb of ram. >>> >>> I believe this system is really good in terms of processing power. I >>> don't think processing power is the problem here. I noticed that all >>> the ram is getting using up. close to 7.7gb by the updatedb process. >>> The computer is becoming is really slow. >>> >>> The updatedb process has been running for the last 19 days continually >>> with the message merging segment data into db.. Does anyone know why >>> its taking so long... Is there any configuration setting i can do to >>> increase the speed of the updatedb process... > > First, this process normally takes just a few minutes, depending on the > hardware, and not several days - so something is wrong. > > * do you run this in "local" or pseudo-distributed mode (i.e. running a real > jobtracker and tasktracker?) Try the pseudo-distributed mode, because then > you can monitor the progress in the web UI. > > * how many reduce tasks do you have? with large updates it helps if you run >> 1 reducer, to split the final sorting. > > * if the task appears to be completely stuck, please generate a thread dump > (kill -SIGQUIT) and see where it's stuck. This could be related to > urlfilter-regex or urlnormalizer-regex - you can identify if these are > problematic by removing them from the config and re-running the operation. > > * minor issue - when specifying the path names of segments and crawldb, do > NOT append the trailing slash - it's not harmful in this particular case, > but you could have a nasty surprise when doing e.g. copy / mv operations ... > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __________________________________ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > |
|
|
Re: updatedb is talking long long timeHi again
> i know the process is not stuck.. and the process is running because i > turned on the hadoop logs and i can see logs being written to it... > I'm not sure how to check if the task is completely stuck or not... > run jps to identify the process id then *jstack id* several times to see if it is blocked at the same place how much space do you have left on the partition where /tmp is mounted? J. > Below is the sample log as i'm sending this email.... Its been on the > updatedb process for the last 19 days and the it has been generating > debug logs similar to this........ > > Has anyone else has this same issue before... > > > 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group > org.apache.hadoop.mapred.Task$FileSystemCounter with bundle > 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ > 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE > 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group > org.apache.hadoop.mapred.Task$Counter with bundle > 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding > COMBINE_OUTPUT_RECORDS > 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS > 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES > 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES > 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS > 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > COMBINE_INPUT_RECORDS > 2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0% > 2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: > record full = true > 2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; > bufend = 13893589; bufvoid = 99614720 > 2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; kvend > = 65533; length = 327680 > 2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3 > 2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map output > 2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4 > 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228) > 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246) > 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246) > 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246) > 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246) > 2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted segments > 2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last > merge-pass, with 5 segments left of total size: 1192 bytes > 2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358) > 2009-11-02 13:34:23,394 INFO mapred.TaskRunner - > Task:attempt_local_0001_m_000003_0 is done. And is in the process of > commiting > 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - > attempt_local_0001_m_000003_0 Progress/ping thread exiting since it > got interrupted > 2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - > > file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432 > 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group > org.apache.hadoop.mapred.Task$FileSystemCounter with bundle > 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ > 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE > 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group > org.apache.hadoop.mapred.Task$Counter with bundle > 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > COMBINE_OUTPUT_RECORDS > 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS > 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES > 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES > 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS > 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > COMBINE_INPUT_RECORDS > 2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task > 'attempt_local_0001_m_000003_0' done. > 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 0:0 > 2009-11-02 13:34:23,397 DEBUG conf.Configuration - > java.io.IOException: config(config) > at > org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192) > at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) > > 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to > /tmp/hadoop-root/mapred/local/localRunner/split.dta > 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - > attempt_local_0001_m_000004_0 Progress/ping thread started > 2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1 > 2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100 > 2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0% > > Mathan > On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <ab@...> wrote: > > Kalaimathan Mahenthiran wrote: > >> > >> I forgot to add the detail... > >> > >> The segment i'm trying to do updatedb on has 1.3 millions urls fetched > >> and 1.08 million urls parsed.. > >> > >> Any help related to this would be appreciated... > >> > >> > >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran > >> <mathan55@...> wrote: > >>> > >>> hi everyone > >>> > >>> I'm using nutch 1.0. I have fetched successfully and currently on the > >>> updatedb process. I'm doing updatedb and its taking so long. I don't > >>> know why its taking this long. I have a new machine with quad core > >>> processor and 8 gb of ram. > >>> > >>> I believe this system is really good in terms of processing power. I > >>> don't think processing power is the problem here. I noticed that all > >>> the ram is getting using up. close to 7.7gb by the updatedb process. > >>> The computer is becoming is really slow. > >>> > >>> The updatedb process has been running for the last 19 days continually > >>> with the message merging segment data into db.. Does anyone know why > >>> its taking so long... Is there any configuration setting i can do to > >>> increase the speed of the updatedb process... > > > > First, this process normally takes just a few minutes, depending on the > > hardware, and not several days - so something is wrong. > > > > * do you run this in "local" or pseudo-distributed mode (i.e. running a > real > > jobtracker and tasktracker?) Try the pseudo-distributed mode, because > then > > you can monitor the progress in the web UI. > > > > * how many reduce tasks do you have? with large updates it helps if you > run > >> 1 reducer, to split the final sorting. > > > > * if the task appears to be completely stuck, please generate a thread > dump > > (kill -SIGQUIT) and see where it's stuck. This could be related to > > urlfilter-regex or urlnormalizer-regex - you can identify if these are > > problematic by removing them from the config and re-running the > operation. > > > > * minor issue - when specifying the path names of segments and crawldb, > do > > NOT append the trailing slash - it's not harmful in this particular case, > > but you could have a nasty surprise when doing e.g. copy / mv operations > ... > > > > -- > > Best regards, > > Andrzej Bialecki <>< > > ___. ___ ___ ___ _ _ __________________________________ > > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > > ___|||__|| \| || | Embedded Unix, System Integration > > http://www.sigram.com Contact: info at sigram dot com > > > > > -- DigitalPebble Ltd http://www.digitalpebble.com |
|
|
Re: updatedb is talking long long timeI have lot of space left on the /tmp . I don't have separate partition
for /tmp... i have a folder called /tmp... There is lot of space left.. close to 1.3Terabytes... 1.4T 55G 1.3T 5% / tmpfs 3.8G 0 3.8G 0% /lib/init/rw varrun 3.8G 120K 3.8G 1% /var/run varlock 3.8G 0 3.8G 0% /var/lock udev 3.8G 152K 3.8G 1% /dev tmpfs 3.8G 0 3.8G 0% /dev/shm lrm 3.8G 2.5M 3.8G 1% /lib/modules/2.6.28-15-server/volatile /dev/sda5 228M 29M 187M 14% /boot /dev/sr0 388K 388K 0 100% /media/cdrom0 I also noticed that /tmp/hadoop-root directory is 6.8 Gb... I have attached the jstack of the process that is doing the update.... below 2009-11-02 19:11:54 Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode): "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10 tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition [0x00007f4000e97000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.mapred.Task$1.run(Task.java:403) at java.lang.Thread.run(Thread.java:619) "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable [0x00007f4000f98000] java.lang.Thread.State: RUNNABLE at java.lang.Byte.hashCode(Byte.java:394) at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882) at org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78) - locked <0x00007f47ef4d9310> (a org.apache.hadoop.io.MapWritable) at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128) at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42) at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52) at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321) at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73) at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in Object.wait() [0x00007f40016f7000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00007f400f63e6c0> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) - locked <0x00007f400f63e6c0> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb in Object.wait() [0x00007f40017f8000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:485) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock) "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition [0x00007f49d75c2000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152) at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94) at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150) "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000 nid=0x25c6 runnable "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000 nid=0x25c7 runnable "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000 nid=0x25c8 runnable "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000 nid=0x25c9 runnable "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1 waiting on condition JNI global references: 907 Any help related to this would be really helpful... On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche <lists.digitalpebble@...> wrote: > Hi again > > >> i know the process is not stuck.. and the process is running because i >> turned on the hadoop logs and i can see logs being written to it... >> I'm not sure how to check if the task is completely stuck or not... >> > > run jps to identify the process id then *jstack id* several times to see if > it is blocked at the same place > > how much space do you have left on the partition where /tmp is mounted? > > J. > > > >> Below is the sample log as i'm sending this email.... Its been on the >> updatedb process for the last 19 days and the it has been generating >> debug logs similar to this........ >> >> Has anyone else has this same issue before... >> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group >> org.apache.hadoop.mapred.Task$Counter with bundle >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding >> COMBINE_OUTPUT_RECORDS >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >> COMBINE_INPUT_RECORDS >> 2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0% >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: >> record full = true >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; >> bufend = 13893589; bufvoid = 99614720 >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; kvend >> = 65533; length = 327680 >> 2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3 >> 2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map output >> 2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4 >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228) >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246) >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246) >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246) >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246) >> 2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted segments >> 2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last >> merge-pass, with 5 segments left of total size: 1192 bytes >> 2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358) >> 2009-11-02 13:34:23,394 INFO mapred.TaskRunner - >> Task:attempt_local_0001_m_000003_0 is done. And is in the process of >> commiting >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - >> attempt_local_0001_m_000003_0 Progress/ping thread exiting since it >> got interrupted >> 2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - >> >> file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432 >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group >> org.apache.hadoop.mapred.Task$Counter with bundle >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> COMBINE_OUTPUT_RECORDS >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> COMBINE_INPUT_RECORDS >> 2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task >> 'attempt_local_0001_m_000003_0' done. >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 0:0 >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration - >> java.io.IOException: config(config) >> at >> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192) >> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139) >> at >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) >> >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to >> /tmp/hadoop-root/mapred/local/localRunner/split.dta >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - >> attempt_local_0001_m_000004_0 Progress/ping thread started >> 2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1 >> 2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100 >> 2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0% >> >> Mathan >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <ab@...> wrote: >> > Kalaimathan Mahenthiran wrote: >> >> >> >> I forgot to add the detail... >> >> >> >> The segment i'm trying to do updatedb on has 1.3 millions urls fetched >> >> and 1.08 million urls parsed.. >> >> >> >> Any help related to this would be appreciated... >> >> >> >> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran >> >> <mathan55@...> wrote: >> >>> >> >>> hi everyone >> >>> >> >>> I'm using nutch 1.0. I have fetched successfully and currently on the >> >>> updatedb process. I'm doing updatedb and its taking so long. I don't >> >>> know why its taking this long. I have a new machine with quad core >> >>> processor and 8 gb of ram. >> >>> >> >>> I believe this system is really good in terms of processing power. I >> >>> don't think processing power is the problem here. I noticed that all >> >>> the ram is getting using up. close to 7.7gb by the updatedb process. >> >>> The computer is becoming is really slow. >> >>> >> >>> The updatedb process has been running for the last 19 days continually >> >>> with the message merging segment data into db.. Does anyone know why >> >>> its taking so long... Is there any configuration setting i can do to >> >>> increase the speed of the updatedb process... >> > >> > First, this process normally takes just a few minutes, depending on the >> > hardware, and not several days - so something is wrong. >> > >> > * do you run this in "local" or pseudo-distributed mode (i.e. running a >> real >> > jobtracker and tasktracker?) Try the pseudo-distributed mode, because >> then >> > you can monitor the progress in the web UI. >> > >> > * how many reduce tasks do you have? with large updates it helps if you >> run >> >> 1 reducer, to split the final sorting. >> > >> > * if the task appears to be completely stuck, please generate a thread >> dump >> > (kill -SIGQUIT) and see where it's stuck. This could be related to >> > urlfilter-regex or urlnormalizer-regex - you can identify if these are >> > problematic by removing them from the config and re-running the >> operation. >> > >> > * minor issue - when specifying the path names of segments and crawldb, >> do >> > NOT append the trailing slash - it's not harmful in this particular case, >> > but you could have a nasty surprise when doing e.g. copy / mv operations >> ... >> > >> > -- >> > Best regards, >> > Andrzej Bialecki <>< >> > ___. ___ ___ ___ _ _ __________________________________ >> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web >> > ___|||__|| \| || | Embedded Unix, System Integration >> > http://www.sigram.com Contact: info at sigram dot com >> > >> > >> > > > > -- > DigitalPebble Ltd > http://www.digitalpebble.com > |
|
|
Re: updatedb is talking long long timeOK. What heapsize did you specify for this job? Could it be that you are
running out of ram and GCing a lot? Still it should not take THAT long Can you see some variations in the stacktraces or are they always pointing at the same things? The operations on the metadata take an awful lot of time, which I why I did NUTCH-702, however that does not explain why processing a dataset this size takes 20 days. J. 2009/11/3 Kalaimathan Mahenthiran <mathan55@...> > I have lot of space left on the /tmp . I don't have separate partition > for /tmp... i have a folder called /tmp... There is lot of space > left.. close to 1.3Terabytes... > > 1.4T 55G 1.3T 5% / > tmpfs 3.8G 0 3.8G 0% /lib/init/rw > varrun 3.8G 120K 3.8G 1% /var/run > varlock 3.8G 0 3.8G 0% /var/lock > udev 3.8G 152K 3.8G 1% /dev > tmpfs 3.8G 0 3.8G 0% /dev/shm > lrm 3.8G 2.5M 3.8G 1% > /lib/modules/2.6.28-15-server/volatile > /dev/sda5 228M 29M 187M 14% /boot > /dev/sr0 388K 388K 0 100% /media/cdrom0 > > I also noticed that /tmp/hadoop-root directory is 6.8 Gb... > > I have attached the jstack of the process that is doing the update.... > below > > 2009-11-02 19:11:54 > Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode): > > "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b > waiting on condition [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10 > tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition > [0x00007f4000e97000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.apache.hadoop.mapred.Task$1.run(Task.java:403) > at java.lang.Thread.run(Thread.java:619) > > "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable > [0x00007f4000f98000] > java.lang.Thread.State: RUNNABLE > at java.lang.Byte.hashCode(Byte.java:394) > at > java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882) > at > org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78) > - locked <0x00007f47ef4d9310> (a org.apache.hadoop.io.MapWritable) > at > org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128) > at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42) > at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52) > at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321) > at > org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73) > at > org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) > > "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0 > runnable [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf > waiting on condition [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce > waiting on condition [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd > runnable [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in > Object.wait() [0x00007f40016f7000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x00007f400f63e6c0> (a > java.lang.ref.ReferenceQueue$Lock) > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) > - locked <0x00007f400f63e6c0> (a java.lang.ref.ReferenceQueue$Lock) > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) > at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) > > "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb > in Object.wait() [0x00007f40017f8000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock) > at java.lang.Object.wait(Object.java:485) > at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) > - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock) > > "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition > [0x00007f49d75c2000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152) > at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94) > at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150) > > "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable > > "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000 > nid=0x25c6 runnable > > "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000 > nid=0x25c7 runnable > > "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000 > nid=0x25c8 runnable > > "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000 > nid=0x25c9 runnable > > "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1 > waiting on condition > > JNI global references: 907 > > > > Any help related to this would be really helpful... > > On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche > <lists.digitalpebble@...> wrote: > > Hi again > > > > > >> i know the process is not stuck.. and the process is running because i > >> turned on the hadoop logs and i can see logs being written to it... > >> I'm not sure how to check if the task is completely stuck or not... > >> > > > > run jps to identify the process id then *jstack id* several times to see > if > > it is blocked at the same place > > > > how much space do you have left on the partition where /tmp is mounted? > > > > J. > > > > > > > >> Below is the sample log as i'm sending this email.... Its been on the > >> updatedb process for the last 19 days and the it has been generating > >> debug logs similar to this........ > >> > >> Has anyone else has this same issue before... > >> > >> > >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group > >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle > >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ > >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE > >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group > >> org.apache.hadoop.mapred.Task$Counter with bundle > >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding > >> COMBINE_OUTPUT_RECORDS > >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS > >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES > >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES > >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > MAP_OUTPUT_RECORDS > >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > >> COMBINE_INPUT_RECORDS > >> 2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0% > >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: > >> record full = true > >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; > >> bufend = 13893589; bufvoid = 99614720 > >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; kvend > >> = 65533; length = 327680 > >> 2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3 > >> 2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map > output > >> 2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4 > >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228) > >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246) > >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246) > >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246) > >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246) > >> 2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted segments > >> 2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last > >> merge-pass, with 5 segments left of total size: 1192 bytes > >> 2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358) > >> 2009-11-02 13:34:23,394 INFO mapred.TaskRunner - > >> Task:attempt_local_0001_m_000003_0 is done. And is in the process of > >> commiting > >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - > >> attempt_local_0001_m_000003_0 Progress/ping thread exiting since it > >> got interrupted > >> 2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - > >> > >> > file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432 > >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group > >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle > >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ > >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE > >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group > >> org.apache.hadoop.mapred.Task$Counter with bundle > >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > >> COMBINE_OUTPUT_RECORDS > >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS > >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES > >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES > >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > MAP_OUTPUT_RECORDS > >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > >> COMBINE_INPUT_RECORDS > >> 2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task > >> 'attempt_local_0001_m_000003_0' done. > >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 0:0 > >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration - > >> java.io.IOException: config(config) > >> at > >> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192) > >> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139) > >> at > >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) > >> > >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to > >> /tmp/hadoop-root/mapred/local/localRunner/split.dta > >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - > >> attempt_local_0001_m_000004_0 Progress/ping thread started > >> 2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1 > >> 2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100 > >> 2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0% > >> > >> Mathan > >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <ab@...> wrote: > >> > Kalaimathan Mahenthiran wrote: > >> >> > >> >> I forgot to add the detail... > >> >> > >> >> The segment i'm trying to do updatedb on has 1.3 millions urls > fetched > >> >> and 1.08 million urls parsed.. > >> >> > >> >> Any help related to this would be appreciated... > >> >> > >> >> > >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran > >> >> <mathan55@...> wrote: > >> >>> > >> >>> hi everyone > >> >>> > >> >>> I'm using nutch 1.0. I have fetched successfully and currently on > the > >> >>> updatedb process. I'm doing updatedb and its taking so long. I don't > >> >>> know why its taking this long. I have a new machine with quad core > >> >>> processor and 8 gb of ram. > >> >>> > >> >>> I believe this system is really good in terms of processing power. I > >> >>> don't think processing power is the problem here. I noticed that all > >> >>> the ram is getting using up. close to 7.7gb by the updatedb process. > >> >>> The computer is becoming is really slow. > >> >>> > >> >>> The updatedb process has been running for the last 19 days > continually > >> >>> with the message merging segment data into db.. Does anyone know why > >> >>> its taking so long... Is there any configuration setting i can do to > >> >>> increase the speed of the updatedb process... > >> > > >> > First, this process normally takes just a few minutes, depending on > the > >> > hardware, and not several days - so something is wrong. > >> > > >> > * do you run this in "local" or pseudo-distributed mode (i.e. running > a > >> real > >> > jobtracker and tasktracker?) Try the pseudo-distributed mode, because > >> then > >> > you can monitor the progress in the web UI. > >> > > >> > * how many reduce tasks do you have? with large updates it helps if > you > >> run > >> >> 1 reducer, to split the final sorting. > >> > > >> > * if the task appears to be completely stuck, please generate a thread > >> dump > >> > (kill -SIGQUIT) and see where it's stuck. This could be related to > >> > urlfilter-regex or urlnormalizer-regex - you can identify if these are > >> > problematic by removing them from the config and re-running the > >> operation. > >> > > >> > * minor issue - when specifying the path names of segments and > crawldb, > >> do > >> > NOT append the trailing slash - it's not harmful in this particular > case, > >> > but you could have a nasty surprise when doing e.g. copy / mv > operations > >> ... > >> > > >> > -- > >> > Best regards, > >> > Andrzej Bialecki <>< > >> > ___. ___ ___ ___ _ _ __________________________________ > >> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > >> > ___|||__|| \| || | Embedded Unix, System Integration > >> > http://www.sigram.com Contact: info at sigram dot com > >> > > >> > > >> > > > > > > > > -- > > DigitalPebble Ltd > > http://www.digitalpebble.com > > > -- DigitalPebble Ltd http://www.digitalpebble.com |
|
|
Re: updatedb is talking long long timeI can see that its running out of ram because... before starting
updatedb process i have approximately 7.7gb left on the system and as soon as this starts running for some time.. the ram comes to ~48 bytes... definitely its clogging all the ram space... i specified the heap size to be 9 gb.. in the hadoop-site.xml like below <property> <name>mapred.child.java.opts</name> <value>-Xmx9096m -XX: -UseGCOverheadLimit</value> </propery> I have attached a screenshot of the jconsole view of the updatedb process... From jconsole i can see that cpu is not getting used at all.. its only being used 0.3~.5%. The system i'm using should not be a limitation because its an amd 64bit quad core processor with 8gbs of Ram and 1.5 Terabytes of hard disk space... Thanks again for all the help On Tue, Nov 3, 2009 at 4:15 AM, Julien Nioche <lists.digitalpebble@...> wrote: > OK. What heapsize did you specify for this job? Could it be that you are > running out of ram and GCing a lot? Still it should not take THAT long > Can you see some variations in the stacktraces or are they always pointing > at the same things? > The operations on the metadata take an awful lot of time, which I why I did > NUTCH-702, however that does not explain why processing a dataset this size > takes 20 days. > > J. > > 2009/11/3 Kalaimathan Mahenthiran <mathan55@...> > >> I have lot of space left on the /tmp . I don't have separate partition >> for /tmp... i have a folder called /tmp... There is lot of space >> left.. close to 1.3Terabytes... >> >> 1.4T 55G 1.3T 5% / >> tmpfs 3.8G 0 3.8G 0% /lib/init/rw >> varrun 3.8G 120K 3.8G 1% /var/run >> varlock 3.8G 0 3.8G 0% /var/lock >> udev 3.8G 152K 3.8G 1% /dev >> tmpfs 3.8G 0 3.8G 0% /dev/shm >> lrm 3.8G 2.5M 3.8G 1% >> /lib/modules/2.6.28-15-server/volatile >> /dev/sda5 228M 29M 187M 14% /boot >> /dev/sr0 388K 388K 0 100% /media/cdrom0 >> >> I also noticed that /tmp/hadoop-root directory is 6.8 Gb... >> >> I have attached the jstack of the process that is doing the update.... >> below >> >> 2009-11-02 19:11:54 >> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode): >> >> "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b >> waiting on condition [0x0000000000000000] >> java.lang.Thread.State: RUNNABLE >> >> "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10 >> tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition >> [0x00007f4000e97000] >> java.lang.Thread.State: TIMED_WAITING (sleeping) >> at java.lang.Thread.sleep(Native Method) >> at org.apache.hadoop.mapred.Task$1.run(Task.java:403) >> at java.lang.Thread.run(Thread.java:619) >> >> "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable >> [0x00007f4000f98000] >> java.lang.Thread.State: RUNNABLE >> at java.lang.Byte.hashCode(Byte.java:394) >> at >> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882) >> at >> org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78) >> - locked <0x00007f47ef4d9310> (a org.apache.hadoop.io.MapWritable) >> at >> org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128) >> at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42) >> at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52) >> at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321) >> at >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73) >> at >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35) >> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) >> at >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) >> >> "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0 >> runnable [0x0000000000000000] >> java.lang.Thread.State: RUNNABLE >> >> "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf >> waiting on condition [0x0000000000000000] >> java.lang.Thread.State: RUNNABLE >> >> "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce >> waiting on condition [0x0000000000000000] >> java.lang.Thread.State: RUNNABLE >> >> "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd >> runnable [0x0000000000000000] >> java.lang.Thread.State: RUNNABLE >> >> "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in >> Object.wait() [0x00007f40016f7000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> - waiting on <0x00007f400f63e6c0> (a >> java.lang.ref.ReferenceQueue$Lock) >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) >> - locked <0x00007f400f63e6c0> (a java.lang.ref.ReferenceQueue$Lock) >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) >> at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) >> >> "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb >> in Object.wait() [0x00007f40017f8000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> - waiting on <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock) >> at java.lang.Object.wait(Object.java:485) >> at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) >> - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock) >> >> "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition >> [0x00007f49d75c2000] >> java.lang.Thread.State: TIMED_WAITING (sleeping) >> at java.lang.Thread.sleep(Native Method) >> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152) >> at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94) >> at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189) >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >> at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150) >> >> "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable >> >> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000 >> nid=0x25c6 runnable >> >> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000 >> nid=0x25c7 runnable >> >> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000 >> nid=0x25c8 runnable >> >> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000 >> nid=0x25c9 runnable >> >> "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1 >> waiting on condition >> >> JNI global references: 907 >> >> >> >> Any help related to this would be really helpful... >> >> On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche >> <lists.digitalpebble@...> wrote: >> > Hi again >> > >> > >> >> i know the process is not stuck.. and the process is running because i >> >> turned on the hadoop logs and i can see logs being written to it... >> >> I'm not sure how to check if the task is completely stuck or not... >> >> >> > >> > run jps to identify the process id then *jstack id* several times to see >> if >> > it is blocked at the same place >> > >> > how much space do you have left on the partition where /tmp is mounted? >> > >> > J. >> > >> > >> > >> >> Below is the sample log as i'm sending this email.... Its been on the >> >> updatedb process for the last 19 days and the it has been generating >> >> debug logs similar to this........ >> >> >> >> Has anyone else has this same issue before... >> >> >> >> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group >> >> org.apache.hadoop.mapred.Task$Counter with bundle >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding >> >> COMBINE_OUTPUT_RECORDS >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >> MAP_OUTPUT_RECORDS >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >> >> COMBINE_INPUT_RECORDS >> >> 2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0% >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: >> >> record full = true >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; >> >> bufend = 13893589; bufvoid = 99614720 >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; kvend >> >> = 65533; length = 327680 >> >> 2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3 >> >> 2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map >> output >> >> 2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4 >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228) >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246) >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246) >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246) >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246) >> >> 2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted segments >> >> 2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last >> >> merge-pass, with 5 segments left of total size: 1192 bytes >> >> 2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358) >> >> 2009-11-02 13:34:23,394 INFO mapred.TaskRunner - >> >> Task:attempt_local_0001_m_000003_0 is done. And is in the process of >> >> commiting >> >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - >> >> attempt_local_0001_m_000003_0 Progress/ping thread exiting since it >> >> got interrupted >> >> 2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - >> >> >> >> >> file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432 >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group >> >> org.apache.hadoop.mapred.Task$Counter with bundle >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> >> COMBINE_OUTPUT_RECORDS >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> MAP_OUTPUT_RECORDS >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> >> COMBINE_INPUT_RECORDS >> >> 2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task >> >> 'attempt_local_0001_m_000003_0' done. >> >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 0:0 >> >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration - >> >> java.io.IOException: config(config) >> >> at >> >> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192) >> >> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139) >> >> at >> >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) >> >> >> >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to >> >> /tmp/hadoop-root/mapred/local/localRunner/split.dta >> >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - >> >> attempt_local_0001_m_000004_0 Progress/ping thread started >> >> 2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1 >> >> 2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100 >> >> 2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0% >> >> >> >> Mathan >> >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <ab@...> wrote: >> >> > Kalaimathan Mahenthiran wrote: >> >> >> >> >> >> I forgot to add the detail... >> >> >> >> >> >> The segment i'm trying to do updatedb on has 1.3 millions urls >> fetched >> >> >> and 1.08 million urls parsed.. >> >> >> >> >> >> Any help related to this would be appreciated... >> >> >> >> >> >> >> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran >> >> >> <mathan55@...> wrote: >> >> >>> >> >> >>> hi everyone >> >> >>> >> >> >>> I'm using nutch 1.0. I have fetched successfully and currently on >> the >> >> >>> updatedb process. I'm doing updatedb and its taking so long. I don't >> >> >>> know why its taking this long. I have a new machine with quad core >> >> >>> processor and 8 gb of ram. >> >> >>> >> >> >>> I believe this system is really good in terms of processing power. I >> >> >>> don't think processing power is the problem here. I noticed that all >> >> >>> the ram is getting using up. close to 7.7gb by the updatedb process. >> >> >>> The computer is becoming is really slow. >> >> >>> >> >> >>> The updatedb process has been running for the last 19 days >> continually >> >> >>> with the message merging segment data into db.. Does anyone know why >> >> >>> its taking so long... Is there any configuration setting i can do to >> >> >>> increase the speed of the updatedb process... >> >> > >> >> > First, this process normally takes just a few minutes, depending on >> the >> >> > hardware, and not several days - so something is wrong. >> >> > >> >> > * do you run this in "local" or pseudo-distributed mode (i.e. running >> a >> >> real >> >> > jobtracker and tasktracker?) Try the pseudo-distributed mode, because >> >> then >> >> > you can monitor the progress in the web UI. >> >> > >> >> > * how many reduce tasks do you have? with large updates it helps if >> you >> >> run >> >> >> 1 reducer, to split the final sorting. >> >> > >> >> > * if the task appears to be completely stuck, please generate a thread >> >> dump >> >> > (kill -SIGQUIT) and see where it's stuck. This could be related to >> >> > urlfilter-regex or urlnormalizer-regex - you can identify if these are >> >> > problematic by removing them from the config and re-running the >> >> operation. >> >> > >> >> > * minor issue - when specifying the path names of segments and >> crawldb, >> >> do >> >> > NOT append the trailing slash - it's not harmful in this particular >> case, >> >> > but you could have a nasty surprise when doing e.g. copy / mv >> operations >> >> ... >> >> > >> >> > -- >> >> > Best regards, >> >> > Andrzej Bialecki <>< >> >> > ___. ___ ___ ___ _ _ __________________________________ >> >> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web >> >> > ___|||__|| \| || | Embedded Unix, System Integration >> >> > http://www.sigram.com Contact: info at sigram dot com >> >> > >> >> > >> >> >> > >> > >> > >> > -- >> > DigitalPebble Ltd >> > http://www.digitalpebble.com >> > >> > > > > -- > DigitalPebble Ltd > http://www.digitalpebble.com > |
|
|
Re: updatedb is talking long long timeOK. Try reparsing and set a lower value to *db.max.outlinks.per.page*. I am
pretty sure that you are running out of memory because of the inlinks which are stored in RAM. Applying the patch NUTCH-702 would also help. I have modified the CrawlDBReducer and added another parameter *db .fetch.links.max* : * switch (datum.getStatus()) { // collect other info case CrawlDatum.STATUS_LINKED: if (maxLinks!=-1 && linked.size()>= maxLinks) break; * where maxLinks is a variable which I initialize from the configure() method * maxLinks = job.getInt("db.fetch.links.max", -1);* I have not tried *db.max.outlinks.per.page *at all but am pretty sure that * db.fetch.links.max* works fine. There is also a parameter *db.max.inlinks* but it affects only the LinkDBMerger Let us know if that fixes the problem Julien -- DigitalPebble Ltd http://www.digitalpebble.com 2009/11/3 Kalaimathan Mahenthiran <mathan55@...> > I can see that its running out of ram because... before starting > updatedb process i have approximately 7.7gb left on the system and as > soon as this starts running for some time.. the ram comes to ~48 > bytes... > > definitely its clogging all the ram space... > > i specified the heap size to be 9 gb.. in the hadoop-site.xml like below > <property> > <name>mapred.child.java.opts</name> > <value>-Xmx9096m -XX: -UseGCOverheadLimit</value> > </propery> > > I have attached a screenshot of the jconsole view of the updatedb > process... > From jconsole i can see that cpu is not getting used at all.. its only > being used 0.3~.5%. > > The system i'm using should not be a limitation because its an amd > 64bit quad core processor with 8gbs of Ram and 1.5 Terabytes of hard > disk space... > > Thanks again for all the help > > On Tue, Nov 3, 2009 at 4:15 AM, Julien Nioche > <lists.digitalpebble@...> wrote: > > OK. What heapsize did you specify for this job? Could it be that you are > > running out of ram and GCing a lot? Still it should not take THAT long > > Can you see some variations in the stacktraces or are they always > pointing > > at the same things? > > The operations on the metadata take an awful lot of time, which I why I > did > > NUTCH-702, however that does not explain why processing a dataset this > size > > takes 20 days. > > > > J. > > > > 2009/11/3 Kalaimathan Mahenthiran <mathan55@...> > > > >> I have lot of space left on the /tmp . I don't have separate partition > >> for /tmp... i have a folder called /tmp... There is lot of space > >> left.. close to 1.3Terabytes... > >> > >> 1.4T 55G 1.3T 5% / > >> tmpfs 3.8G 0 3.8G 0% /lib/init/rw > >> varrun 3.8G 120K 3.8G 1% /var/run > >> varlock 3.8G 0 3.8G 0% /var/lock > >> udev 3.8G 152K 3.8G 1% /dev > >> tmpfs 3.8G 0 3.8G 0% /dev/shm > >> lrm 3.8G 2.5M 3.8G 1% > >> /lib/modules/2.6.28-15-server/volatile > >> /dev/sda5 228M 29M 187M 14% /boot > >> /dev/sr0 388K 388K 0 100% /media/cdrom0 > >> > >> I also noticed that /tmp/hadoop-root directory is 6.8 Gb... > >> > >> I have attached the jstack of the process that is doing the update.... > >> below > >> > >> 2009-11-02 19:11:54 > >> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed > mode): > >> > >> "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b > >> waiting on condition [0x0000000000000000] > >> java.lang.Thread.State: RUNNABLE > >> > >> "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10 > >> tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition > >> [0x00007f4000e97000] > >> java.lang.Thread.State: TIMED_WAITING (sleeping) > >> at java.lang.Thread.sleep(Native Method) > >> at org.apache.hadoop.mapred.Task$1.run(Task.java:403) > >> at java.lang.Thread.run(Thread.java:619) > >> > >> "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable > >> [0x00007f4000f98000] > >> java.lang.Thread.State: RUNNABLE > >> at java.lang.Byte.hashCode(Byte.java:394) > >> at > >> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882) > >> at > >> > org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78) > >> - locked <0x00007f47ef4d9310> (a > org.apache.hadoop.io.MapWritable) > >> at > >> > org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128) > >> at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42) > >> at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52) > >> at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321) > >> at > >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73) > >> at > >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35) > >> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) > >> at > >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) > >> > >> "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0 > >> runnable [0x0000000000000000] > >> java.lang.Thread.State: RUNNABLE > >> > >> "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf > >> waiting on condition [0x0000000000000000] > >> java.lang.Thread.State: RUNNABLE > >> > >> "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce > >> waiting on condition [0x0000000000000000] > >> java.lang.Thread.State: RUNNABLE > >> > >> "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd > >> runnable [0x0000000000000000] > >> java.lang.Thread.State: RUNNABLE > >> > >> "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in > >> Object.wait() [0x00007f40016f7000] > >> java.lang.Thread.State: WAITING (on object monitor) > >> at java.lang.Object.wait(Native Method) > >> - waiting on <0x00007f400f63e6c0> (a > >> java.lang.ref.ReferenceQueue$Lock) > >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) > >> - locked <0x00007f400f63e6c0> (a > java.lang.ref.ReferenceQueue$Lock) > >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) > >> at > java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) > >> > >> "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb > >> in Object.wait() [0x00007f40017f8000] > >> java.lang.Thread.State: WAITING (on object monitor) > >> at java.lang.Object.wait(Native Method) > >> - waiting on <0x00007f400f63e6f8> (a > java.lang.ref.Reference$Lock) > >> at java.lang.Object.wait(Object.java:485) > >> at > java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) > >> - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock) > >> > >> "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition > >> [0x00007f49d75c2000] > >> java.lang.Thread.State: TIMED_WAITING (sleeping) > >> at java.lang.Thread.sleep(Native Method) > >> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152) > >> at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94) > >> at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189) > >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > >> at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150) > >> > >> "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable > >> > >> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000 > >> nid=0x25c6 runnable > >> > >> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000 > >> nid=0x25c7 runnable > >> > >> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000 > >> nid=0x25c8 runnable > >> > >> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000 > >> nid=0x25c9 runnable > >> > >> "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1 > >> waiting on condition > >> > >> JNI global references: 907 > >> > >> > >> > >> Any help related to this would be really helpful... > >> > >> On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche > >> <lists.digitalpebble@...> wrote: > >> > Hi again > >> > > >> > > >> >> i know the process is not stuck.. and the process is running because > i > >> >> turned on the hadoop logs and i can see logs being written to it... > >> >> I'm not sure how to check if the task is completely stuck or not... > >> >> > >> > > >> > run jps to identify the process id then *jstack id* several times to > see > >> if > >> > it is blocked at the same place > >> > > >> > how much space do you have left on the partition where /tmp is > mounted? > >> > > >> > J. > >> > > >> > > >> > > >> >> Below is the sample log as i'm sending this email.... Its been on the > >> >> updatedb process for the last 19 days and the it has been generating > >> >> debug logs similar to this........ > >> >> > >> >> Has anyone else has this same issue before... > >> >> > >> >> > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group > >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group > >> >> org.apache.hadoop.mapred.Task$Counter with bundle > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding > >> >> COMBINE_OUTPUT_RECORDS > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding > MAP_INPUT_RECORDS > >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > MAP_OUTPUT_BYTES > >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > MAP_INPUT_BYTES > >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > >> MAP_OUTPUT_RECORDS > >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > >> >> COMBINE_INPUT_RECORDS > >> >> 2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0% > >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: > >> >> record full = true > >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; > >> >> bufend = 13893589; bufvoid = 99614720 > >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; > kvend > >> >> = 65533; length = 327680 > >> >> 2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3 > >> >> 2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map > >> output > >> >> 2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4 > >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228) > >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246) > >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246) > >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246) > >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246) > >> >> 2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted > segments > >> >> 2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last > >> >> merge-pass, with 5 segments left of total size: 1192 bytes > >> >> 2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358) > >> >> 2009-11-02 13:34:23,394 INFO mapred.TaskRunner - > >> >> Task:attempt_local_0001_m_000003_0 is done. And is in the process of > >> >> commiting > >> >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - > >> >> attempt_local_0001_m_000003_0 Progress/ping thread exiting since it > >> >> got interrupted > >> >> 2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - > >> >> > >> >> > >> > file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432 > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group > >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group > >> >> org.apache.hadoop.mapred.Task$Counter with bundle > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > >> >> COMBINE_OUTPUT_RECORDS > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > MAP_INPUT_RECORDS > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > MAP_OUTPUT_BYTES > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > MAP_INPUT_BYTES > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > >> MAP_OUTPUT_RECORDS > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > >> >> COMBINE_INPUT_RECORDS > >> >> 2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task > >> >> 'attempt_local_0001_m_000003_0' done. > >> >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 > 0:0 > >> >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration - > >> >> java.io.IOException: config(config) > >> >> at > >> >> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192) > >> >> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139) > >> >> at > >> >> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) > >> >> > >> >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to > >> >> /tmp/hadoop-root/mapred/local/localRunner/split.dta > >> >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - > >> >> attempt_local_0001_m_000004_0 Progress/ping thread started > >> >> 2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1 > >> >> 2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100 > >> >> 2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0% > >> >> > >> >> Mathan > >> >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <ab@...> > wrote: > >> >> > Kalaimathan Mahenthiran wrote: > >> >> >> > >> >> >> I forgot to add the detail... > >> >> >> > >> >> >> The segment i'm trying to do updatedb on has 1.3 millions urls > >> fetched > >> >> >> and 1.08 million urls parsed.. > >> >> >> > >> >> >> Any help related to this would be appreciated... > >> >> >> > >> >> >> > >> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran > >> >> >> <mathan55@...> wrote: > >> >> >>> > >> >> >>> hi everyone > >> >> >>> > >> >> >>> I'm using nutch 1.0. I have fetched successfully and currently on > >> the > >> >> >>> updatedb process. I'm doing updatedb and its taking so long. I > don't > >> >> >>> know why its taking this long. I have a new machine with quad > core > >> >> >>> processor and 8 gb of ram. > >> >> >>> > >> >> >>> I believe this system is really good in terms of processing > power. I > >> >> >>> don't think processing power is the problem here. I noticed that > all > >> >> >>> the ram is getting using up. close to 7.7gb by the updatedb > process. > >> >> >>> The computer is becoming is really slow. > >> >> >>> > >> >> >>> The updatedb process has been running for the last 19 days > >> continually > >> >> >>> with the message merging segment data into db.. Does anyone know > why > >> >> >>> its taking so long... Is there any configuration setting i can do > to > >> >> >>> increase the speed of the updatedb process... > >> >> > > >> >> > First, this process normally takes just a few minutes, depending on > >> the > >> >> > hardware, and not several days - so something is wrong. > >> >> > > >> >> > * do you run this in "local" or pseudo-distributed mode (i.e. > running > >> a > >> >> real > >> >> > jobtracker and tasktracker?) Try the pseudo-distributed mode, > because > >> >> then > >> >> > you can monitor the progress in the web UI. > >> >> > > >> >> > * how many reduce tasks do you have? with large updates it helps if > >> you > >> >> run > >> >> >> 1 reducer, to split the final sorting. > >> >> > > >> >> > * if the task appears to be completely stuck, please generate a > thread > >> >> dump > >> >> > (kill -SIGQUIT) and see where it's stuck. This could be related to > >> >> > urlfilter-regex or urlnormalizer-regex - you can identify if these > are > >> >> > problematic by removing them from the config and re-running the > >> >> operation. > >> >> > > >> >> > * minor issue - when specifying the path names of segments and > >> crawldb, > >> >> do > >> >> > NOT append the trailing slash - it's not harmful in this particular > >> case, > >> >> > but you could have a nasty surprise when doing e.g. copy / mv > >> operations > >> >> ... > >> >> > > >> >> > -- > >> >> > Best regards, > >> >> > Andrzej Bialecki <>< > >> >> > ___. ___ ___ ___ _ _ __________________________________ > >> >> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > >> >> > ___|||__|| \| || | Embedded Unix, System Integration > >> >> > http://www.sigram.com Contact: info at sigram dot com > >> >> > > >> >> > > >> >> > >> > > >> > > >> > > >> > -- > >> > DigitalPebble Ltd > >> > http://www.digitalpebble.com > >> > > >> > > > > > > > > -- > > DigitalPebble Ltd > > http://www.digitalpebble.com > > > |
|
|
Re: updatedb is talking long long timeHello Kalaimathan,
Any luck with your updateDB? I would be curious to know if the tricks I suggested worked. J. -- DigitalPebble Ltd http://www.digitalpebble.com 2009/11/3 Julien Nioche <lists.digitalpebble@...> > OK. Try reparsing and set a lower value to *db.max.outlinks.per.page*. I > am pretty sure that you are running out of memory because of the inlinks > which are stored in RAM. > Applying the patch NUTCH-702 would also help. > > I have modified the CrawlDBReducer and added another parameter *db > .fetch.links.max* : > > * switch (datum.getStatus()) { // collect other info > case CrawlDatum.STATUS_LINKED: > if (maxLinks!=-1 && linked.size()>= maxLinks) break; > * > where maxLinks is a variable which I initialize from the configure() method > > * maxLinks = job.getInt("db.fetch.links.max", -1);* > > I have not tried *db.max.outlinks.per.page *at all but am pretty sure that > *db.fetch.links.max* works fine. > > There is also a parameter *db.max.inlinks* but it affects only the > LinkDBMerger > > Let us know if that fixes the problem > > Julien > -- > DigitalPebble Ltd > http://www.digitalpebble.com > > > 2009/11/3 Kalaimathan Mahenthiran <mathan55@...> > >> I can see that its running out of ram because... before starting >> updatedb process i have approximately 7.7gb left on the system and as >> soon as this starts running for some time.. the ram comes to ~48 >> bytes... >> >> definitely its clogging all the ram space... >> >> i specified the heap size to be 9 gb.. in the hadoop-site.xml like below >> <property> >> <name>mapred.child.java.opts</name> >> <value>-Xmx9096m -XX: -UseGCOverheadLimit</value> >> </propery> >> >> I have attached a screenshot of the jconsole view of the updatedb >> process... >> From jconsole i can see that cpu is not getting used at all.. its only >> being used 0.3~.5%. >> >> The system i'm using should not be a limitation because its an amd >> 64bit quad core processor with 8gbs of Ram and 1.5 Terabytes of hard >> disk space... >> >> Thanks again for all the help >> >> On Tue, Nov 3, 2009 at 4:15 AM, Julien Nioche >> <lists.digitalpebble@...> wrote: >> > OK. What heapsize did you specify for this job? Could it be that you are >> > running out of ram and GCing a lot? Still it should not take THAT long >> > Can you see some variations in the stacktraces or are they always >> pointing >> > at the same things? >> > The operations on the metadata take an awful lot of time, which I why I >> did >> > NUTCH-702, however that does not explain why processing a dataset this >> size >> > takes 20 days. >> > >> > J. >> > >> > 2009/11/3 Kalaimathan Mahenthiran <mathan55@...> >> > >> >> I have lot of space left on the /tmp . I don't have separate partition >> >> for /tmp... i have a folder called /tmp... There is lot of space >> >> left.. close to 1.3Terabytes... >> >> >> >> 1.4T 55G 1.3T 5% / >> >> tmpfs 3.8G 0 3.8G 0% /lib/init/rw >> >> varrun 3.8G 120K 3.8G 1% /var/run >> >> varlock 3.8G 0 3.8G 0% /var/lock >> >> udev 3.8G 152K 3.8G 1% /dev >> >> tmpfs 3.8G 0 3.8G 0% /dev/shm >> >> lrm 3.8G 2.5M 3.8G 1% >> >> /lib/modules/2.6.28-15-server/volatile >> >> /dev/sda5 228M 29M 187M 14% /boot >> >> /dev/sr0 388K 388K 0 100% /media/cdrom0 >> >> >> >> I also noticed that /tmp/hadoop-root directory is 6.8 Gb... >> >> >> >> I have attached the jstack of the process that is doing the update.... >> >> below >> >> >> >> 2009-11-02 19:11:54 >> >> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed >> mode): >> >> >> >> "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b >> >> waiting on condition [0x0000000000000000] >> >> java.lang.Thread.State: RUNNABLE >> >> >> >> "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10 >> >> tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition >> >> [0x00007f4000e97000] >> >> java.lang.Thread.State: TIMED_WAITING (sleeping) >> >> at java.lang.Thread.sleep(Native Method) >> >> at org.apache.hadoop.mapred.Task$1.run(Task.java:403) >> >> at java.lang.Thread.run(Thread.java:619) >> >> >> >> "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable >> >> [0x00007f4000f98000] >> >> java.lang.Thread.State: RUNNABLE >> >> at java.lang.Byte.hashCode(Byte.java:394) >> >> at >> >> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882) >> >> at >> >> >> org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78) >> >> - locked <0x00007f47ef4d9310> (a >> org.apache.hadoop.io.MapWritable) >> >> at >> >> >> org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128) >> >> at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42) >> >> at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52) >> >> at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321) >> >> at >> >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73) >> >> at >> >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35) >> >> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) >> >> at >> >> >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) >> >> >> >> "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0 >> >> runnable [0x0000000000000000] >> >> java.lang.Thread.State: RUNNABLE >> >> >> >> "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf >> >> waiting on condition [0x0000000000000000] >> >> java.lang.Thread.State: RUNNABLE >> >> >> >> "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce >> >> waiting on condition [0x0000000000000000] >> >> java.lang.Thread.State: RUNNABLE >> >> >> >> "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd >> >> runnable [0x0000000000000000] >> >> java.lang.Thread.State: RUNNABLE >> >> >> >> "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in >> >> Object.wait() [0x00007f40016f7000] >> >> java.lang.Thread.State: WAITING (on object monitor) >> >> at java.lang.Object.wait(Native Method) >> >> - waiting on <0x00007f400f63e6c0> (a >> >> java.lang.ref.ReferenceQueue$Lock) >> >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) >> >> - locked <0x00007f400f63e6c0> (a >> java.lang.ref.ReferenceQueue$Lock) >> >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) >> >> at >> java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) >> >> >> >> "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb >> >> in Object.wait() [0x00007f40017f8000] >> >> java.lang.Thread.State: WAITING (on object monitor) >> >> at java.lang.Object.wait(Native Method) >> >> - waiting on <0x00007f400f63e6f8> (a >> java.lang.ref.Reference$Lock) >> >> at java.lang.Object.wait(Object.java:485) >> >> at >> java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) >> >> - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock) >> >> >> >> "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition >> >> [0x00007f49d75c2000] >> >> java.lang.Thread.State: TIMED_WAITING (sleeping) >> >> at java.lang.Thread.sleep(Native Method) >> >> at >> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152) >> >> at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94) >> >> at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189) >> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >> >> at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150) >> >> >> >> "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable >> >> >> >> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000 >> >> nid=0x25c6 runnable >> >> >> >> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000 >> >> nid=0x25c7 runnable >> >> >> >> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000 >> >> nid=0x25c8 runnable >> >> >> >> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000 >> >> nid=0x25c9 runnable >> >> >> >> "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1 >> >> waiting on condition >> >> >> >> JNI global references: 907 >> >> >> >> >> >> >> >> Any help related to this would be really helpful... >> >> >> >> On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche >> >> <lists.digitalpebble@...> wrote: >> >> > Hi again >> >> > >> >> > >> >> >> i know the process is not stuck.. and the process is running because >> i >> >> >> turned on the hadoop logs and i can see logs being written to it... >> >> >> I'm not sure how to check if the task is completely stuck or not... >> >> >> >> >> > >> >> > run jps to identify the process id then *jstack id* several times to >> see >> >> if >> >> > it is blocked at the same place >> >> > >> >> > how much space do you have left on the partition where /tmp is >> mounted? >> >> > >> >> > J. >> >> > >> >> > >> >> > >> >> >> Below is the sample log as i'm sending this email.... Its been on >> the >> >> >> updatedb process for the last 19 days and the it has been generating >> >> >> debug logs similar to this........ >> >> >> >> >> >> Has anyone else has this same issue before... >> >> >> >> >> >> >> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group >> >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle >> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ >> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE >> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group >> >> >> org.apache.hadoop.mapred.Task$Counter with bundle >> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding >> >> >> COMBINE_OUTPUT_RECORDS >> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding >> MAP_INPUT_RECORDS >> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >> MAP_OUTPUT_BYTES >> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >> MAP_INPUT_BYTES >> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >> >> MAP_OUTPUT_RECORDS >> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >> >> >> COMBINE_INPUT_RECORDS >> >> >> 2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0% >> >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: >> >> >> record full = true >> >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; >> >> >> bufend = 13893589; bufvoid = 99614720 >> >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; >> kvend >> >> >> = 65533; length = 327680 >> >> >> 2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3 >> >> >> 2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map >> >> output >> >> >> 2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4 >> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228) >> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246) >> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246) >> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246) >> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246) >> >> >> 2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted >> segments >> >> >> 2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last >> >> >> merge-pass, with 5 segments left of total size: 1192 bytes >> >> >> 2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358) >> >> >> 2009-11-02 13:34:23,394 INFO mapred.TaskRunner - >> >> >> Task:attempt_local_0001_m_000003_0 is done. And is in the process of >> >> >> commiting >> >> >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - >> >> >> attempt_local_0001_m_000003_0 Progress/ping thread exiting since it >> >> >> got interrupted >> >> >> 2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - >> >> >> >> >> >> >> >> >> file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432 >> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group >> >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle >> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ >> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE >> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group >> >> >> org.apache.hadoop.mapred.Task$Counter with bundle >> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> >> >> COMBINE_OUTPUT_RECORDS >> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> MAP_INPUT_RECORDS >> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> MAP_OUTPUT_BYTES >> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> MAP_INPUT_BYTES >> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> >> MAP_OUTPUT_RECORDS >> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >> >> >> COMBINE_INPUT_RECORDS >> >> >> 2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task >> >> >> 'attempt_local_0001_m_000003_0' done. >> >> >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 >> 0:0 >> >> >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration - >> >> >> java.io.IOException: config(config) >> >> >> at >> >> >> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192) >> >> >> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139) >> >> >> at >> >> >> >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) >> >> >> >> >> >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split >> to >> >> >> /tmp/hadoop-root/mapred/local/localRunner/split.dta >> >> >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - >> >> >> attempt_local_0001_m_000004_0 Progress/ping thread started >> >> >> 2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1 >> >> >> 2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100 >> >> >> 2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0% >> >> >> >> >> >> Mathan >> >> >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <ab@...> >> wrote: >> >> >> > Kalaimathan Mahenthiran wrote: >> >> >> >> >> >> >> >> I forgot to add the detail... >> >> >> >> >> >> >> >> The segment i'm trying to do updatedb on has 1.3 millions urls >> >> fetched >> >> >> >> and 1.08 million urls parsed.. >> >> >> >> >> >> >> >> Any help related to this would be appreciated... >> >> >> >> >> >> >> >> >> >> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran >> >> >> >> <mathan55@...> wrote: >> >> >> >>> >> >> >> >>> hi everyone >> >> >> >>> >> >> >> >>> I'm using nutch 1.0. I have fetched successfully and currently >> on >> >> the >> >> >> >>> updatedb process. I'm doing updatedb and its taking so long. I >> don't >> >> >> >>> know why its taking this long. I have a new machine with quad >> core >> >> >> >>> processor and 8 gb of ram. >> >> >> >>> >> >> >> >>> I believe this system is really good in terms of processing >> power. I >> >> >> >>> don't think processing power is the problem here. I noticed that >> all >> >> >> >>> the ram is getting using up. close to 7.7gb by the updatedb >> process. >> >> >> >>> The computer is becoming is really slow. >> >> >> >>> >> >> >> >>> The updatedb process has been running for the last 19 days >> >> continually >> >> >> >>> with the message merging segment data into db.. Does anyone know >> why >> >> >> >>> its taking so long... Is there any configuration setting i can >> do to >> >> >> >>> increase the speed of the updatedb process... >> >> >> > >> >> >> > First, this process normally takes just a few minutes, depending >> on >> >> the >> >> >> > hardware, and not several days - so something is wrong. >> >> >> > >> >> >> > * do you run this in "local" or pseudo-distributed mode (i.e. >> running >> >> a >> >> >> real >> >> >> > jobtracker and tasktracker?) Try the pseudo-distributed mode, >> because >> >> >> then >> >> >> > you can monitor the progress in the web UI. >> >> >> > >> >> >> > * how many reduce tasks do you have? with large updates it helps >> if >> >> you >> >> >> run >> >> >> >> 1 reducer, to split the final sorting. >> >> >> > >> >> >> > * if the task appears to be completely stuck, please generate a >> thread >> >> >> dump >> >> >> > (kill -SIGQUIT) and see where it's stuck. This could be related to >> >> >> > urlfilter-regex or urlnormalizer-regex - you can identify if these >> are >> >> >> > problematic by removing them from the config and re-running the >> >> >> operation. >> >> >> > >> >> >> > * minor issue - when specifying the path names of segments and >> >> crawldb, >> >> >> do >> >> >> > NOT append the trailing slash - it's not harmful in this >> particular >> >> case, >> >> >> > but you could have a nasty surprise when doing e.g. copy / mv >> >> operations >> >> >> ... >> >> >> > >> >> >> > -- >> >> >> > Best regards, >> >> >> > Andrzej Bialecki <>< >> >> >> > ___. ___ ___ ___ _ _ __________________________________ >> >> >> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web >> >> >> > ___|||__|| \| || | Embedded Unix, System Integration >> >> >> > http://www.sigram.com Contact: info at sigram dot com >> >> >> > >> >> >> > >> >> >> >> >> > >> >> > >> >> > >> >> > -- >> >> > DigitalPebble Ltd >> >> > http://www.digitalpebble.com >> >> > >> >> >> > >> > >> > >> > -- >> > DigitalPebble Ltd >> > http://www.digitalpebble.com >> > >> > > > |
|
|
Re: updatedb is talking long long timeHi
I have tried ur suggestion of lowering db.max.outlinks.per.page to a smaller number. I could not reparse the segment as the segment was already parsed... I tried modifying some other variables such as java_heap memory and mapreduce_child_opts values... modifying these values triggered some exceptions. Therefore i have generated a new segment (considering maybe something is wrong with the previous segment). and redoing the fetching process. Once this is complete then i will try to do updatedb again and see if that works... Mathan On Fri, Nov 6, 2009 at 5:40 AM, Julien Nioche <lists.digitalpebble@...> wrote: > Hello Kalaimathan, > > Any luck with your updateDB? I would be curious to know if the tricks I > suggested worked. > > J. > -- > DigitalPebble Ltd > http://www.digitalpebble.com > > 2009/11/3 Julien Nioche <lists.digitalpebble@...> > >> OK. Try reparsing and set a lower value to *db.max.outlinks.per.page*. I >> am pretty sure that you are running out of memory because of the inlinks >> which are stored in RAM. >> Applying the patch NUTCH-702 would also help. >> >> I have modified the CrawlDBReducer and added another parameter *db >> .fetch.links.max* : >> >> * switch (datum.getStatus()) { // collect other info >> case CrawlDatum.STATUS_LINKED: >> if (maxLinks!=-1 && linked.size()>= maxLinks) break; >> * >> where maxLinks is a variable which I initialize from the configure() method >> >> * maxLinks = job.getInt("db.fetch.links.max", -1);* >> >> I have not tried *db.max.outlinks.per.page *at all but am pretty sure that >> *db.fetch.links.max* works fine. >> >> There is also a parameter *db.max.inlinks* but it affects only the >> LinkDBMerger >> >> Let us know if that fixes the problem >> >> Julien >> -- >> DigitalPebble Ltd >> http://www.digitalpebble.com >> >> >> 2009/11/3 Kalaimathan Mahenthiran <mathan55@...> >> >>> I can see that its running out of ram because... before starting >>> updatedb process i have approximately 7.7gb left on the system and as >>> soon as this starts running for some time.. the ram comes to ~48 >>> bytes... >>> >>> definitely its clogging all the ram space... >>> >>> i specified the heap size to be 9 gb.. in the hadoop-site.xml like below >>> <property> >>> <name>mapred.child.java.opts</name> >>> <value>-Xmx9096m -XX: -UseGCOverheadLimit</value> >>> </propery> >>> >>> I have attached a screenshot of the jconsole view of the updatedb >>> process... >>> From jconsole i can see that cpu is not getting used at all.. its only >>> being used 0.3~.5%. >>> >>> The system i'm using should not be a limitation because its an amd >>> 64bit quad core processor with 8gbs of Ram and 1.5 Terabytes of hard >>> disk space... >>> >>> Thanks again for all the help >>> >>> On Tue, Nov 3, 2009 at 4:15 AM, Julien Nioche >>> <lists.digitalpebble@...> wrote: >>> > OK. What heapsize did you specify for this job? Could it be that you are >>> > running out of ram and GCing a lot? Still it should not take THAT long >>> > Can you see some variations in the stacktraces or are they always >>> pointing >>> > at the same things? >>> > The operations on the metadata take an awful lot of time, which I why I >>> did >>> > NUTCH-702, however that does not explain why processing a dataset this >>> size >>> > takes 20 days. >>> > >>> > J. >>> > >>> > 2009/11/3 Kalaimathan Mahenthiran <mathan55@...> >>> > >>> >> I have lot of space left on the /tmp . I don't have separate partition >>> >> for /tmp... i have a folder called /tmp... There is lot of space >>> >> left.. close to 1.3Terabytes... >>> >> >>> >> 1.4T 55G 1.3T 5% / >>> >> tmpfs 3.8G 0 3.8G 0% /lib/init/rw >>> >> varrun 3.8G 120K 3.8G 1% /var/run >>> >> varlock 3.8G 0 3.8G 0% /var/lock >>> >> udev 3.8G 152K 3.8G 1% /dev >>> >> tmpfs 3.8G 0 3.8G 0% /dev/shm >>> >> lrm 3.8G 2.5M 3.8G 1% >>> >> /lib/modules/2.6.28-15-server/volatile >>> >> /dev/sda5 228M 29M 187M 14% /boot >>> >> /dev/sr0 388K 388K 0 100% /media/cdrom0 >>> >> >>> >> I also noticed that /tmp/hadoop-root directory is 6.8 Gb... >>> >> >>> >> I have attached the jstack of the process that is doing the update.... >>> >> below >>> >> >>> >> 2009-11-02 19:11:54 >>> >> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed >>> mode): >>> >> >>> >> "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b >>> >> waiting on condition [0x0000000000000000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> >>> >> "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10 >>> >> tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition >>> >> [0x00007f4000e97000] >>> >> java.lang.Thread.State: TIMED_WAITING (sleeping) >>> >> at java.lang.Thread.sleep(Native Method) >>> >> at org.apache.hadoop.mapred.Task$1.run(Task.java:403) >>> >> at java.lang.Thread.run(Thread.java:619) >>> >> >>> >> "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable >>> >> [0x00007f4000f98000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> at java.lang.Byte.hashCode(Byte.java:394) >>> >> at >>> >> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882) >>> >> at >>> >> >>> org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78) >>> >> - locked <0x00007f47ef4d9310> (a >>> org.apache.hadoop.io.MapWritable) >>> >> at >>> >> >>> org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128) >>> >> at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42) >>> >> at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52) >>> >> at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321) >>> >> at >>> >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73) >>> >> at >>> >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35) >>> >> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) >>> >> at >>> >> >>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) >>> >> >>> >> "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0 >>> >> runnable [0x0000000000000000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> >>> >> "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf >>> >> waiting on condition [0x0000000000000000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> >>> >> "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce >>> >> waiting on condition [0x0000000000000000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> >>> >> "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd >>> >> runnable [0x0000000000000000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> >>> >> "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in >>> >> Object.wait() [0x00007f40016f7000] >>> >> java.lang.Thread.State: WAITING (on object monitor) >>> >> at java.lang.Object.wait(Native Method) >>> >> - waiting on <0x00007f400f63e6c0> (a >>> >> java.lang.ref.ReferenceQueue$Lock) >>> >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) >>> >> - locked <0x00007f400f63e6c0> (a >>> java.lang.ref.ReferenceQueue$Lock) >>> >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) >>> >> at >>> java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) >>> >> >>> >> "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb >>> >> in Object.wait() [0x00007f40017f8000] >>> >> java.lang.Thread.State: WAITING (on object monitor) >>> >> at java.lang.Object.wait(Native Method) >>> >> - waiting on <0x00007f400f63e6f8> (a >>> java.lang.ref.Reference$Lock) >>> >> at java.lang.Object.wait(Object.java:485) >>> >> at >>> java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) >>> >> - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock) >>> >> >>> >> "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition >>> >> [0x00007f49d75c2000] >>> >> java.lang.Thread.State: TIMED_WAITING (sleeping) >>> >> at java.lang.Thread.sleep(Native Method) >>> >> at >>> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152) >>> >> at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94) >>> >> at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189) >>> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >>> >> at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150) >>> >> >>> >> "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable >>> >> >>> >> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000 >>> >> nid=0x25c6 runnable >>> >> >>> >> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000 >>> >> nid=0x25c7 runnable >>> >> >>> >> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000 >>> >> nid=0x25c8 runnable >>> >> >>> >> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000 >>> >> nid=0x25c9 runnable >>> >> >>> >> "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1 >>> >> waiting on condition >>> >> >>> >> JNI global references: 907 >>> >> >>> >> >>> >> >>> >> Any help related to this would be really helpful... >>> >> >>> >> On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche >>> >> <lists.digitalpebble@...> wrote: >>> >> > Hi again >>> >> > >>> >> > >>> >> >> i know the process is not stuck.. and the process is running because >>> i >>> >> >> turned on the hadoop logs and i can see logs being written to it... >>> >> >> I'm not sure how to check if the task is completely stuck or not... >>> >> >> >>> >> > >>> >> > run jps to identify the process id then *jstack id* several times to >>> see >>> >> if >>> >> > it is blocked at the same place >>> >> > >>> >> > how much space do you have left on the partition where /tmp is >>> mounted? >>> >> > >>> >> > J. >>> >> > >>> >> > >>> >> > >>> >> >> Below is the sample log as i'm sending this email.... Its been on >>> the >>> >> >> updatedb process for the last 19 days and the it has been generating >>> >> >> debug logs similar to this........ >>> >> >> >>> >> >> Has anyone else has this same issue before... >>> >> >> >>> >> >> >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group >>> >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group >>> >> >> org.apache.hadoop.mapred.Task$Counter with bundle >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding >>> >> >> COMBINE_OUTPUT_RECORDS >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding >>> MAP_INPUT_RECORDS >>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >>> MAP_OUTPUT_BYTES >>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >>> MAP_INPUT_BYTES >>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >>> >> MAP_OUTPUT_RECORDS >>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >>> >> >> COMBINE_INPUT_RECORDS >>> >> >> 2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0% >>> >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: >>> >> >> record full = true >>> >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; >>> >> >> bufend = 13893589; bufvoid = 99614720 >>> >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; >>> kvend >>> >> >> = 65533; length = 327680 >>> >> >> 2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3 >>> >> >> 2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map >>> >> output >>> >> >> 2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4 >>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228) >>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246) >>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246) >>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246) >>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246) >>> >> >> 2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted >>> segments >>> >> >> 2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last >>> >> >> merge-pass, with 5 segments left of total size: 1192 bytes >>> >> >> 2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358) >>> >> >> 2009-11-02 13:34:23,394 INFO mapred.TaskRunner - >>> >> >> Task:attempt_local_0001_m_000003_0 is done. And is in the process of >>> >> >> commiting >>> >> >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - >>> >> >> attempt_local_0001_m_000003_0 Progress/ping thread exiting since it >>> >> >> got interrupted >>> >> >> 2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - >>> >> >> >>> >> >> >>> >> >>> file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432 >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group >>> >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group >>> >> >> org.apache.hadoop.mapred.Task$Counter with bundle >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> >> >> COMBINE_OUTPUT_RECORDS >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> MAP_INPUT_RECORDS >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> MAP_OUTPUT_BYTES >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> MAP_INPUT_BYTES >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> >> MAP_OUTPUT_RECORDS >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> >> >> COMBINE_INPUT_RECORDS >>> >> >> 2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task >>> >> >> 'attempt_local_0001_m_000003_0' done. >>> >> >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 >>> 0:0 >>> >> >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration - >>> >> >> java.io.IOException: config(config) >>> >> >> at >>> >> >> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192) >>> >> >> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139) >>> >> >> at >>> >> >> >>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) >>> >> >> >>> >> >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split >>> to >>> >> >> /tmp/hadoop-root/mapred/local/localRunner/split.dta >>> >> >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - >>> >> >> attempt_local_0001_m_000004_0 Progress/ping thread started >>> >> >> 2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1 >>> >> >> 2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100 >>> >> >> 2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0% >>> >> >> >>> >> >> Mathan >>> >> >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <ab@...> >>> wrote: >>> >> >> > Kalaimathan Mahenthiran wrote: >>> >> >> >> >>> >> >> >> I forgot to add the detail... >>> >> >> >> >>> >> >> >> The segment i'm trying to do updatedb on has 1.3 millions urls >>> >> fetched >>> >> >> >> and 1.08 million urls parsed.. >>> >> >> >> >>> >> >> >> Any help related to this would be appreciated... >>> >> >> >> >>> >> >> >> >>> >> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran >>> >> >> >> <mathan55@...> wrote: >>> >> >> >>> >>> >> >> >>> hi everyone >>> >> >> >>> >>> >> >> >>> I'm using nutch 1.0. I have fetched successfully and currently >>> on >>> >> the >>> >> >> >>> updatedb process. I'm doing updatedb and its taking so long. I >>> don't >>> >> >> >>> know why its taking this long. I have a new machine with quad >>> core >>> >> >> >>> processor and 8 gb of ram. >>> >> >> >>> >>> >> >> >>> I believe this system is really good in terms of processing >>> power. I >>> >> >> >>> don't think processing power is the problem here. I noticed that >>> all >>> >> >> >>> the ram is getting using up. close to 7.7gb by the updatedb >>> process. >>> >> >> >>> The computer is becoming is really slow. >>> >> >> >>> >>> >> >> >>> The updatedb process has been running for the last 19 days >>> >> continually >>> >> >> >>> with the message merging segment data into db.. Does anyone know >>> why >>> >> >> >>> its taking so long... Is there any configuration setting i can >>> do to >>> >> >> >>> increase the speed of the updatedb process... >>> >> >> > >>> >> >> > First, this process normally takes just a few minutes, depending >>> on >>> >> the >>> >> >> > hardware, and not several days - so something is wrong. >>> >> >> > >>> >> >> > * do you run this in "local" or pseudo-distributed mode (i.e. >>> running >>> >> a >>> >> >> real >>> >> >> > jobtracker and tasktracker?) Try the pseudo-distributed mode, >>> because >>> >> >> then >>> >> >> > you can monitor the progress in the web UI. >>> >> >> > >>> >> >> > * how many reduce tasks do you have? with large updates it helps >>> if >>> >> you >>> >> >> run >>> >> >> >> 1 reducer, to split the final sorting. >>> >> >> > >>> >> >> > * if the task appears to be completely stuck, please generate a >>> thread >>> >> >> dump >>> >> >> > (kill -SIGQUIT) and see where it's stuck. This could be related to >>> >> >> > urlfilter-regex or urlnormalizer-regex - you can identify if these >>> are >>> >> >> > problematic by removing them from the config and re-running the >>> >> >> operation. >>> >> >> > >>> >> >> > * minor issue - when specifying the path names of segments and >>> >> crawldb, >>> >> >> do >>> >> >> > NOT append the trailing slash - it's not harmful in this >>> particular >>> >> case, >>> >> >> > but you could have a nasty surprise when doing e.g. copy / mv >>> >> operations >>> >> >> ... >>> >> >> > >>> >> >> > -- >>> >> >> > Best regards, >>> >> >> > Andrzej Bialecki <>< >>> >> >> > ___. ___ ___ ___ _ _ __________________________________ >>> >> >> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web >>> >> >> > ___|||__|| \| || | Embedded Unix, System Integration >>> >> >> > http://www.sigram.com Contact: info at sigram dot com >>> >> >> > >>> >> >> > >>> >> >> >>> >> > >>> >> > >>> >> > >>> >> > -- >>> >> > DigitalPebble Ltd >>> >> > http://www.digitalpebble.com >>> >> > >>> >> >>> > >>> > >>> > >>> > -- >>> > DigitalPebble Ltd >>> > http://www.digitalpebble.com >>> > >>> >> >> >> > |
| Free embeddable forum powered by Nabble | Forum Help |