[jira] Created: (HADOOP-2393) healthy tasks dying because tasktracker unresponsive while trying to kill errant tasks

View: New views
4 Messages — Rating Filter:   Alert me  

[jira] Created: (HADOOP-2393) healthy tasks dying because tasktracker unresponsive while trying to kill errant tasks

by JIRA jira@apache.org :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

healthy tasks dying because tasktracker unresponsive while trying to kill errant tasks
--------------------------------------------------------------------------------------

                 Key: HADOOP-2393
                 URL: https://issues.apache.org/jira/browse/HADOOP-2393
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.13.1
         Environment: 0.13.1, quad-code x86-64, FC-linux. -xmx2048
ipc.client.timeout = 10000
            Reporter: Joydeep Sen Sarma


we have some bad jobs where the reduces are getting stalled (for unknown reason). The task tracker kills these processes from time to time.

Everytime one of these events happens - other (healthy) map tasks in the same node are also killed. Looking at the logs and code up to 0.14.3 - it seems like the child tasks pings to the task tracker are timed out and the child task self-terminates.

tasktracker log:

// notice the good 10+ second gap in logs on otherwise busy node:
2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.                                      
2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.                                        
2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
24 active threads                                                                                                                                  

... huge stack trace dump in logfile ...

something was going on at this time which caused to the tasktracker to essentially stall. all the pings are discarded. after stack trace dump:

2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
 discarded for being too old (21380)                                                                                                                
2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
 discarded for being too old (21380)                                                                                                                
2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
 discarded for being too old (10367)                                                                                                                
2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
 discarded for being too old (10360)                                                                                                                
2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error    

looking at code, failure of client to ping causes termination:

              else {                                                                                                                                
                // send ping                                                                                                                        
                taskFound = umbilical.ping(taskId);                                                                                                
              }                                                                                                                                    
...
            catch (Throwable t) {                                                                                                                  
              LOG.info("Communication exception: " + StringUtils.stringifyException(t));                                                            
              remainingRetries -=1;                                                                                                                
              if (remainingRetries == 0) {                                                                                                          
                ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);                                                                  
                LOG.warn("Last retry, killing "+taskId);                                                                                            
                System.exit(65);                                                                                                                    

exit code is 65 as reported by task tracker.

i don't see an option to turn off stack trace dump (which could be a likely cause) - and i would hate to bump up timeout because of this. Crap.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-2393) healthy tasks dying because unable to ping tasktracker

by JIRA jira@apache.org :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


     [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HADOOP-2393:
--------------------------------------

             Priority: Critical  (was: Major)
    Affects Version/s:     (was: 0.13.1)
                       0.14.4
              Summary: healthy tasks dying because unable to ping tasktracker  (was: healthy tasks dying because tasktracker unresponsive while trying to kill errant tasks)

trying to get a response here. ignore initial filing comments. tasks dying due to exit code 65 is the #1 cause of task failure in our environment. we have a timeout value of 10s and a substantial set of tasks fail with this error. fixing this would substantially improve the efficiency of our cluster. i understand this is simply the task committing suicide after failing to ping the tasktracker.

i am going to try some things out - for starters just trying to run tasktrackers at higher process priority than the slave nodes. i don't want to raise the global timeout because it makes everything less responsive (retries take toooo long).

if developers have some insight on how to tune the ping codepath to make it more resilient - that would be awesome.

> healthy tasks dying because unable to ping tasktracker
> ------------------------------------------------------
>
>                 Key: HADOOP-2393
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2393
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.4
>         Environment: 0.13.1, quad-code x86-64, FC-linux. -xmx2048
> ipc.client.timeout = 10000
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> we have some bad jobs where the reduces are getting stalled (for unknown reason). The task tracker kills these processes from time to time.
> Everytime one of these events happens - other (healthy) map tasks in the same node are also killed. Looking at the logs and code up to 0.14.3 - it seems like the child tasks pings to the task tracker are timed out and the child task self-terminates.
> tasktracker log:
> // notice the good 10+ second gap in logs on otherwise busy node:
> 2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.                                      
> 2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.                                        
> 2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
> 24 active threads                                                                                                                                  
> ... huge stack trace dump in logfile ...
> something was going on at this time which caused to the tasktracker to essentially stall. all the pings are discarded. after stack trace dump:
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
>  discarded for being too old (21380)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
>  discarded for being too old (21380)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
>  discarded for being too old (10367)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
>  discarded for being too old (10360)                                                                                                                
> 2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error    
> looking at code, failure of client to ping causes termination:
>               else {                                                                                                                                
>                 // send ping                                                                                                                        
>                 taskFound = umbilical.ping(taskId);                                                                                                
>               }                                                                                                                                    
> ...
>             catch (Throwable t) {                                                                                                                  
>               LOG.info("Communication exception: " + StringUtils.stringifyException(t));                                                            
>               remainingRetries -=1;                                                                                                                
>               if (remainingRetries == 0) {                                                                                                          
>                 ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);                                                                  
>                 LOG.warn("Last retry, killing "+taskId);                                                                                            
>                 System.exit(65);                                                                                                                    
> exit code is 65 as reported by task tracker.
> i don't see an option to turn off stack trace dump (which could be a likely cause) - and i would hate to bump up timeout because of this. Crap.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2393) healthy tasks dying because unable to ping tasktracker

by JIRA jira@apache.org :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561228#action_12561228 ]

Devaraj Das commented on HADOOP-2393:
-------------------------------------

I am thinking whether the task should base its decision on whether to kill itself or not on _connect_ as opposed to the ping RPC failures. As your tasktracker logs points out, the tasktracker is just too slow in processing the RPCs for some reason, and might eventually recover. In the case where the tasktracker is actually down, the _connect_ call would fail and task can kill itself. Also, the task should kill itself when the ping RPC is successful and returns a false. Comments?

> healthy tasks dying because unable to ping tasktracker
> ------------------------------------------------------
>
>                 Key: HADOOP-2393
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2393
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.4
>         Environment: 0.13.1, quad-code x86-64, FC-linux. -xmx2048
> ipc.client.timeout = 10000
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> we have some bad jobs where the reduces are getting stalled (for unknown reason). The task tracker kills these processes from time to time.
> Everytime one of these events happens - other (healthy) map tasks in the same node are also killed. Looking at the logs and code up to 0.14.3 - it seems like the child tasks pings to the task tracker are timed out and the child task self-terminates.
> tasktracker log:
> // notice the good 10+ second gap in logs on otherwise busy node:
> 2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.                                      
> 2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.                                        
> 2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
> 24 active threads                                                                                                                                  
> ... huge stack trace dump in logfile ...
> something was going on at this time which caused to the tasktracker to essentially stall. all the pings are discarded. after stack trace dump:
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
>  discarded for being too old (21380)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
>  discarded for being too old (21380)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
>  discarded for being too old (10367)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
>  discarded for being too old (10360)                                                                                                                
> 2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error    
> looking at code, failure of client to ping causes termination:
>               else {                                                                                                                                
>                 // send ping                                                                                                                        
>                 taskFound = umbilical.ping(taskId);                                                                                                
>               }                                                                                                                                    
> ...
>             catch (Throwable t) {                                                                                                                  
>               LOG.info("Communication exception: " + StringUtils.stringifyException(t));                                                            
>               remainingRetries -=1;                                                                                                                
>               if (remainingRetries == 0) {                                                                                                          
>                 ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);                                                                  
>                 LOG.warn("Last retry, killing "+taskId);                                                                                            
>                 System.exit(65);                                                                                                                    
> exit code is 65 as reported by task tracker.
> i don't see an option to turn off stack trace dump (which could be a likely cause) - and i would hate to bump up timeout because of this. Crap.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2393) healthy tasks dying because unable to ping tasktracker

by JIRA jira@apache.org :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561423#action_12561423 ]

Joydeep Sen Sarma commented on HADOOP-2393:
-------------------------------------------

that would seem acceptable.

on similar lines - the start-stop scripts look at the pids file to detect if the tasktracker is running. we could use the same logic (signal the tasktracker pid to see if it's alive). dunno if this is easily doable in java.

> healthy tasks dying because unable to ping tasktracker
> ------------------------------------------------------
>
>                 Key: HADOOP-2393
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2393
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.4
>         Environment: 0.13.1, quad-code x86-64, FC-linux. -xmx2048
> ipc.client.timeout = 10000
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> we have some bad jobs where the reduces are getting stalled (for unknown reason). The task tracker kills these processes from time to time.
> Everytime one of these events happens - other (healthy) map tasks in the same node are also killed. Looking at the logs and code up to 0.14.3 - it seems like the child tasks pings to the task tracker are timed out and the child task self-terminates.
> tasktracker log:
> // notice the good 10+ second gap in logs on otherwise busy node:
> 2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.                                      
> 2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.                                        
> 2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
> 24 active threads                                                                                                                                  
> ... huge stack trace dump in logfile ...
> something was going on at this time which caused to the tasktracker to essentially stall. all the pings are discarded. after stack trace dump:
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
>  discarded for being too old (21380)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
>  discarded for being too old (21380)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
>  discarded for being too old (10367)                                                                                                                
> 2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
>  discarded for being too old (10360)                                                                                                                
> 2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error    
> looking at code, failure of client to ping causes termination:
>               else {                                                                                                                                
>                 // send ping                                                                                                                        
>                 taskFound = umbilical.ping(taskId);                                                                                                
>               }                                                                                                                                    
> ...
>             catch (Throwable t) {                                                                                                                  
>               LOG.info("Communication exception: " + StringUtils.stringifyException(t));                                                            
>               remainingRetries -=1;                                                                                                                
>               if (remainingRetries == 0) {                                                                                                          
>                 ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);                                                                  
>                 LOG.warn("Last retry, killing "+taskId);                                                                                            
>                 System.exit(65);                                                                                                                    
> exit code is 65 as reported by task tracker.
> i don't see an option to turn off stack trace dump (which could be a likely cause) - and i would hate to bump up timeout because of this. Crap.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.