[jira] Created: (HADOOP-2660) Regions getting messages from master to MSG_REGION_CLOSE_WITHOUT_REPORT

by JIRA jira@apache.org

Regions getting messages from master to MSG_REGION_CLOSE_WITHOUT_REPORT
-----------------------------------------------------------------------

                 Key: HADOOP-2660
                 URL: https://issues.apache.org/jira/browse/HADOOP-2660
             Project: Hadoop
          Issue Type: Bug
          Components: contrib/hbase
            Reporter: Billy Pearson
            Priority: Critical
             Fix For: 0.16.0


I think we addressed this in HADOOP-2295, but I have found it showing up again.

My hlog size is set to 250,000, so on recovery from a failed region server, scanning the logs takes longer than the
hbase.hbasemaster.maxregionopen default of 30 seconds.

The master thinks the region is open, but the region server closes the region when it is done recovering because the master sent a
MSG_REGION_CLOSE_WITHOUT_REPORT to the region server.

I was able to get my table back online completely by adding
hbase.hbasemaster.maxregionopen with a value of 300000 milliseconds to my hbase-site.xml file
and restarting.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2660) Regions getting messages from master to MSG_REGION_CLOSE_WITHOUT_REPORT

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560562#action_12560562 ]

Billy Pearson commented on HADOOP-2660:
---------------------------------------

I was going to try to find the information in the logs, but the logs are very large with all the debug info from the recovery.


[jira] Updated: (HADOOP-2660) Regions getting messages from master to MSG_REGION_CLOSE_WITHOUT_REPORT

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/HADOOP-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Billy Pearson updated HADOOP-2660:
----------------------------------

    Priority: Major  (was: Critical)


[jira] Commented: (HADOOP-2660) Regions getting messages from master to MSG_REGION_CLOSE_WITHOUT_REPORT

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560641#action_12560641 ]

Billy Pearson commented on HADOOP-2660:
---------------------------------------

I can think of two options that would help solve this problem; we might need to use both.

Option 1
Build in a backlog limit on how many pending opens any one region server can have before it stops accepting new opens.
For example, finding the maximum sequence id for a region takes a lot less time than doing a recovery on a region. So a recovering server's queue would fill up faster, making the master send some open requests to different servers while this one catches up, or loop until one of the region servers has open slots in its pending-open queue. I think 60 seconds is the default loop time, so a server should be able to handle 10 pending opens or something like that. Maybe make the limit an option in the conf.
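
A minimal sketch of the backlog limit proposed in option 1, assuming a fixed-capacity queue on each region server. The class PendingOpenBacklog, its method names, and the capacity of 2 are hypothetical and are not part of HBase. The sketch only shows that a full backlog refuses further opens, at which point the master could, under this proposal, offer the region to a different server.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical region-server-side backlog of pending region opens (not HBase code). */
class PendingOpenBacklog {
    private final BlockingQueue<String> pending;

    PendingOpenBacklog(int maxPendingOpens) {
        // Fixed capacity: this is the proposed configurable limit.
        this.pending = new ArrayBlockingQueue<>(maxPendingOpens);
    }

    /** Returns true if the open request was queued, false if the backlog is full. */
    boolean tryAccept(String regionName) {
        return pending.offer(regionName);
    }

    /** Removes the oldest pending open once it has finished; returns null if none is pending. */
    String completeNext() {
        return pending.poll();
    }

    int size() {
        return pending.size();
    }
}

class PendingOpenBacklogDemo {
    public static void main(String[] args) {
        PendingOpenBacklog backlog = new PendingOpenBacklog(2);
        System.out.println(backlog.tryAccept("region-a")); // true
        System.out.println(backlog.tryAccept("region-b")); // true
        System.out.println(backlog.tryAccept("region-c")); // false: backlog full
        System.out.println(backlog.completeNext());        // region-a
        System.out.println(backlog.tryAccept("region-c")); // true: a slot was freed
    }
}
```

A slow recovery keeps its slot occupied longer than a quick open does, so a server busy with recoveries would reject new opens sooner, which is the load-spreading effect the option describes.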

Option 2

1. Confirm that we received the master's open request once we receive it.

Once confirmed, the master should not reassign the region to any other region server unless the region server goes offline and loses its lease.

2. Confirm whether the open of the region succeeded or failed.

The master can make sure the region server is still alive by keeping up with its heartbeat.


[jira] Commented: (HADOOP-2660) Regions getting messages from master to MSG_REGION_CLOSE_WITHOUT_REPORT

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561859#action_12561859 ]

Jim Kellerman commented on HADOOP-2660:
---------------------------------------

> I can think of two options that would help solve this problem; we might need to use both.

> Option 1
> Build in a backlog limit on how many pending opens any one region server can have before it stops
> accepting new opens.
> For example, finding the maximum sequence id for a region takes a lot less time than doing a recovery on a
> region. So a recovering server's queue would fill up faster, making the master send some open requests to
> different servers while this one catches up, or loop until one of the region servers has open slots in its
> pending-open queue. I think 60 seconds is the default loop time, so a server should be able to handle 10
> pending opens or something like that. Maybe make the limit an option in the conf.

> Option 2
>
> 1. Confirm that we received the master's open request once we receive it.
>
> Once confirmed, the master should not reassign the region to any other region server unless the region
> server goes offline and loses its lease.

In fact, this is exactly what happens today. When a region server receives an open region request, it replies
in its next heartbeat to the master with MSG_REPORT_PROCESS_OPEN, which means "I got your request
and am working on it." When the master receives this message, it adds
hbase.hbasemaster.maxregionopen (currently 30 seconds) to the amount of time before it will try to
assign the region again. If it is taking longer than 30 seconds for a region server to open a region,
I would suggest increasing the value of this parameter to 60000 (60 seconds).

> 2 Confirm the open of the region success or failed

When the region server has opened the region, it sends a MSG_REPORT_OPEN to the master
meaning that it is now serving the region.

> The master can make sure the region server is still alive by keeping up with heartbeat

It is the region server that sends the heartbeat to the master, but this is exactly what happens.
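
A minimal sketch of the timing described in this comment, to show why a slow recovery can outlast the master's patience. The class RegionOpenTracker, its method names, the initial deadline of one window after assignment, the single extension on the first MSG_REPORT_PROCESS_OPEN, and the 120-second recovery are all assumptions made for illustration; this is not the HBase master's code.

```java
/** Hypothetical master-side view of one region assignment (not HBase code). */
class RegionOpenTracker {
    enum State { ASSIGNED, PROCESSING_OPEN, OPEN }

    private final long maxRegionOpenMillis; // stands in for hbase.hbasemaster.maxregionopen
    private long reassignDeadlineMillis;
    private State state = State.ASSIGNED;

    RegionOpenTracker(long maxRegionOpenMillis, long assignedAtMillis) {
        this.maxRegionOpenMillis = maxRegionOpenMillis;
        // Assumption: the master initially waits one window after sending the open request.
        this.reassignDeadlineMillis = assignedAtMillis + maxRegionOpenMillis;
    }

    /** Region server acknowledged the request (MSG_REPORT_PROCESS_OPEN): extend the wait once. */
    void onReportProcessOpen() {
        if (state == State.ASSIGNED) {
            state = State.PROCESSING_OPEN;
            reassignDeadlineMillis += maxRegionOpenMillis;
        }
    }

    /** Region server is now serving the region (MSG_REPORT_OPEN). */
    void onReportOpen() {
        state = State.OPEN;
    }

    /** True if the region is still not open and the deadline has passed. */
    boolean shouldReassign(long nowMillis) {
        return state != State.OPEN && nowMillis > reassignDeadlineMillis;
    }
}

class RegionOpenTrackerDemo {
    public static void main(String[] args) {
        long recoveryDoneAt = 120_000L; // hypothetical: log recovery takes 120 seconds

        RegionOpenTracker defaultWindow = new RegionOpenTracker(30_000L, 0L);
        defaultWindow.onReportProcessOpen();
        System.out.println(defaultWindow.shouldReassign(recoveryDoneAt)); // true

        RegionOpenTracker raisedWindow = new RegionOpenTracker(300_000L, 0L);
        raisedWindow.onReportProcessOpen();
        System.out.println(raisedWindow.shouldReassign(recoveryDoneAt)); // false
        raisedWindow.onReportOpen();
        System.out.println(raisedWindow.shouldReassign(700_000L)); // false: region is open
    }
}
```

Under these assumptions the 30-second default runs out before the recovery finishes, so the region is handed to another server, while the 300000 ms value from the original report leaves enough time for MSG_REPORT_OPEN to arrive.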


