Region server going down

by Lucas Nazário dos Santos

Hi,

Today one regionserver crashed and I can't figure out why. Everything
started with the message "server,60020,1255644477834 znode expired". I'm
still running the cluster on little memory, and swap gets in my way
from time to time (it's rare, but I need to fix it). Could that be the cause
of the error below? Do you think five minutes is enough for the
zookeeper.session.timeout property? And why do I get the message "wrong key
class: org.apache.hadoop.hbase.regionserver.HLogKey is not class"?

My tests show that whenever ZooKeeper "shakes", the whole cluster goes down.
Shouldn't HBase be more robust with regard to ZooKeeper? Something like a
retry strategy...

Lucas



2009-10-16 15:07:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 2 region servers, 0 dead, average load 7.0
2009-10-16 15:07:32,537 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>}
2009-10-16 15:07:32,560 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>} complete
2009-10-16 15:07:32,654 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 192.168.1.3:60020, regionname: .META.,,1, startKey: <>}
2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 12 row(s) of meta region {server: 192.168.1.3:60020, regionname: .META.,,1, startKey: <>} complete
2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2009-10-16 15:08:09,551 INFO org.apache.hadoop.hbase.master.ServerManager: server,60020,1255644477834 znode expired
2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server server,60020,1255644477834: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
2009-10-16 15:08:09,623 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 20 hlog(s) in hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
2009-10-16 15:08:09,841 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255644478353 -- continuing. Possible DATA LOSS!
java.io.IOException: wrong key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class org.apache.hadoop.hbase.regionserver.transactional.THLogKey
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
        at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
        at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
2009-10-16 15:08:09,870 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255648058463 -- continuing. Possible DATA LOSS!
java.io.IOException: wrong key class: org.apache.hadoop.hbase.regionserver.HLogKey is not class org.apache.hadoop.hbase.regionserver.transactional.THLogKey
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
        at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
        at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
2009-10-16 15:08:09,886 WARN org.apache.hadoop.hbase.regionserver.HLog: Exception processing hdfs://server2:9000/hbase/.logs/server,60020,12556

// More wrong key class errors...

2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.regionserver.HLog: hlog file splitting completed in 594 millis for hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Log split complete, meta reassignment and scanning:
2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionServerOperation: ProcessServerShutdown reassigning ROOT region
2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
2009-10-16 15:08:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 1 dead, average load 6.0[server,60020,1255644477834]

Re: Region server going down

by Ryan Rawson

Hey,

ZooKeeper is a pretty fundamental part of how we make things
happen in HBase.  The problem comes when you lose your session, because
the session is how we synchronize between the master and the
regionserver.  At that point neither side knows what the other knows,
and the safest thing is to abort the regionserver.  Without that, we can
end up with the same region assigned more than once, which is pretty messy.

ZK is like DNS and the network: without it running, we are more or
less in trouble.  There is no effective difference between a crashed
machine and one that is having network problems, so they are treated
the same and recovery is the same.
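
To illustrate the point above, here is a toy model (not HBase or
ZooKeeper code; the function name, server names, timestamps and the
timeout value are all made up for the example). From the observer's
side, the only signal is how long ago a server was last heard from, so
a dead process and a stalled one look identical:

```python
def expired_sessions(last_heartbeat, now, timeout):
    """Return the servers whose last heartbeat is older than `timeout` seconds."""
    return sorted(
        server
        for server, seen_at in last_heartbeat.items()
        if now - seen_at > timeout
    )


# Hypothetical timestamps, in seconds.
heartbeats = {
    "crashed-server": 100.0,   # process died at t=100
    "swapping-server": 100.0,  # still alive, but stalled (e.g. by swap) since t=100
    "healthy-server": 139.0,   # heard from 2 seconds ago
}

# Both silent servers are reported; nothing distinguishes the two cases.
print(expired_sessions(heartbeats, now=141.0, timeout=40.0))
```

A longer timeout tolerates longer stalls, at the price of taking longer
to notice a server that really is gone.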

Having said that, the session timeout is set in HBase, and I think it
ships at 40 seconds or so.  So it should take more than a minor
problem or a few lost packets to induce a crash.  On the other hand,
if you are killing the entire ZK cluster and expecting HBase to
be OK, that is not really what will happen.  This is why ZK is run in
a 2N+1 configuration, so you can do rolling reboots and survive the
loss of N machines.  ZK is required to be up 24/7, but luckily it is
fairly reliable.
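
The 2N+1 arithmetic can be sketched as follows. This is only the
majority rule written out, with hypothetical function names; it is not
taken from ZooKeeper itself:

```python
def tolerated_failures(ensemble_size):
    """Number of servers that can be down while a strict majority remains."""
    return (ensemble_size - 1) // 2


def has_quorum(ensemble_size, servers_up):
    """True if the servers still up form a strict majority of the ensemble."""
    return servers_up > ensemble_size // 2


for size in (1, 3, 4, 5):
    print(size, "servers tolerate", tolerated_failures(size), "failure(s)")
```

Note that under this rule an ensemble of 4 tolerates no more failures
than an ensemble of 3, which is why odd sizes (2N+1) are the usual
choice.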

With hdfs 0.21, at least we'll be able to have effective hlog recovery.

Now, your specific problem looks like a common issue where the master
and regionservers are confused about what type of server they are
running.  I don't personally run the indexed or transactional
extensions (they are not as inherently scalable), so maybe someone
else can chime in.
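
A simplified model of that kind of mismatch is sketched below. It
assumes the error means "the reader asked for one key class while the
file records another", which is how the message in the log reads; the
classes and functions here are hypothetical stand-ins, not the Hadoop
SequenceFile API:

```python
class LogFile:
    """Toy log file that remembers which key class it was written with."""

    def __init__(self, key_class_name, entries):
        self.key_class_name = key_class_name
        self.entries = entries


def read_entries(log_file, expected_key_class):
    """Return the entries, refusing to read if the key classes differ."""
    if expected_key_class != log_file.key_class_name:
        raise IOError(
            f"wrong key class: {expected_key_class} "
            f"is not class {log_file.key_class_name}"
        )
    return list(log_file.entries)


# A writer configured for the transactional key type...
log = LogFile("THLogKey", [("row1", "value1")])

# ...and a reader configured for the plain key type.
try:
    read_entries(log, "HLogKey")
except IOError as err:
    print(err)
```

In this model the fix is to configure reader and writer with the same
key class; whether that maps exactly onto your setup is something the
people running the transactional extension can confirm.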

-ryan

On Fri, Oct 16, 2009 at 1:29 PM, Lucas Nazário dos Santos
<nazario.lucas@...> wrote:

> [...]

Re: Region server going down

by Lucas Nazário dos Santos

Thanks a lot, Ryan. Your explanation was very helpful. It's not the first
time I've seen someone say that the indexed option is not "as inherently
scalable". I'll remove it and take care of my indexes manually. I also need
to fix the swap problem.

Lucas




On Fri, Oct 16, 2009 at 10:12 PM, Ryan Rawson <ryanobjc@...> wrote:

> [...]

Re: Region server going down

by Clint Morgan-3

In your first post, you are hitting issue 1858. It is fixed in trunk and on
the 0.20 branch, but you will need to add the config value to recover from
the WAL.

I take issue with Ryan's hand-wavy statement that the index/trx extensions
are not scalable.

With indexing you pay an extra cost on puts, which is essentially a
constant times the number of indexes. But this still scales with the number
of rows/requests.  If you want those indexes, you will have to pay that
maintenance cost. And putting the maintenance in the regionserver makes the
gets needed to rebuild the indexes a bit cheaper.
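
As a rough sketch of that cost (an in-memory toy with hypothetical
names, not the indexed extension's code), each put does one write to the
main table plus one write per index, and it needs the old row to remove
stale index entries:

```python
class IndexedTable:
    """Toy table that maintains one secondary index per indexed column."""

    def __init__(self, indexed_columns):
        self.rows = {}
        self.indexes = {col: {} for col in indexed_columns}
        self.writes = 0  # counts table writes plus index writes

    def put(self, row_key, values):
        # The "get" needed to find stale index entries for this row.
        old = self.rows.get(row_key, {})
        self.rows[row_key] = dict(values)
        self.writes += 1
        for col, index in self.indexes.items():
            if col in old:
                index.get(old[col], set()).discard(row_key)
            if col in values:
                index.setdefault(values[col], set()).add(row_key)
            self.writes += 1  # one extra write per index

    def lookup(self, col, value):
        """Return the row keys whose `col` currently equals `value`."""
        return sorted(self.indexes[col].get(value, ()))


table = IndexedTable(["city", "lang"])
table.put("u1", {"city": "Recife", "lang": "pt"})
table.put("u2", {"city": "Recife", "lang": "en"})
table.put("u1", {"city": "Natal", "lang": "pt"})

print(table.writes)                    # 3 puts * (1 + 2 indexes)
print(table.lookup("city", "Recife"))
```

The per-put overhead depends only on the number of indexes, not on how
many rows the table holds, which is the sense in which it still scales.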

Trx is a different story; it really depends on your workloads. But if you
have lots of small requests that don't often interfere with each other, then
it should scale.

On Mon, Oct 19, 2009 at 3:42 AM, Lucas Nazário dos Santos <
nazario.lucas@...> wrote:

> [...]