HBase Exceptions on version 0.20.1

View: New views
15 Messages — Rating Filter:   Alert me  

HBase Exceptions on version 0.20.1

by elsif :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


Hello Everyone,
 
We are experiencing many HBase exceptions while running the attached
test program using
HBase 0.20.1 and Hadoop 0.20.1 (r810220) releases on a Hadoop cluster
with 14 data nodes.
 
The HBase operational performance degrades quickly once certain error
conditions are
encountered especially if there are issues with the META region.
 
Has anyone else seen these exceptions listed below?
 
We ran the test with 3 region servers and then with 14 region servers.
In both cases, we still got exceptions.
 
In each test run, we shutdown HBase, removed the main /hbase
folder, restarted HBase and then ran the test program.
The test program uses a table named "fc_test". The program is started by
the following command:
 
/opt/hbase/bin/hbase org.apache.hadoop.hbase.util.TableTest  30  4000000
 
The attached HBase test program starts 30 threads. Each thread loops,
inserting randomly generated keys
along with a fixed data pattern of 4,000,000 bytes.
 
After running the test for 2-12 hours, the table scanning of "fc_test"
is broken and only a limited number of put
operations are still succeeding.
 

 
EXCEPTION LIST
 
------------------------------
2009-10-15 03:28:39,005 DEBUG
org.apache.hadoop.hbase.regionserver.Store: Major compaction triggered
on store data; time since last major compaction 89677908ms
2009-10-15 03:28:39,012 DEBUG
org.apache.hadoop.hbase.regionserver.Store: Started compaction of 4
file(s)  into /hbase/fc_test/compaction.dir/1073627844, seqid=396374
2009-10-15 03:28:39,245 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_-6054029382496843671_324379 from any node:
java.io.IOException: No live nodes contain current block
------------------------------
2009-10-15 07:41:43,426 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
response for sessionid:0x2244f6194a5000d after 1ms
2009-10-15 07:41:53,993 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x2244f6194a5000d to sun.nio.ch.SelectionKeyImpl@139f5ca
<mailto:sun.nio.ch.SelectionKeyImpl@139f5ca>
java.io.IOException: TIMED OUT
     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2009-10-15 07:41:54,344 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Disconnected, type: None, path: null
2009-10-15 07:41:55,084 INFO org.apache.zookeeper.ClientCnxn: Attempting
connection to server hfs-029014/10.1.29.14:2181
2009-10-15 07:41:55,085 INFO org.apache.zookeeper.ClientCnxn: Priming
connection to java.nio.channels.SocketChannel[connected
local=/10.1.29.19:39421 remote=hfs-029014/10.1.29.14:2181]
------------------------------
2009-10-15 08:37:40,982 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 6 on 60020 caught: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
    at
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
------------------------------
2009-10-15 07:57:28,025 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=4.652237MB (4878224), Free=987.6603MB (1035636848), Max=992.3125MB
(1040515072), Counts: Blocks=0, Access=51003, Hit=0, Miss=51003,
Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%,
Evicted/Run=NaN
2009-10-15 07:57:28,189 INFO org.apache.hadoop.hdfs.DFSClient: Exception
in createBlockOutputStream java.io.EOFException
2009-10-15 07:57:28,190 INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_7166000434569842394_332632
2009-10-15 07:57:28,306 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 5 on 60020 took 2297ms appending an edit to hlog;
editcount=146
-----------------------------
2009-10-15 09:10:00,559 INFO org.apache.zookeeper.ClientCnxn: Exception
while closing send thread for session 0x1244f6194780008 : Read error rc
= -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
Disconnecting ClientCnxn for session: 0x1244f6194780008
2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
EventThread shut down
------------------------------
2009-10-13 12:30:58,207 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to read
master address from ZooKeeper. Retrying. Error was:
java.io.IOException:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
NoNode for /hbase/master
    at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:331)
    at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readMasterAddressOrThrow(ZooKeeperWrapper.java:240)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1338)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1370)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:427)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
KeeperErrorCode = NoNode for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:892)
    at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:327)
------------------------------
2009-10-14 15:12:42,664 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
    at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
    at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
    at java.lang.Thread.run(Unknown Source)
------------------------------
2009-10-14 15:12:43,566 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x1244f619478000d to sun.nio.ch.SelectionKeyImpl@15e32c4
<mailto:sun.nio.ch.SelectionKeyImpl@15e32c4>
java.io.IOException: Session Expired
    at
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2009-10-14 15:12:43,567 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Expired, type: None, path: null
2009-10-14 15:12:43,568 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
expired
------------------------------
2009-10-14 15:12:51,563 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
java.io.IOException: Cannot append; log is closed
    at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:584)
    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1445)
    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1244)
    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1831)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
------------------------------
2009-10-14 15:13:51,852 INFO org.apache.hadoop.fs.FileSystem: Could not
cancel cleanup thread, though no FileSystems are open
2009-10-14 15:13:51,864 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
HRegionServer: file system not available
java.io.IOException: File system is not available
    at
org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:125)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.checkFileSystem(HRegionServer.java:901)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:849)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:832)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1839)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
Caused by: java.io.IOException: Call to hfs-029010/10.1.29.10:8020
failed on local exception: java.nio.channels.ClosedByInterruptException
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
    at org.apache.hadoop.ipc.Client.call(Client.java:742)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy1.getFileInfo(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy1.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
    at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
    at
org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:114)
    ... 9 more
Caused by: java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown
Source)
    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
    at
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
    at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.io.BufferedOutputStream.flush(Unknown Source)
    at java.io.DataOutputStream.flush(Unknown Source)
    at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:480)
    at org.apache.hadoop.ipc.Client.call(Client.java:720)
------------------------------
 
PREVIOUS EXCEPTIONS FROM OTHER TESTS INCLUDE
 
2009-10-13 00:00:00,508 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row
out of range for ...
    ....
    at
org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1527)
    at
org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1559)
    at
org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1627)
    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2277)
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1768)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
------------------------------
2009-10-12 10:06:57,809 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 7 on 60020, call put([B@1f5b92a,
[Lorg.apache.hadoop.hbase.client.Put;@14f8e05) from 10.1.29.10:36017:
error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
  at
org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
   at
org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
 at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
 at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-10-12 10:06:57,812 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=0.0, regions=12, stores=24, storefiles=15, storefileIndexSize=8,
memstoreSize=378, usedHeap=441, maxHeap=992, blockCacheSize=4878224,
blockCacheFree=1035636848, blockCacheCount=0, blockCacheHitRatio=0
2009-10-12 10:06:57,813 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 3 on 60020, call put([B@1f52982,
[Lorg.apache.hadoop.hbase.client.Put;@78283f) from 10.1.29.10:36017:
error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
    at
org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
   at
org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
 at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
 at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
------------------------------
 
 
 
 
 
.
 

Re: HBase Exceptions on version 0.20.1

by Jonathan Gray-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Most of these exceptions look related to overloaded servers (GC pauses
causing timeouts, high IO wait tripping up the datanodes, etc).  Have
you turned on GC logging?  Also, are you swapping on these nodes?

Check out the performance tuning page here:

http://wiki.apache.org/hadoop/PerformanceTuning

The WrongRegionException at the end could be a fault but it's hard to
know without seeing the entire context and knowing what the cluster was
up to at that point.

Performance can degrade as the JVMs fill up, get more and more
fragmented, and the GC gets slower.

Also, you are inserting 4MB values?  Those are fairly large, at the
upper-end of what you would want to put into HBase.  Is this your actual
use case?  At the least you'll want to increase your region size
(otherwise you're going to have at most 64 rows per region, often less),
but also consider if HBase is the right place to store 4MB values.

Hope that helps.

JG

elsif wrote:

> Hello Everyone,
>  
> We are experiencing many HBase exceptions while running the attached
> test program using
> HBase 0.20.1 and Hadoop 0.20.1 (r810220) releases on a Hadoop cluster
> with 14 data nodes.
>  
> The HBase operational performance degrades quickly once certain error
> conditions are
> encountered especially if there are issues with the META region.
>  
> Has anyone else seen these exceptions listed below?
>  
> We ran the test with 3 region servers and then with 14 region servers.
> In both cases, we still got exceptions.
>  
> In each test run, we shutdown HBase, removed the main /hbase
> folder, restarted HBase and then ran the test program.
> The test program uses a table named "fc_test". The program is started by
> the following command:
>  
> /opt/hbase/bin/hbase org.apache.hadoop.hbase.util.TableTest  30  4000000
>  
> The attached HBase test program starts 30 threads. Each thread loops,
> inserting randomly generated keys
> along with a fixed data pattern of 4,000,000 bytes.
>  
> After running the test for 2-12 hours, the table scanning of "fc_test"
> is broken and only a limited number of put
> operations are still succeeding.
>  
>
>  
> EXCEPTION LIST
>  
> ------------------------------
> 2009-10-15 03:28:39,005 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: Major compaction triggered
> on store data; time since last major compaction 89677908ms
> 2009-10-15 03:28:39,012 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: Started compaction of 4
> file(s)  into /hbase/fc_test/compaction.dir/1073627844, seqid=396374
> 2009-10-15 03:28:39,245 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-6054029382496843671_324379 from any node:
> java.io.IOException: No live nodes contain current block
> ------------------------------
> 2009-10-15 07:41:43,426 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x2244f6194a5000d after 1ms
> 2009-10-15 07:41:53,993 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x2244f6194a5000d to sun.nio.ch.SelectionKeyImpl@139f5ca
> <mailto:sun.nio.ch.SelectionKeyImpl@139f5ca>
> java.io.IOException: TIMED OUT
>      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2009-10-15 07:41:54,344 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2009-10-15 07:41:55,084 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server hfs-029014/10.1.29.14:2181
> 2009-10-15 07:41:55,085 INFO org.apache.zookeeper.ClientCnxn: Priming
> connection to java.nio.channels.SocketChannel[connected
> local=/10.1.29.19:39421 remote=hfs-029014/10.1.29.14:2181]
> ------------------------------
> 2009-10-15 08:37:40,982 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 6 on 60020 caught: java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
>     at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
> ------------------------------
> 2009-10-15 07:57:28,025 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=4.652237MB (4878224), Free=987.6603MB (1035636848), Max=992.3125MB
> (1040515072), Counts: Blocks=0, Access=51003, Hit=0, Miss=51003,
> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%,
> Evicted/Run=NaN
> 2009-10-15 07:57:28,189 INFO org.apache.hadoop.hdfs.DFSClient: Exception
> in createBlockOutputStream java.io.EOFException
> 2009-10-15 07:57:28,190 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning block blk_7166000434569842394_332632
> 2009-10-15 07:57:28,306 WARN org.apache.hadoop.hbase.regionserver.HLog:
> IPC Server handler 5 on 60020 took 2297ms appending an edit to hlog;
> editcount=146
> -----------------------------
> 2009-10-15 09:10:00,559 INFO org.apache.zookeeper.ClientCnxn: Exception
> while closing send thread for session 0x1244f6194780008 : Read error rc
> = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
> Disconnecting ClientCnxn for session: 0x1244f6194780008
> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> ------------------------------
> 2009-10-13 12:30:58,207 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to read
> master address from ZooKeeper. Retrying. Error was:
> java.io.IOException:
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> NoNode for /hbase/master
>     at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:331)
>     at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readMasterAddressOrThrow(ZooKeeperWrapper.java:240)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1338)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1370)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:427)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
> KeeperErrorCode = NoNode for /hbase/master
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:892)
>     at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:327)
> ------------------------------
> 2009-10-14 15:12:42,664 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>     at java.lang.reflect.Constructor.newInstance(Unknown Source)
>     at
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>     at
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>     at
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>     at java.lang.Thread.run(Unknown Source)
> ------------------------------
> 2009-10-14 15:12:43,566 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x1244f619478000d to sun.nio.ch.SelectionKeyImpl@15e32c4
> <mailto:sun.nio.ch.SelectionKeyImpl@15e32c4>
> java.io.IOException: Session Expired
>     at
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>     at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2009-10-14 15:12:43,567 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Expired, type: None, path: null
> 2009-10-14 15:12:43,568 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
> expired
> ------------------------------
> 2009-10-14 15:12:51,563 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> java.io.IOException: Cannot append; log is closed
>     at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:584)
>     at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1445)
>     at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1244)
>     at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1831)
>     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>     at java.lang.reflect.Method.invoke(Unknown Source)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> ------------------------------
> 2009-10-14 15:13:51,852 INFO org.apache.hadoop.fs.FileSystem: Could not
> cancel cleanup thread, though no FileSystems are open
> 2009-10-14 15:13:51,864 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
> HRegionServer: file system not available
> java.io.IOException: File system is not available
>     at
> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:125)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkFileSystem(HRegionServer.java:901)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:849)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:832)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1839)
>     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>     at java.lang.reflect.Method.invoke(Unknown Source)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> Caused by: java.io.IOException: Call to hfs-029010/10.1.29.10:8020
> failed on local exception: java.nio.channels.ClosedByInterruptException
>     at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
>     at org.apache.hadoop.ipc.Client.call(Client.java:742)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>     at $Proxy1.getFileInfo(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>     at java.lang.reflect.Method.invoke(Unknown Source)
>     at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>     at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>     at $Proxy1.getFileInfo(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
>     at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
>     at
> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:114)
>     ... 9 more
> Caused by: java.nio.channels.ClosedByInterruptException
>     at java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown
> Source)
>     at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>     at
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>     at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>     at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>     at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>     at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>     at java.io.BufferedOutputStream.flush(Unknown Source)
>     at java.io.DataOutputStream.flush(Unknown Source)
>     at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:480)
>     at org.apache.hadoop.ipc.Client.call(Client.java:720)
> ------------------------------
>  
> PREVIOUS EXCEPTIONS FROM OTHER TESTS INCLUDE
>  
> 2009-10-13 00:00:00,508 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row
> out of range for ...
>     ....
>     at
> org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1527)
>     at
> org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1559)
>     at
> org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1627)
>     at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2277)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1768)
>     at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>     at java.lang.reflect.Method.invoke(Unknown Source)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> ------------------------------
> 2009-10-12 10:06:57,809 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 7 on 60020, call put([B@1f5b92a,
> [Lorg.apache.hadoop.hbase.client.Put;@14f8e05) from 10.1.29.10:36017:
> error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>   at
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>    at
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>  at java.lang.reflect.Method.invoke(Unknown Source)
>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>  at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> 2009-10-12 10:06:57,812 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=12, stores=24, storefiles=15, storefileIndexSize=8,
> memstoreSize=378, usedHeap=441, maxHeap=992, blockCacheSize=4878224,
> blockCacheFree=1035636848, blockCacheCount=0, blockCacheHitRatio=0
> 2009-10-12 10:06:57,813 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 3 on 60020, call put([B@1f52982,
> [Lorg.apache.hadoop.hbase.client.Put;@78283f) from 10.1.29.10:36017:
> error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>    at
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>  at java.lang.reflect.Method.invoke(Unknown Source)
>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>  at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> ------------------------------
>  
>  
>  
>  
>  
> .
>  

Re: HBase Exceptions on version 0.20.1

by elsif :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

The cpu load on each of the nodes never goes above 1 and very little if
any swap is in use.

The 4MB is our boundary case - the actual data will always be smaller.
This was just a encapsulated test that reproduces our issue.

I will re-test with the CMS collector and logging enabled and reply back.

Thanks,
elsif

> Most of these exceptions look related to overloaded servers (GC pauses
> causing timeouts, high IO wait tripping up the datanodes, etc).  Have
> you turned on GC logging?  Also, are you swapping on these nodes?
>
> Check out the performance tuning page here:
>
> http://wiki.apache.org/hadoop/PerformanceTuning
>
> The WrongRegionException at the end could be a fault but it's hard to
> know without seeing the entire context and knowing what the cluster
> was up to at that point.
>
> Performance can degrade as the JVMs fill up, get more and more
> fragmented, and the GC gets slower.
>
> Also, you are inserting 4MB values?  Those are fairly large, at the
> upper-end of what you would want to put into HBase.  Is this your
> actual use case?  At the least you'll want to increase your region
> size (otherwise you're going to have at most 64 rows per region, often
> less), but also consider if HBase is the right place to store 4MB values.
>
> Hope that helps.
>
> JG
>


Re: HBase Exceptions on version 0.20.1

by Andrew Purtell-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

The reason JG points to load as being a problem as all signs point to it: This is usually the culprit behind DFS "no live block" errors -- the namenode is too busy and/or falling behind, or the datanodes are falling behind, or actually failing. Also, in the log snippets you provide, HBase is complaining about writes to DFS (for the WAL) taking in excess of 2 seconds. Also highly indicative of load, write load. Shortly after this, Zookeeper sessions begin expiring, which is also usually indicative of overloading -- heartbeats miss their deadline.

When I see these signs on my test clusters, I/O wait is generally in excess of 40%.

If your total CPU load is really just a few % (user + system + iowait), then I'd suggest you look at the storage layer. Is there anything in the datanode logs that seems like it might be relevant?

What about the network? Gigabit? Any potential sources of contention? Are you tracking network utilization metrics during the test?

Also, you might consider using Ganglia to monitor and correlate system metrics and HBase and HDFS metrics during your testing, if you are not doing this already.

   - Andy




________________________________
From: elsif <elsif.then@...>
To: hbase-user@...
Sent: Wed, October 21, 2009 7:47:43 AM
Subject: Re: HBase Exceptions on version 0.20.1

The cpu load on each of the nodes never goes above 1 and very little if
any swap is in use.

The 4MB is our boundary case - the actual data will always be smaller.
This was just a encapsulated test that reproduces our issue.

I will re-test with the CMS collector and logging enabled and reply back.

Thanks,
elsif

> Most of these exceptions look related to overloaded servers (GC pauses
> causing timeouts, high IO wait tripping up the datanodes, etc).  Have
> you turned on GC logging?  Also, are you swapping on these nodes?
>
> Check out the performance tuning page here:
>
> http://wiki.apache.org/hadoop/PerformanceTuning
>
> The WrongRegionException at the end could be a fault but it's hard to
> know without seeing the entire context and knowing what the cluster
> was up to at that point.
>
> Performance can degrade as the JVMs fill up, get more and more
> fragmented, and the GC gets slower.
>
> Also, you are inserting 4MB values?  Those are fairly large, at the
> upper-end of what you would want to put into HBase.  Is this your
> actual use case?  At the least you'll want to increase your region
> size (otherwise you're going to have at most 64 rows per region, often
> less), but also consider if HBase is the right place to store 4MB values.
>
> Hope that helps.
>
> JG
>


     

Re: HBase Exceptions on version 0.20.1

by elsif :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


While running the test on this cluster of 14 servers, the highest loads
I see are 3.68 (0.0% wa) on the master node and 2.65 (3.4% wa) on the
node serving the .META. region.  All the machines are on a single
gigabit switch dedicated to the cluster.  The highest throughput between
nodes has been 21.4MBps Rx on the node hosting the .META. region.

There are 239 "Block blk_-xxx is not valid errors", 522 "BlockInfo not
found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found
in the hadoop logs over 12 hours of running the test.

I understand that I am loading the cluster - that is the point of the
test, but I don't think that this should result in data loss.  Failed
inserts at the client level I can handle, but loss of data that was
previously thought to be stored in hbase is a major issue.  Are there
plans to make hbase more resilient to load based failures?

Regards,
elsif

Andrew Purtell wrote:

> The reason JG points to load as being a problem as all signs point to it: This is usually the culprit behind DFS "no live block" errors -- the namenode is too busy and/or falling behind, or the datanodes are falling behind, or actually failing. Also, in the log snippets you provide, HBase is complaining about writes to DFS (for the WAL) taking in excess of 2 seconds. Also highly indicative of load, write load. Shortly after this, Zookeeper sessions begin expiring, which is also usually indicative of overloading -- heartbeats miss their deadline.
>
> When I see these signs on my test clusters, I/O wait is generally in excess of 40%.
>
> If your total CPU load is really just a few % (user + system + iowait), then I'd suggest you look at the storage layer. Is there anything in the datanode logs that seems like it might be relevant?
>
> What about the network? Gigabit? Any potential sources of contention? Are you tracking network utilization metrics during the test?
>
> Also, you might consider using Ganglia to monitor and correlate system metrics and HBase and HDFS metrics during your testing, if you are not doing this already.
>
>    - Andy
>
>  


Re: HBase Exceptions on version 0.20.1

by stack-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed, Oct 21, 2009 at 8:16 AM, elsif <elsif.then@...> wrote:

>
> There are 239 "Block blk_-xxx is not valid errors", 522 "BlockInfo not
> found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found
> in the hadoop logs over 12 hours of running the test.
>

Above are from application-level (hbase) or datanode logs?  If you trace any
of the above -- follow the block name -- in the NN are the blocks lost or do
you see replicas taking over or recoveries triggered?


>
> I understand that I am loading the cluster - that is the point of the
> test, but I don't think that this should result in data loss.  Failed
> inserts at the client level I can handle, but loss of data that was
> previously thought to be stored in hbase is a major issue.  Are there
> plans to make hbase more resilient to load based failures?
>
>
It looks like there'll be data loss going by a few of the exceptions you
provide originally.  Here's a couple of comments:

"No live nodes contain current block"  Usually we see this if the
client-side hadoop has not been patched with hdfs-127/hadoop-4681.  Your
test program doesn't seem to have come across.  Mind attaching it to an
issue so I can try it?  Going by the way you started your test program, you
should have the hbase patched hadoop first in your CLASSPATH so you should
be ok but maybe there is something about your environmnent frustrating
hbase's using a patched hadoop?

"java.io.IOException: TIMED OUT" Your regionserver or master timed out its
zk session.  GC or swapping or disk used by zk is under heavy i/o loading?

"ClosedChannelException" Probably symptom of a RS shutdown because of events
such as above.

"Abandoning block..." Did this write to HLog fail? Its just an INFO level
log out of DFSClient.

"file system not available" What happened before this?  Was this just an
emission on regionserver shutdown?

St.Ack

Re: HBase Exceptions on version 0.20.1

by elsif :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

stack wrote:

> On Wed, Oct 21, 2009 at 8:16 AM, elsif <elsif.then@...> wrote:
>
>  
>> There are 239 "Block blk_-xxx is not valid errors", 522 "BlockInfo not
>> found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found
>> in the hadoop logs over 12 hours of running the test.
>>
>>    
>
> Above are from application-level (hbase) or datanode logs?  If you trace any
> of the above -- follow the block name -- in the NN are the blocks lost or do
> you see replicas taking over or recoveries triggered?
>
>
>  
The above error counts are from the datanode logs.  It looks like blocks
are being requested after they are deleted (HBASE-1894):

hdfs:2009-10-21 11:30:55,580 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.1.29.12:50010,
storageID=DS-1209147803-10.1.29.12-50010-1651186044136, infoPort=50075,
ipcPort=50020):Transmitted block blk_-6090437519946022724_469173 to
/10.1.29.20:50010

hdfs:2009-10-21 11:30:59,938 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block
blk_-6090437519946022724_469173 file
/mnt/8d8aa987-017c-4f4c-822c-fe53f4163cee/current/subdir18/blk_-6090437519946022724

hdfs:2009-10-21 11:59:09,509 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.1.29.12:50010,
storageID=DS-1209147803-10.1.29.12-50010-1651186044136, infoPort=50075,
ipcPort=50020):Got exception while serving
blk_-6090437519946022724_469173 to /10.1.29.14:
hdfs:java.io.IOException: Block blk_-6090437519946022724_469173 is not
valid.

>> I understand that I am loading the cluster - that is the point of the
>> test, but I don't think that this should result in data loss.  Failed
>> inserts at the client level I can handle, but loss of data that was
>> previously thought to be stored in hbase is a major issue.  Are there
>> plans to make hbase more resilient to load based failures?
>>
>>
>>    
> It looks like there'll be data loss going by a few of the exceptions you
> provide originally.  Here's a couple of comments:
>
> "No live nodes contain current block"  Usually we see this if the
> client-side hadoop has not been patched with hdfs-127/hadoop-4681.  Your
> test program doesn't seem to have come across.  Mind attaching it to an
> issue so I can try it?  Going by the way you started your test program, you
> should have the hbase patched hadoop first in your CLASSPATH so you should
> be ok but maybe there is something about your environmnent frustrating
> hbase's using a patched hadoop?
>
>  
The test program is short and is in-line at the end of this message.  In
this last run, the client was running against the hadoop-0.20.1-core.jar
instead of the hadoop-0.20.1-hdfs127-core.jar.  I will retest using the
hdfs-127 jar.
> "java.io.IOException: TIMED OUT" Your regionserver or master timed out its
> zk session.  GC or swapping or disk used by zk is under heavy i/o loading?
>  
There are five zk servers in the test setup that share hardware with
nodes that are also datanodes and regionservers.  There are only 8
instances of this error and all of them occurred within 30 seconds of
each other on the same node.
> "ClosedChannelException" Probably symptom of a RS shutdown because of events
> such as above.
>
> "Abandoning block..." Did this write to HLog fail? Its just an INFO level
> log out of DFSClient.
>  
Since these are just INFO level messages, they are not an issue?

hbase:2009-10-21 11:40:57,082 INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_-1365065510928989381_470690
hdfs:2009-10-21 11:40:57,080 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-1365065510928989381_470690 src: /10.1.29.23:56117 dest:
/10.1.29.23:50010
hdfs:2009-10-21 11:40:57,081 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-1365065510928989381_470690 received exception
org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access
`/mnt/eac8b34a-a4a4-495b-8241-43cd4b21c28f/current/blk_-8854288022995097022_462986.meta':
No such file or directory

hbase:2009-10-21 11:36:12,609 INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_4473876596208463127_470012
hdfs:2009-10-21 11:35:44,842 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_4473876596208463127_470012 src: /10.1.29.22:57916 dest:
/10.1.29.14:50010
hdfs:2009-10-21 11:35:44,844 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_4473876596208463127_470012 received exception java.io.EOFException
hdfs:2009-10-21 11:36:26,105 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_4473876596208463127_470012 src: /10.1.29.14:34537 dest:
/10.1.29.21:50010
hdfs:2009-10-21 11:36:26,105 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_4473876596208463127_470012 received exception
org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access
`/mnt/e9726517-c84c-4326-8859-44f8261a14fa/current/subdir26/subdir62/blk_4151284369367444478':
No such file or directory
hdfs:2009-10-21 11:36:12,605 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_4473876596208463127_470012 src: /10.1.29.22:34103 dest:
/10.1.29.22:50010
hdfs:2009-10-21 11:36:12,609 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
blk_4473876596208463127_470012 2 Exception java.io.EOFException
hdfs:2009-10-21 11:36:12,609 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
block blk_4473876596208463127_470012 terminating

hbase:2009-10-21 06:47:10,189 INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_-6810931018745175038_422166
hdfs:2009-10-21 06:47:10,171 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-6810931018745175038_422166 src: /10.1.29.12:44933 dest:
/10.1.29.12:50010
hdfs:2009-10-21 06:47:10,188 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-6810931018745175038_422166 received exception java.io.EOFException
hdfs:2009-10-21 06:46:43,362 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-6810931018745175038_422166 src: /10.1.29.12:48908 dest:
/10.1.29.14:50010
hdfs:2009-10-21 06:46:43,362 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-6810931018745175038_422166 received exception
org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access
`/mnt/1a3e57e4-b514-4a85-83b1-ae9d12e3fb7a/current/subdir27/blk_-3769845572103484957_410554.meta':
No such file or directory

> "file system not available" What happened before this?  Was this just an
> emission on regionserver shutdown?
>  
I don't have any "file system not available" messages this time around.
> St.Ack
>
>  
Here is the source for the test program.  For my tests I run two
instances with the default settings:

--

package org.apache.hadoop.hbase.util;

import java.io.IOException;
import java.io.FileInputStream;
import java.lang.Integer;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/* create 'fc_test', {NAME => 'json', VERSIONS => 1}, {NAME => 'data',
VERSIONS =>1} */

public class LoadTest
    extends Thread
{
    protected int datasize = 4000000;

    public TableTest(int datasize)
    {
        this.datasize = datasize;
    }

    public void run()
    {
        try {
            FileInputStream    random = new FileInputStream("/dev/urandom");
            byte[]    key = new byte[32];
            byte[]  data = new byte[datasize];

            HBaseConfiguration hbaseConfiguration = new
HBaseConfiguration();
            HTable table = new HTable(hbaseConfiguration, "fc_test");
            random.read(data, 0, datasize);

            while (true)
            {
                random.read(key, 0, 32);
                Put update = new Put(Bytes.toBytes(getHexString(key)));
                update.add(Bytes.toBytes("data"), Bytes.toBytes(""),
                    data);
                table.put(update);
            }
        } catch (Exception e) {
            System.out.println(e.toString());
        }

    }

    public static String getHexString(byte[] b) throws Exception
    {
          String result = "";
          for (int i=0; i < b.length; i++)
          {
            result +=
                Integer.toString( ( b[i] & 0xff ) + 0x100,
16).substring( 1 );
          }
         return result;
    }

    public static void main(String args[]) throws IOException
    {
        Thread t = null;
        int i;

        int threads = 30;
        int datasize = 400000;

        try {
            threads = Integer.getInteger(args[0]).intValue();
            datasize = Integer.getInteger(args[1]).intValue();
        } catch (Exception e) {
        }

        System.out.println("Starting " + threads + " threads");

        for (i = 0; i< threads; i++)
        {
               t = new TableTest(datasize);
               t.start();
        }
        // wait for last thread
        try {
           t.join();
        } catch (InterruptedException e) {
        }
    }
}


Re: HBase Exceptions on version 0.20.1

by Andrew Purtell-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

> 239 "Block blk_-xxx is not valid errors",
> 522 "BlockInfo not found in volumeMap" errors,
> and 208 "BlockAlreadyExistsException"

I assume since you say they were found in Hadoop logs that these
appeared in the datanode and/or namenode logs. If not, and instead
these are from HBase logs, please correct my understanding. It seems to me that your HDFS is sick. That's not particularly helpful, I know, but HBase is a client application of HDFS and depends on its good functioning. Have you talked with anyone on or mailed the logs to hdfs-user@... ? If so, what did they say?

> Are there plans to make hbase more resilient to load based failures?

Yes, this is definitely something we have done and continue to do. It's hard if you can't trust your filesystem.

Once I tried running HBase on top of KFS instead of HDFS. KFS did seem slower as is the conventional wisdom but I had bigger problems... the chunkservers would randomly abort on my x86_64 nodes and even after I gave Sriram system access for gdb stack dumps, there was no clear resolution. On the other hand, if you get it working, it has working sync and append. HDFS won't have a fully working sync until 0.21. YMMV.

   - Andy





________________________________
From: elsif <elsif.then@...>
To: hbase-user@...
Sent: Wed, October 21, 2009 8:16:40 AM
Subject: Re: HBase Exceptions on version 0.20.1


While running the test on this cluster of 14 servers, the highest loads
I see are 3.68 (0.0% wa) on the master node and 2.65 (3.4% wa) on the
node serving the .META. region.  All the machines are on a single
gigabit switch dedicated to the cluster.  The highest throughput between
nodes has been 21.4MBps Rx on the node hosting the .META. region.

There are 239 "Block blk_-xxx is not valid errors", 522 "BlockInfo not
found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found
in the hadoop logs over 12 hours of running the test.

I understand that I am loading the cluster - that is the point of the
test, but I don't think that this should result in data loss.  Failed
inserts at the client level I can handle, but loss of data that was
previously thought to be stored in hbase is a major issue.  Are there
plans to make hbase more resilient to load based failures?

Regards,
elsif

Andrew Purtell wrote:

> The reason JG points to load as being a problem as all signs point to it: This is usually the culprit behind DFS "no live block" errors -- the namenode is too busy and/or falling behind, or the datanodes are falling behind, or actually failing. Also, in the log snippets you provide, HBase is complaining about writes to DFS (for the WAL) taking in excess of 2 seconds. Also highly indicative of load, write load. Shortly after this, Zookeeper sessions begin expiring, which is also usually indicative of overloading -- heartbeats miss their deadline.
>
> When I see these signs on my test clusters, I/O wait is generally in excess of 40%.
>
> If your total CPU load is really just a few % (user + system + iowait), then I'd suggest you look at the storage layer. Is there anything in the datanode logs that seems like it might be relevant?
>
> What about the network? Gigabit? Any potential sources of contention? Are you tracking network utilization metrics during the test?
>
> Also, you might consider using Ganglia to monitor and correlate system metrics and HBase and HDFS metrics during your testing, if you are not doing this already.
>
>    - Andy
>
>  


     

Re: HBase Exceptions on version 0.20.1

by Andrew Purtell-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Please ignore my earlier response. That does look suspiciously like client error.

Can you correlate the block /mnt/8d8aa987-017c-4f4c-822c-fe53f4163cee/current/subdir18/blk_-6090437519946022724 to a file? And this file to a region? What is in the master log as relates to this region around the time of the block errors?

   - Andy




________________________________
From: elsif <elsif.then@...>
To: hbase-user@...
Sent: Wed, October 21, 2009 2:36:12 PM
Subject: Re: HBase Exceptions on version 0.20.1

stack wrote:

> On Wed, Oct 21, 2009 at 8:16 AM, elsif <elsif.then@...> wrote:
>
>  
>> There are 239 "Block blk_-xxx is not valid errors", 522 "BlockInfo not
>> found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found
>> in the hadoop logs over 12 hours of running the test.
>>
>>    
>
> Above are from application-level (hbase) or datanode logs?  If you trace any
> of the above -- follow the block name -- in the NN are the blocks lost or do
> you see replicas taking over or recoveries triggered?
>
>
>  
The above error counts are from the datanode logs.  It looks like blocks
are being requested after they are deleted (HBASE-1894):

hdfs:2009-10-21 11:30:55,580 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.1.29.12:50010,
storageID=DS-1209147803-10.1.29.12-50010-1651186044136, infoPort=50075,
ipcPort=50020):Transmitted block blk_-6090437519946022724_469173 to
/10.1.29.20:50010

hdfs:2009-10-21 11:30:59,938 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block
blk_-6090437519946022724_469173 file
/mnt/8d8aa987-017c-4f4c-822c-fe53f4163cee/current/subdir18/blk_-6090437519946022724

hdfs:2009-10-21 11:59:09,509 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.1.29.12:50010,
storageID=DS-1209147803-10.1.29.12-50010-1651186044136, infoPort=50075,
ipcPort=50020):Got exception while serving
blk_-6090437519946022724_469173 to /10.1.29.14:
hdfs:java.io.IOException: Block blk_-6090437519946022724_469173 is not
valid.

>> I understand that I am loading the cluster - that is the point of the
>> test, but I don't think that this should result in data loss.  Failed
>> inserts at the client level I can handle, but loss of data that was
>> previously thought to be stored in hbase is a major issue.  Are there
>> plans to make hbase more resilient to load based failures?
>>
>>
>>    
> It looks like there'll be data loss going by a few of the exceptions you
> provide originally.  Here's a couple of comments:
>
> "No live nodes contain current block"  Usually we see this if the
> client-side hadoop has not been patched with hdfs-127/hadoop-4681.  Your
> test program doesn't seem to have come across.  Mind attaching it to an
> issue so I can try it?  Going by the way you started your test program, you
> should have the hbase patched hadoop first in your CLASSPATH so you should
> be ok but maybe there is something about your environmnent frustrating
> hbase's using a patched hadoop?
>
>  
The test program is short and is in-line at the end of this message.  In
this last run, the client was running against the hadoop-0.20.1-core.jar
instead of the hadoop-0.20.1-hdfs127-core.jar.  I will retest using the
hdfs-127 jar.
> "java.io.IOException: TIMED OUT" Your regionserver or master timed out its
> zk session.  GC or swapping or disk used by zk is under heavy i/o loading?
>  
There are five zk servers in the test setup that share hardware with
nodes that are also datanodes and regionservers.  There are only 8
instances of this error and all of them occurred within 30 seconds of
each other on the same node.
> "ClosedChannelException" Probably symptom of a RS shutdown because of events
> such as above.
>
> "Abandoning block..." Did this write to HLog fail? Its just an INFO level
> log out of DFSClient.
>  
Since these are just INFO level messages, they are not an issue?

hbase:2009-10-21 11:40:57,082 INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_-1365065510928989381_470690
hdfs:2009-10-21 11:40:57,080 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-1365065510928989381_470690 src: /10.1.29.23:56117 dest:
/10.1.29.23:50010
hdfs:2009-10-21 11:40:57,081 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-1365065510928989381_470690 received exception
org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access
`/mnt/eac8b34a-a4a4-495b-8241-43cd4b21c28f/current/blk_-8854288022995097022_462986.meta':
No such file or directory

hbase:2009-10-21 11:36:12,609 INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_4473876596208463127_470012
hdfs:2009-10-21 11:35:44,842 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_4473876596208463127_470012 src: /10.1.29.22:57916 dest:
/10.1.29.14:50010
hdfs:2009-10-21 11:35:44,844 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_4473876596208463127_470012 received exception java.io.EOFException
hdfs:2009-10-21 11:36:26,105 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_4473876596208463127_470012 src: /10.1.29.14:34537 dest:
/10.1.29.21:50010
hdfs:2009-10-21 11:36:26,105 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_4473876596208463127_470012 received exception
org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access
`/mnt/e9726517-c84c-4326-8859-44f8261a14fa/current/subdir26/subdir62/blk_4151284369367444478':
No such file or directory
hdfs:2009-10-21 11:36:12,605 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_4473876596208463127_470012 src: /10.1.29.22:34103 dest:
/10.1.29.22:50010
hdfs:2009-10-21 11:36:12,609 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
blk_4473876596208463127_470012 2 Exception java.io.EOFException
hdfs:2009-10-21 11:36:12,609 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
block blk_4473876596208463127_470012 terminating

hbase:2009-10-21 06:47:10,189 INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_-6810931018745175038_422166
hdfs:2009-10-21 06:47:10,171 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-6810931018745175038_422166 src: /10.1.29.12:44933 dest:
/10.1.29.12:50010
hdfs:2009-10-21 06:47:10,188 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-6810931018745175038_422166 received exception java.io.EOFException
hdfs:2009-10-21 06:46:43,362 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-6810931018745175038_422166 src: /10.1.29.12:48908 dest:
/10.1.29.14:50010
hdfs:2009-10-21 06:46:43,362 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-6810931018745175038_422166 received exception
org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access
`/mnt/1a3e57e4-b514-4a85-83b1-ae9d12e3fb7a/current/subdir27/blk_-3769845572103484957_410554.meta':
No such file or directory

> "file system not available" What happened before this?  Was this just an
> emission on regionserver shutdown?
>  
I don't have any "file system not available" messages this time around.
> St.Ack
>
>  
Here is the source for the test program.  For my tests I run two
instances with the default settings:

--

package org.apache.hadoop.hbase.util;

import java.io.IOException;
import java.io.FileInputStream;
import java.lang.Integer;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/* create 'fc_test', {NAME => 'json', VERSIONS => 1}, {NAME => 'data',
VERSIONS =>1} */

public class LoadTest
    extends Thread
{
    protected int datasize = 4000000;

    public TableTest(int datasize)
    {
        this.datasize = datasize;
    }

    public void run()
    {
        try {
            FileInputStream    random = new FileInputStream("/dev/urandom");
            byte[]    key = new byte[32];
            byte[]  data = new byte[datasize];

            HBaseConfiguration hbaseConfiguration = new
HBaseConfiguration();
            HTable table = new HTable(hbaseConfiguration, "fc_test");
            random.read(data, 0, datasize);

            while (true)
            {
                random.read(key, 0, 32);
                Put update = new Put(Bytes.toBytes(getHexString(key)));
                update.add(Bytes.toBytes("data"), Bytes.toBytes(""),
                    data);
                table.put(update);
            }
        } catch (Exception e) {
            System.out.println(e.toString());
        }

    }

    public static String getHexString(byte[] b) throws Exception
    {
          String result = "";
          for (int i=0; i < b.length; i++)
          {
            result +=
                Integer.toString( ( b[i] & 0xff ) + 0x100,
16).substring( 1 );
          }
         return result;
    }

    public static void main(String args[]) throws IOException
    {
        Thread t = null;
        int i;

        int threads = 30;
        int datasize = 400000;

        try {
            threads = Integer.getInteger(args[0]).intValue();
            datasize = Integer.getInteger(args[1]).intValue();
        } catch (Exception e) {
        }

        System.out.println("Starting " + threads + " threads");

        for (i = 0; i< threads; i++)
        {
               t = new TableTest(datasize);
               t.start();
        }
        // wait for last thread
        try {
           t.join();
        } catch (InterruptedException e) {
        }
    }
}


     

Re: HBase Exceptions on version 0.20.1

by elsif :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Looks like the file was part of a compaction:

2009-10-21 11:58:38,434 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on
region
fc_test,d46a332f60405431f0206d47143044676f1ff15239f24cb72c9743145d15f3e3,1256134371826
2009-10-21 11:58:38,442 DEBUG
org.apache.hadoop.hbase.regionserver.Store: Skipped compaction of 1
file; compaction size of data: 173.6m; Skipped 3 files, size: 176812375
2009-10-21 11:58:38,451 INFO
org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
region
fc_test,d46a332f60405431f0206d47143044676f1ff15239f24cb72c9743145d15f3e3,1256134371826
in 0sec
2009-10-21 11:58:38,451 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on
region
fc_test,751960d5f43b5e53e806a89afbbd0b819f30de7369db613413022d99a77b54b8,1256135307641
2009-10-21 11:58:38,459 DEBUG
org.apache.hadoop.hbase.regionserver.Store: Compaction size of data:
170.2m; Skipped 1 file(s), size: 131805190
2009-10-21 11:58:38,459 DEBUG
org.apache.hadoop.hbase.regionserver.Store: Started compaction of 3
file(s)  into /hbase/fc_test/compaction.dir/909519914, seqid=113712
2009-10-21 11:58:38,913 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_-6090437519946022724_469173 from any node:
java.io.IOException: No live nodes contain current block

I don't see anything in the master log relating to this:

2009-10-21 11:58:31,254 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 1406 row(s) of meta region {server:
10.1.29.20:60020,
 regionname: .META.,,1, startKey: <>}
complete                                
2009-10-21 11:58:31,254 INFO org.apache.hadoop.hbase.master.BaseScanner:
All 1 .
META. region(s) scanned  
2009-10-21 11:59:26,548 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server:
10.1.29.20:60020, regionname: .META.,,1, startKey: <>}


Andrew Purtell wrote:
> Please ignore my earlier response. That does look suspiciously like client error.
>
> Can you correlate the block /mnt/8d8aa987-017c-4f4c-822c-fe53f4163cee/current/subdir18/blk_-6090437519946022724 to a file? And this file to a region? What is in the master log as relates to this region around the time of the block errors?
>
>    - Andy
>
>  


Re: HBase Exceptions on version 0.20.1

by Andrew Purtell-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Thanks.

Can you grep your master log for all lines including
fc_test,751960d5f43b5e53e806a89afbbd0b819f30de7369db613413022d99a77b54b8,1256135307641 ?

Best regards,

   - Andy




________________________________
From: elsif <elsif.then@...>
To: hbase-user@...
Sent: Fri, October 23, 2009 12:45:23 PM
Subject: Re: HBase Exceptions on version 0.20.1

Looks like the file was part of a compaction:

2009-10-21 11:58:38,434 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on
region
fc_test,d46a332f60405431f0206d47143044676f1ff15239f24cb72c9743145d15f3e3,1256134371826
2009-10-21 11:58:38,442 DEBUG
org.apache.hadoop.hbase.regionserver.Store: Skipped compaction of 1
file; compaction size of data: 173.6m; Skipped 3 files, size: 176812375
2009-10-21 11:58:38,451 INFO
org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
region
fc_test,d46a332f60405431f0206d47143044676f1ff15239f24cb72c9743145d15f3e3,1256134371826
in 0sec
2009-10-21 11:58:38,451 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on
region
fc_test,751960d5f43b5e53e806a89afbbd0b819f30de7369db613413022d99a77b54b8,1256135307641
2009-10-21 11:58:38,459 DEBUG
org.apache.hadoop.hbase.regionserver.Store: Compaction size of data:
170.2m; Skipped 1 file(s), size: 131805190
2009-10-21 11:58:38,459 DEBUG
org.apache.hadoop.hbase.regionserver.Store: Started compaction of 3
file(s)  into /hbase/fc_test/compaction.dir/909519914, seqid=113712
2009-10-21 11:58:38,913 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_-6090437519946022724_469173 from any node:
java.io.IOException: No live nodes contain current block

I don't see anything in the master log relating to this:

2009-10-21 11:58:31,254 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 1406 row(s) of meta region {server:
10.1.29.20:60020,
regionname: .META.,,1, startKey: <>}
complete                                
2009-10-21 11:58:31,254 INFO org.apache.hadoop.hbase.master.BaseScanner:
All 1 .
META. region(s) scanned  
2009-10-21 11:59:26,548 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server:
10.1.29.20:60020, regionname: .META.,,1, startKey: <>}


Andrew Purtell wrote:
> Please ignore my earlier response. That does look suspiciously like client error.
>
> Can you correlate the block /mnt/8d8aa987-017c-4f4c-822c-fe53f4163cee/current/subdir18/blk_-6090437519946022724 to a file? And this file to a region? What is in the master log as relates to this region around the time of the block errors?
>
>    - Andy
>
>  


     

Re: HBase Exceptions on version 0.20.1

by stack-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I ran your program elsif against our cluster.  It ran for about two days,
loaded up 1500 regions into 3-node cluster and then each of the RS's
variously timed out against zookeeper and shut themselves down.   I'll take
a look at why the session timeout.   The upload is pretty slow.  My guess is
that the client with its 30 threads is contending over the map of region
locations (Ryan noted the last day that all writes are blocked while the
search for new region location is being done).  I can take a look into this
too.

St.Ack


On Tue, Oct 20, 2009 at 2:12 PM, elsif <elsif.then@...> wrote:

>
> Hello Everyone,
>
> We are experiencing many HBase exceptions while running the attached
> test program using
> HBase 0.20.1 and Hadoop 0.20.1 (r810220) releases on a Hadoop cluster
> with 14 data nodes.
>
> The HBase operational performance degrades quickly once certain error
> conditions are
> encountered especially if there are issues with the META region.
>
> Has anyone else seen these exceptions listed below?
>
> We ran the test with 3 region servers and then with 14 region servers.
> In both cases, we still got exceptions.
>
> In each test run, we shutdown HBase, removed the main /hbase
> folder, restarted HBase and then ran the test program.
> The test program uses a table named "fc_test". The program is started by
> the following command:
>
> /opt/hbase/bin/hbase org.apache.hadoop.hbase.util.TableTest  30  4000000
>
> The attached HBase test program starts 30 threads. Each thread loops,
> inserting randomly generated keys
> along with a fixed data pattern of 4,000,000 bytes.
>
> After running the test for 2-12 hours, the table scanning of "fc_test"
> is broken and only a limited number of put
> operations are still succeeding.
>
>
>
> EXCEPTION LIST
>
> ------------------------------
> 2009-10-15 03:28:39,005 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: Major compaction triggered
> on store data; time since last major compaction 89677908ms
> 2009-10-15 03:28:39,012 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: Started compaction of 4
> file(s)  into /hbase/fc_test/compaction.dir/1073627844, seqid=396374
> 2009-10-15 03:28:39,245 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-6054029382496843671_324379 from any node:
> java.io.IOException: No live nodes contain current block
> ------------------------------
> 2009-10-15 07:41:43,426 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x2244f6194a5000d after 1ms
> 2009-10-15 07:41:53,993 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x2244f6194a5000d to sun.nio.ch.SelectionKeyImpl@139f5ca
> <mailto:sun.nio.ch.SelectionKeyImpl@139f5ca>
> java.io.IOException: TIMED OUT
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2009-10-15 07:41:54,344 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2009-10-15 07:41:55,084 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server hfs-029014/10.1.29.14:2181
> 2009-10-15 07:41:55,085 INFO org.apache.zookeeper.ClientCnxn: Priming
> connection to java.nio.channels.SocketChannel[connected
> local=/10.1.29.19:39421 remote=hfs-029014/10.1.29.14:2181]
> ------------------------------
> 2009-10-15 08:37:40,982 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 6 on 60020 caught: java.nio.channels.ClosedChannelException
>    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
>    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>    at
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>    at
>
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>    at
>
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>    at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
> ------------------------------
> 2009-10-15 07:57:28,025 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=4.652237MB (4878224), Free=987.6603MB (1035636848), Max=992.3125MB
> (1040515072), Counts: Blocks=0, Access=51003, Hit=0, Miss=51003,
> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%,
> Evicted/Run=NaN
> 2009-10-15 07:57:28,189 INFO org.apache.hadoop.hdfs.DFSClient: Exception
> in createBlockOutputStream java.io.EOFException
> 2009-10-15 07:57:28,190 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning block blk_7166000434569842394_332632
> 2009-10-15 07:57:28,306 WARN org.apache.hadoop.hbase.regionserver.HLog:
> IPC Server handler 5 on 60020 took 2297ms appending an edit to hlog;
> editcount=146
> -----------------------------
> 2009-10-15 09:10:00,559 INFO org.apache.zookeeper.ClientCnxn: Exception
> while closing send thread for session 0x1244f6194780008 : Read error rc
> = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
> Disconnecting ClientCnxn for session: 0x1244f6194780008
> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> ------------------------------
> 2009-10-13 12:30:58,207 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to read
> master address from ZooKeeper. Retrying. Error was:
> java.io.IOException:
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> NoNode for /hbase/master
>    at
>
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:331)
>    at
>
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readMasterAddressOrThrow(ZooKeeperWrapper.java:240)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1338)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1370)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:427)
>    at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
> KeeperErrorCode = NoNode for /hbase/master
>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:892)
>    at
>
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:327)
> ------------------------------
> 2009-10-14 15:12:42,664 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>    at java.lang.reflect.Constructor.newInstance(Unknown Source)
>    at
>
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>    at
>
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>    at
>
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>    at java.lang.Thread.run(Unknown Source)
> ------------------------------
> 2009-10-14 15:12:43,566 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x1244f619478000d to sun.nio.ch.SelectionKeyImpl@15e32c4
> <mailto:sun.nio.ch.SelectionKeyImpl@15e32c4>
> java.io.IOException: Session Expired
>    at
>
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2009-10-14 15:12:43,567 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Expired, type: None, path: null
> 2009-10-14 15:12:43,568 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
> expired
> ------------------------------
> 2009-10-14 15:12:51,563 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> java.io.IOException: Cannot append; log is closed
>    at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:584)
>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1445)
>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1244)
>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1831)
>    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>    at java.lang.reflect.Method.invoke(Unknown Source)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>    at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> ------------------------------
> 2009-10-14 15:13:51,852 INFO org.apache.hadoop.fs.FileSystem: Could not
> cancel cleanup thread, though no FileSystems are open
> 2009-10-14 15:13:51,864 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
> HRegionServer: file system not available
> java.io.IOException: File system is not available
>    at
>
> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:125)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkFileSystem(HRegionServer.java:901)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:849)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:832)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1839)
>    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>    at java.lang.reflect.Method.invoke(Unknown Source)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>    at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> Caused by: java.io.IOException: Call to hfs-029010/10.1.29.10:8020
> failed on local exception: java.nio.channels.ClosedByInterruptException
>    at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
>    at org.apache.hadoop.ipc.Client.call(Client.java:742)
>    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>    at $Proxy1.getFileInfo(Unknown Source)
>    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>    at java.lang.reflect.Method.invoke(Unknown Source)
>    at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>    at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>    at $Proxy1.getFileInfo(Unknown Source)
>    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
>    at
>
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
>    at
>
> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:114)
>    ... 9 more
> Caused by: java.nio.channels.ClosedByInterruptException
>    at java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown
> Source)
>    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>    at
>
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>    at
>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>    at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>    at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>    at java.io.BufferedOutputStream.flush(Unknown Source)
>    at java.io.DataOutputStream.flush(Unknown Source)
>    at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:480)
>    at org.apache.hadoop.ipc.Client.call(Client.java:720)
> ------------------------------
>
> PREVIOUS EXCEPTIONS FROM OTHER TESTS INCLUDE
>
> 2009-10-13 00:00:00,508 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row
> out of range for ...
>    ....
>    at
> org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1527)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1559)
>    at
> org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1627)
>    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2277)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1768)
>    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>    at java.lang.reflect.Method.invoke(Unknown Source)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>    at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> ------------------------------
> 2009-10-12 10:06:57,809 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 7 on 60020, call put([B@1f5b92a,
> [Lorg.apache.hadoop.hbase.client.Put;@14f8e05) from 10.1.29.10:36017:
> error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>  at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>   at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>  at java.lang.reflect.Method.invoke(Unknown Source)
>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>  at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> 2009-10-12 10:06:57,812 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=12, stores=24, storefiles=15, storefileIndexSize=8,
> memstoreSize=378, usedHeap=441, maxHeap=992, blockCacheSize=4878224,
> blockCacheFree=1035636848, blockCacheCount=0, blockCacheHitRatio=0
> 2009-10-12 10:06:57,813 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 3 on 60020, call put([B@1f52982,
> [Lorg.apache.hadoop.hbase.client.Put;@78283f) from 10.1.29.10:36017:
> error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>   at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>  at java.lang.reflect.Method.invoke(Unknown Source)
>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>  at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> ------------------------------
>
>
>
>
>
> .
>
>

Re: HBase Exceptions on version 0.20.1

by stack-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

This second run has been going for > 24 hours now.  Its loaded up near 2k
regions.  I'm using defaults from your script which is 30 threads and 400k
cells.   For 400k cells, I should probably be making adjustments upping the
flush and region size.  I'm also noticing that all load seems to land on a
single regionserver.  Did you notice that?  Maybe I should use something
other than java Random?  Maybe I should md5 the Random output?

But I just noticed that you ran with 4M cells.  4M cells are pretty big.
HBase defaults are for cells much smaller.

So I killed the above test and am now trying w/ 4M cells.  I see that we're
flushing every 16 or so edits.  We're spending loads of our time just
flushing and rolling WAL logs because the cells are all so large.  Are 4M
cells your use case?  If so, I'll try and get you tunings that'll work
better for these cell sizes.

I've been watching ZooKeeper over last 24 hours.  The average latency
serving zk response is about 5ms.  The longest response is 120ms (I see
these stats by sending the quorum members the 'stat' command as in "echo
stat|nc HOST PORT").  This time around I gave zk its own disk.  My guess is
that when I saw the TIMED OUT yesterday, it was probably zk client-side
issue.. a long GC or so.  I have GC logging running at the moment so will be
able to correlate should it happen again.

As for things slowing down, this third time loading I added in ganglia
logging.  Will let you know.

As to the exceptions you were seeing, here's some more on those:

On the "INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block...", I
see those in my testing.  The logs are at INFO level.  They should probably
be WARN-level instead.  See the code below from DFSClient.  The map of nodes
carrying block replicas can shift while DFSClient is open (Looking at one
particular block from uploading I see that the local block was removed from
local datanode because the block was over-replicated).  When we try to get
the local block, we can fail.  Its logged as "No live nodes contain current
block".  The current DFSClient map doesn't contain a node for the block.  As
you can see in the code, we'll then ask NameNode for new map of blocks for
this file.  As long as we get the block before cycle > failures, we'll see
INFO-level loggings (For some reason I see 3 loggings often before we'll
succeed and get the block).  Its a failure if DFSClient throws an IOE out of
here.  I'm just seeing the INFO loggings in my testing.  On this second test
I added logging to DFSClient to see if I can figure why it takes 3 goes
often to find the right block.

    private DNAddrPair chooseDataNode(LocatedBlock block)
      throws IOException {
      while (true) {
        DatanodeInfo[] nodes = block.getLocations();
        try {
          DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
          InetSocketAddress targetAddr =
                            NetUtils.createSocketAddr(chosenNode.getName());
          return new DNAddrPair(chosenNode, targetAddr);
        } catch (IOException ie) {
          String blockInfo = block.getBlock() + " file=" + src;
          if (failures >= maxBlockAcquireFailures) {
            throw new IOException("Could not obtain block: " + blockInfo);
          }

          if (nodes == null || nodes.length == 0) {
            LOG.info("No node available for block: " + blockInfo);
          }
          LOG.info("Could not obtain block " + block.getBlock() + " from any
node:  " + ie);
          try {
            Thread.sleep(3000);
          } catch (InterruptedException iex) {
          }
          deadNodes.clear(); //2nd option is to remove only nodes[blockId]
          openInfo();
          failures++;
          continue;
        }
      }

St.Ack


On Tue, Oct 20, 2009 at 2:12 PM, elsif <elsif.then@...> wrote:

>
> Hello Everyone,
>
> We are experiencing many HBase exceptions while running the attached
> test program using
> HBase 0.20.1 and Hadoop 0.20.1 (r810220) releases on a Hadoop cluster
> with 14 data nodes.
>
> The HBase operational performance degrades quickly once certain error
> conditions are
> encountered especially if there are issues with the META region.
>
> Has anyone else seen these exceptions listed below?
>
> We ran the test with 3 region servers and then with 14 region servers.
> In both cases, we still got exceptions.
>
> In each test run, we shutdown HBase, removed the main /hbase
> folder, restarted HBase and then ran the test program.
> The test program uses a table named "fc_test". The program is started by
> the following command:
>
> /opt/hbase/bin/hbase org.apache.hadoop.hbase.util.TableTest  30  4000000
>
> The attached HBase test program starts 30 threads. Each thread loops,
> inserting randomly generated keys
> along with a fixed data pattern of 4,000,000 bytes.
>
> After running the test for 2-12 hours, the table scanning of "fc_test"
> is broken and only a limited number of put
> operations are still succeeding.
>
>
>
> EXCEPTION LIST
>
> ------------------------------
> 2009-10-15 03:28:39,005 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: Major compaction triggered
> on store data; time since last major compaction 89677908ms
> 2009-10-15 03:28:39,012 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: Started compaction of 4
> file(s)  into /hbase/fc_test/compaction.dir/1073627844, seqid=396374
> 2009-10-15 03:28:39,245 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> obtain block blk_-6054029382496843671_324379 from any node:
> java.io.IOException: No live nodes contain current block
> ------------------------------
> 2009-10-15 07:41:43,426 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x2244f6194a5000d after 1ms
> 2009-10-15 07:41:53,993 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x2244f6194a5000d to sun.nio.ch.SelectionKeyImpl@139f5ca
> <mailto:sun.nio.ch.SelectionKeyImpl@139f5ca>
> java.io.IOException: TIMED OUT
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2009-10-15 07:41:54,344 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2009-10-15 07:41:55,084 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server hfs-029014/10.1.29.14:2181
> 2009-10-15 07:41:55,085 INFO org.apache.zookeeper.ClientCnxn: Priming
> connection to java.nio.channels.SocketChannel[connected
> local=/10.1.29.19:39421 remote=hfs-029014/10.1.29.14:2181]
> ------------------------------
> 2009-10-15 08:37:40,982 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 6 on 60020 caught: java.nio.channels.ClosedChannelException
>    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
>    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>    at
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>    at
>
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>    at
>
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>    at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
> ------------------------------
> 2009-10-15 07:57:28,025 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=4.652237MB (4878224), Free=987.6603MB (1035636848), Max=992.3125MB
> (1040515072), Counts: Blocks=0, Access=51003, Hit=0, Miss=51003,
> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%,
> Evicted/Run=NaN
> 2009-10-15 07:57:28,189 INFO org.apache.hadoop.hdfs.DFSClient: Exception
> in createBlockOutputStream java.io.EOFException
> 2009-10-15 07:57:28,190 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning block blk_7166000434569842394_332632
> 2009-10-15 07:57:28,306 WARN org.apache.hadoop.hbase.regionserver.HLog:
> IPC Server handler 5 on 60020 took 2297ms appending an edit to hlog;
> editcount=146
> -----------------------------
> 2009-10-15 09:10:00,559 INFO org.apache.zookeeper.ClientCnxn: Exception
> while closing send thread for session 0x1244f6194780008 : Read error rc
> = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
> Disconnecting ClientCnxn for session: 0x1244f6194780008
> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> ------------------------------
> 2009-10-13 12:30:58,207 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to read
> master address from ZooKeeper. Retrying. Error was:
> java.io.IOException:
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> NoNode for /hbase/master
>    at
>
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:331)
>    at
>
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readMasterAddressOrThrow(ZooKeeperWrapper.java:240)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1338)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1370)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:427)
>    at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
> KeeperErrorCode = NoNode for /hbase/master
>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:892)
>    at
>
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:327)
> ------------------------------
> 2009-10-14 15:12:42,664 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source)
>    at java.lang.reflect.Constructor.newInstance(Unknown Source)
>    at
>
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>    at
>
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>    at
>
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>    at java.lang.Thread.run(Unknown Source)
> ------------------------------
> 2009-10-14 15:12:43,566 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x1244f619478000d to sun.nio.ch.SelectionKeyImpl@15e32c4
> <mailto:sun.nio.ch.SelectionKeyImpl@15e32c4>
> java.io.IOException: Session Expired
>    at
>
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2009-10-14 15:12:43,567 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Expired, type: None, path: null
> 2009-10-14 15:12:43,568 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
> expired
> ------------------------------
> 2009-10-14 15:12:51,563 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> java.io.IOException: Cannot append; log is closed
>    at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:584)
>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1445)
>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1244)
>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1831)
>    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>    at java.lang.reflect.Method.invoke(Unknown Source)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>    at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> ------------------------------
> 2009-10-14 15:13:51,852 INFO org.apache.hadoop.fs.FileSystem: Could not
> cancel cleanup thread, though no FileSystems are open
> 2009-10-14 15:13:51,864 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
> HRegionServer: file system not available
> java.io.IOException: File system is not available
>    at
>
> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:125)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkFileSystem(HRegionServer.java:901)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:849)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:832)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1839)
>    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>    at java.lang.reflect.Method.invoke(Unknown Source)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>    at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> Caused by: java.io.IOException: Call to hfs-029010/10.1.29.10:8020
> failed on local exception: java.nio.channels.ClosedByInterruptException
>    at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
>    at org.apache.hadoop.ipc.Client.call(Client.java:742)
>    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>    at $Proxy1.getFileInfo(Unknown Source)
>    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>    at java.lang.reflect.Method.invoke(Unknown Source)
>    at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>    at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>    at $Proxy1.getFileInfo(Unknown Source)
>    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
>    at
>
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
>    at
>
> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:114)
>    ... 9 more
> Caused by: java.nio.channels.ClosedByInterruptException
>    at java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown
> Source)
>    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>    at
>
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>    at
>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>    at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>    at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>    at java.io.BufferedOutputStream.flush(Unknown Source)
>    at java.io.DataOutputStream.flush(Unknown Source)
>    at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:480)
>    at org.apache.hadoop.ipc.Client.call(Client.java:720)
> ------------------------------
>
> PREVIOUS EXCEPTIONS FROM OTHER TESTS INCLUDE
>
> 2009-10-13 00:00:00,508 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row
> out of range for ...
>    ....
>    at
> org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1527)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1559)
>    at
> org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1627)
>    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2277)
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1768)
>    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>    at java.lang.reflect.Method.invoke(Unknown Source)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>    at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> ------------------------------
> 2009-10-12 10:06:57,809 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 7 on 60020, call put([B@1f5b92a,
> [Lorg.apache.hadoop.hbase.client.Put;@14f8e05) from 10.1.29.10:36017:
> error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>  at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>   at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>  at java.lang.reflect.Method.invoke(Unknown Source)
>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>  at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> 2009-10-12 10:06:57,812 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=12, stores=24, storefiles=15, storefileIndexSize=8,
> memstoreSize=378, usedHeap=441, maxHeap=992, blockCacheSize=4878224,
> blockCacheFree=1035636848, blockCacheCount=0, blockCacheHitRatio=0
> 2009-10-12 10:06:57,813 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 3 on 60020, call put([B@1f52982,
> [Lorg.apache.hadoop.hbase.client.Put;@78283f) from 10.1.29.10:36017:
> error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>    at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>   at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>  at java.lang.reflect.Method.invoke(Unknown Source)
>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>  at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> ------------------------------
>
>
>
>
>
> .
>
>

Re: HBase Exceptions on version 0.20.1

by elsif :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On my last set of tests I used two instances of the test program - 60
threads total generating rows containing 400kB of data.  The highest
load appeared to be on the region serving meta.

On my cluster things started to break down at about 20,000 regions with
a heap size of 6MB.  You can always lower your heap size to generate
issues quicker.

Thanks,
elsif

stack wrote:

> This second run has been going for > 24 hours now.  Its loaded up near 2k
> regions.  I'm using defaults from your script which is 30 threads and 400k
> cells.   For 400k cells, I should probably be making adjustments upping the
> flush and region size.  I'm also noticing that all load seems to land on a
> single regionserver.  Did you notice that?  Maybe I should use something
> other than java Random?  Maybe I should md5 the Random output?
>
> But I just noticed that you ran with 4M cells.  4M cells are pretty big.
> HBase defaults are for cells much smaller.
>
> So I killed the above test and am now trying w/ 4M cells.  I see that we're
> flushing every 16 or so edits.  We're spending loads of our time just
> flushing and rolling WAL logs because the cells are all so large.  Are 4M
> cells your use case?  If so, I'll try and get you tunings that'll work
> better for these cell sizes.
>
> I've been watching ZooKeeper over last 24 hours.  The average latency
> serving zk response is about 5ms.  The longest response is 120ms (I see
> these stats by sending the quorum members the 'stat' command as in "echo
> stat|nc HOST PORT").  This time around I gave zk its own disk.  My guess is
> that when I saw the TIMED OUT yesterday, it was probably zk client-side
> issue.. a long GC or so.  I have GC logging running at the moment so will be
> able to correlate should it happen again.
>
> As for things slowing down, this third time loading I added in ganglia
> logging.  Will let you know.
>
> As to the exceptions you were seeing, here's some more on those:
>
> On the "INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block...", I
> see those in my testing.  The logs are at INFO level.  They should probably
> be WARN-level instead.  See the code below from DFSClient.  The map of nodes
> carrying block replicas can shift while DFSClient is open (Looking at one
> particular block from uploading I see that the local block was removed from
> local datanode because the block was over-replicated).  When we try to get
> the local block, we can fail.  Its logged as "No live nodes contain current
> block".  The current DFSClient map doesn't contain a node for the block.  As
> you can see in the code, we'll then ask NameNode for new map of blocks for
> this file.  As long as we get the block before cycle > failures, we'll see
> INFO-level loggings (For some reason I see 3 loggings often before we'll
> succeed and get the block).  Its a failure if DFSClient throws an IOE out of
> here.  I'm just seeing the INFO loggings in my testing.  On this second test
> I added logging to DFSClient to see if I can figure why it takes 3 goes
> often to find the right block.
>
>     private DNAddrPair chooseDataNode(LocatedBlock block)
>       throws IOException {
>       while (true) {
>         DatanodeInfo[] nodes = block.getLocations();
>         try {
>           DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
>           InetSocketAddress targetAddr =
>                             NetUtils.createSocketAddr(chosenNode.getName());
>           return new DNAddrPair(chosenNode, targetAddr);
>         } catch (IOException ie) {
>           String blockInfo = block.getBlock() + " file=" + src;
>           if (failures >= maxBlockAcquireFailures) {
>             throw new IOException("Could not obtain block: " + blockInfo);
>           }
>
>           if (nodes == null || nodes.length == 0) {
>             LOG.info("No node available for block: " + blockInfo);
>           }
>           LOG.info("Could not obtain block " + block.getBlock() + " from any
> node:  " + ie);
>           try {
>             Thread.sleep(3000);
>           } catch (InterruptedException iex) {
>           }
>           deadNodes.clear(); //2nd option is to remove only nodes[blockId]
>           openInfo();
>           failures++;
>           continue;
>         }
>       }
>
> St.Ack
>
>
> On Tue, Oct 20, 2009 at 2:12 PM, elsif <elsif.then@...> wrote:
>
>  
>> Hello Everyone,
>>
>> We are experiencing many HBase exceptions while running the attached
>> test program using
>> HBase 0.20.1 and Hadoop 0.20.1 (r810220) releases on a Hadoop cluster
>> with 14 data nodes.
>>
>> The HBase operational performance degrades quickly once certain error
>> conditions are
>> encountered especially if there are issues with the META region.
>>
>> Has anyone else seen these exceptions listed below?
>>
>> We ran the test with 3 region servers and then with 14 region servers.
>> In both cases, we still got exceptions.
>>
>> In each test run, we shutdown HBase, removed the main /hbase
>> folder, restarted HBase and then ran the test program.
>> The test program uses a table named "fc_test". The program is started by
>> the following command:
>>
>> /opt/hbase/bin/hbase org.apache.hadoop.hbase.util.TableTest  30  4000000
>>
>> The attached HBase test program starts 30 threads. Each thread loops,
>> inserting randomly generated keys
>> along with a fixed data pattern of 4,000,000 bytes.
>>
>> After running the test for 2-12 hours, the table scanning of "fc_test"
>> is broken and only a limited number of put
>> operations are still succeeding.
>>
>>
>>
>> EXCEPTION LIST
>>
>> ------------------------------
>> 2009-10-15 03:28:39,005 DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: Major compaction triggered
>> on store data; time since last major compaction 89677908ms
>> 2009-10-15 03:28:39,012 DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: Started compaction of 4
>> file(s)  into /hbase/fc_test/compaction.dir/1073627844, seqid=396374
>> 2009-10-15 03:28:39,245 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> obtain block blk_-6054029382496843671_324379 from any node:
>> java.io.IOException: No live nodes contain current block
>> ------------------------------
>> 2009-10-15 07:41:43,426 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> response for sessionid:0x2244f6194a5000d after 1ms
>> 2009-10-15 07:41:53,993 WARN org.apache.zookeeper.ClientCnxn: Exception
>> closing session 0x2244f6194a5000d to sun.nio.ch.SelectionKeyImpl@139f5ca
>> <mailto:sun.nio.ch.SelectionKeyImpl@139f5ca>
>> java.io.IOException: TIMED OUT
>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> 2009-10-15 07:41:54,344 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>> state: Disconnected, type: None, path: null
>> 2009-10-15 07:41:55,084 INFO org.apache.zookeeper.ClientCnxn: Attempting
>> connection to server hfs-029014/10.1.29.14:2181
>> 2009-10-15 07:41:55,085 INFO org.apache.zookeeper.ClientCnxn: Priming
>> connection to java.nio.channels.SocketChannel[connected
>> local=/10.1.29.19:39421 remote=hfs-029014/10.1.29.14:2181]
>> ------------------------------
>> 2009-10-15 08:37:40,982 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server handler 6 on 60020 caught: java.nio.channels.ClosedChannelException
>>    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
>>    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>>    at
>>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>>    at
>>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>> ------------------------------
>> 2009-10-15 07:57:28,025 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>> Total=4.652237MB (4878224), Free=987.6603MB (1035636848), Max=992.3125MB
>> (1040515072), Counts: Blocks=0, Access=51003, Hit=0, Miss=51003,
>> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%,
>> Evicted/Run=NaN
>> 2009-10-15 07:57:28,189 INFO org.apache.hadoop.hdfs.DFSClient: Exception
>> in createBlockOutputStream java.io.EOFException
>> 2009-10-15 07:57:28,190 INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_7166000434569842394_332632
>> 2009-10-15 07:57:28,306 WARN org.apache.hadoop.hbase.regionserver.HLog:
>> IPC Server handler 5 on 60020 took 2297ms appending an edit to hlog;
>> editcount=146
>> -----------------------------
>> 2009-10-15 09:10:00,559 INFO org.apache.zookeeper.ClientCnxn: Exception
>> while closing send thread for session 0x1244f6194780008 : Read error rc
>> = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
>> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
>> Disconnecting ClientCnxn for session: 0x1244f6194780008
>> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
>> EventThread shut down
>> ------------------------------
>> 2009-10-13 12:30:58,207 WARN
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to read
>> master address from ZooKeeper. Retrying. Error was:
>> java.io.IOException:
>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>> NoNode for /hbase/master
>>    at
>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:331)
>>    at
>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readMasterAddressOrThrow(ZooKeeperWrapper.java:240)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1338)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1370)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:427)
>>    at java.lang.Thread.run(Unknown Source)
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for /hbase/master
>>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:892)
>>    at
>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:327)
>> ------------------------------
>> 2009-10-14 15:12:42,664 WARN
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
>> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>>    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>>    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
>> Source)
>>    at java.lang.reflect.Constructor.newInstance(Unknown Source)
>>    at
>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>>    at
>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>>    at
>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>>    at java.lang.Thread.run(Unknown Source)
>> ------------------------------
>> 2009-10-14 15:12:43,566 WARN org.apache.zookeeper.ClientCnxn: Exception
>> closing session 0x1244f619478000d to sun.nio.ch.SelectionKeyImpl@15e32c4
>> <mailto:sun.nio.ch.SelectionKeyImpl@15e32c4>
>> java.io.IOException: Session Expired
>>    at
>>
>> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>>    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>>    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>> 2009-10-14 15:12:43,567 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>> state: Expired, type: None, path: null
>> 2009-10-14 15:12:43,568 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
>> expired
>> ------------------------------
>> 2009-10-14 15:12:51,563 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> java.io.IOException: Cannot append; log is closed
>>    at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:584)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1445)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1244)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1831)
>>    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> ------------------------------
>> 2009-10-14 15:13:51,852 INFO org.apache.hadoop.fs.FileSystem: Could not
>> cancel cleanup thread, though no FileSystems are open
>> 2009-10-14 15:13:51,864 FATAL
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
>> HRegionServer: file system not available
>> java.io.IOException: File system is not available
>>    at
>>
>> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:125)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkFileSystem(HRegionServer.java:901)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:849)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:832)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1839)
>>    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> Caused by: java.io.IOException: Call to hfs-029010/10.1.29.10:8020
>> failed on local exception: java.nio.channels.ClosedByInterruptException
>>    at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
>>    at org.apache.hadoop.ipc.Client.call(Client.java:742)
>>    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>>    at $Proxy1.getFileInfo(Unknown Source)
>>    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at
>>
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>>    at
>>
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>>    at $Proxy1.getFileInfo(Unknown Source)
>>    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
>>    at
>>
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>>    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
>>    at
>>
>> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:114)
>>    ... 9 more
>> Caused by: java.nio.channels.ClosedByInterruptException
>>    at java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown
>> Source)
>>    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>>    at
>>
>> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>>    at
>>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>    at
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>    at
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>>    at java.io.BufferedOutputStream.flush(Unknown Source)
>>    at java.io.DataOutputStream.flush(Unknown Source)
>>    at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:480)
>>    at org.apache.hadoop.ipc.Client.call(Client.java:720)
>> ------------------------------
>>
>> PREVIOUS EXCEPTIONS FROM OTHER TESTS INCLUDE
>>
>> 2009-10-13 00:00:00,508 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row
>> out of range for ...
>>    ....
>>    at
>> org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1527)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1559)
>>    at
>> org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1627)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2277)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1768)
>>    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> ------------------------------
>> 2009-10-12 10:06:57,809 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server handler 7 on 60020, call put([B@1f5b92a,
>> [Lorg.apache.hadoop.hbase.client.Put;@14f8e05) from 10.1.29.10:36017:
>> error: java.io.IOException: Server not running, aborting
>> java.io.IOException: Server not running, aborting
>>  at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>>   at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>  at java.lang.reflect.Method.invoke(Unknown Source)
>>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>>  at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> 2009-10-12 10:06:57,812 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>> request=0.0, regions=12, stores=24, storefiles=15, storefileIndexSize=8,
>> memstoreSize=378, usedHeap=441, maxHeap=992, blockCacheSize=4878224,
>> blockCacheFree=1035636848, blockCacheCount=0, blockCacheHitRatio=0
>> 2009-10-12 10:06:57,813 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server handler 3 on 60020, call put([B@1f52982,
>> [Lorg.apache.hadoop.hbase.client.Put;@78283f) from 10.1.29.10:36017:
>> error: java.io.IOException: Server not running, aborting
>> java.io.IOException: Server not running, aborting
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>>   at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>  at java.lang.reflect.Method.invoke(Unknown Source)
>>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>>  at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> ------------------------------
>>
>>
>>
>>
>>
>> .
>>
>>
>>    
>
>  


Re: HBase Exceptions on version 0.20.1

by Andrew Purtell-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

So that's about 1500 regions per region server.  Should be more like 200-250 I think because as a practical matter the recovery time for redeploying 1500 regions upon hardware failure will block clients for a long time. It may not even be possible if all region servers are almost at the point of overcommittment when this large chunk of regions suddenly comes up for reassignment. AFAIK, Google's BigTable is tuned for 100 per region server, 200 MB per region. This is to minimize churn and region unavailability time should a server fail and also allows the node to serve the data mostly out of RAM.

This is with default max file size (256 MB)? What happens if you up that to 1 GB? Set hbase.hregion.max.filesize property in hbase-site.xml to make this change globally.

    - Andy




________________________________
From: elsif <elsif.then@...>
To: hbase-user@...
Sent: Wed, November 11, 2009 6:44:34 AM
Subject: Re: HBase Exceptions on version 0.20.1

On my last set of tests I used two instances of the test program - 60
threads total generating rows containing 400kB of data.  The highest
load appeared to be on the region serving meta.

On my cluster things started to break down at about 20,000 regions with
a heap size of 6MB.  You can always lower your heap size to generate
issues quicker.

Thanks,
elsif

stack wrote:

> This second run has been going for > 24 hours now.  Its loaded up near 2k
> regions.  I'm using defaults from your script which is 30 threads and 400k
> cells.   For 400k cells, I should probably be making adjustments upping the
> flush and region size.  I'm also noticing that all load seems to land on a
> single regionserver.  Did you notice that?  Maybe I should use something
> other than java Random?  Maybe I should md5 the Random output?
>
> But I just noticed that you ran with 4M cells.  4M cells are pretty big.
> HBase defaults are for cells much smaller.
>
> So I killed the above test and am now trying w/ 4M cells.  I see that we're
> flushing every 16 or so edits.  We're spending loads of our time just
> flushing and rolling WAL logs because the cells are all so large.  Are 4M
> cells your use case?  If so, I'll try and get you tunings that'll work
> better for these cell sizes.
>
> I've been watching ZooKeeper over last 24 hours.  The average latency
> serving zk response is about 5ms.  The longest response is 120ms (I see
> these stats by sending the quorum members the 'stat' command as in "echo
> stat|nc HOST PORT").  This time around I gave zk its own disk.  My guess is
> that when I saw the TIMED OUT yesterday, it was probably zk client-side
> issue.. a long GC or so.  I have GC logging running at the moment so will be
> able to correlate should it happen again.
>
> As for things slowing down, this third time loading I added in ganglia
> logging.  Will let you know.
>
> As to the exceptions you were seeing, here's some more on those:
>
> On the "INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block...", I
> see those in my testing.  The logs are at INFO level.  They should probably
> be WARN-level instead.  See the code below from DFSClient.  The map of nodes
> carrying block replicas can shift while DFSClient is open (Looking at one
> particular block from uploading I see that the local block was removed from
> local datanode because the block was over-replicated).  When we try to get
> the local block, we can fail.  Its logged as "No live nodes contain current
> block".  The current DFSClient map doesn't contain a node for the block.  As
> you can see in the code, we'll then ask NameNode for new map of blocks for
> this file.  As long as we get the block before cycle > failures, we'll see
> INFO-level loggings (For some reason I see 3 loggings often before we'll
> succeed and get the block).  Its a failure if DFSClient throws an IOE out of
> here.  I'm just seeing the INFO loggings in my testing.  On this second test
> I added logging to DFSClient to see if I can figure why it takes 3 goes
> often to find the right block.
>
>     private DNAddrPair chooseDataNode(LocatedBlock block)
>       throws IOException {
>       while (true) {
>         DatanodeInfo[] nodes = block.getLocations();
>         try {
>           DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
>           InetSocketAddress targetAddr =
>                             NetUtils.createSocketAddr(chosenNode.getName());
>           return new DNAddrPair(chosenNode, targetAddr);
>         } catch (IOException ie) {
>           String blockInfo = block.getBlock() + " file=" + src;
>           if (failures >= maxBlockAcquireFailures) {
>             throw new IOException("Could not obtain block: " + blockInfo);
>           }
>
>           if (nodes == null || nodes.length == 0) {
>             LOG.info("No node available for block: " + blockInfo);
>           }
>           LOG.info("Could not obtain block " + block.getBlock() + " from any
> node:  " + ie);
>           try {
>             Thread.sleep(3000);
>           } catch (InterruptedException iex) {
>           }
>           deadNodes.clear(); //2nd option is to remove only nodes[blockId]
>           openInfo();
>           failures++;
>           continue;
>         }
>       }
>
> St.Ack
>
>
> On Tue, Oct 20, 2009 at 2:12 PM, elsif <elsif.then@...> wrote:
>
>  
>> Hello Everyone,
>>
>> We are experiencing many HBase exceptions while running the attached
>> test program using
>> HBase 0.20.1 and Hadoop 0.20.1 (r810220) releases on a Hadoop cluster
>> with 14 data nodes.
>>
>> The HBase operational performance degrades quickly once certain error
>> conditions are
>> encountered especially if there are issues with the META region.
>>
>> Has anyone else seen these exceptions listed below?
>>
>> We ran the test with 3 region servers and then with 14 region servers.
>> In both cases, we still got exceptions.
>>
>> In each test run, we shutdown HBase, removed the main /hbase
>> folder, restarted HBase and then ran the test program.
>> The test program uses a table named "fc_test". The program is started by
>> the following command:
>>
>> /opt/hbase/bin/hbase org.apache.hadoop.hbase.util.TableTest  30  4000000
>>
>> The attached HBase test program starts 30 threads. Each thread loops,
>> inserting randomly generated keys
>> along with a fixed data pattern of 4,000,000 bytes.
>>
>> After running the test for 2-12 hours, the table scanning of "fc_test"
>> is broken and only a limited number of put
>> operations are still succeeding.
>>
>>
>>
>> EXCEPTION LIST
>>
>> ------------------------------
>> 2009-10-15 03:28:39,005 DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: Major compaction triggered
>> on store data; time since last major compaction 89677908ms
>> 2009-10-15 03:28:39,012 DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: Started compaction of 4
>> file(s)  into /hbase/fc_test/compaction.dir/1073627844, seqid=396374
>> 2009-10-15 03:28:39,245 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> obtain block blk_-6054029382496843671_324379 from any node:
>> java.io.IOException: No live nodes contain current block
>> ------------------------------
>> 2009-10-15 07:41:43,426 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> response for sessionid:0x2244f6194a5000d after 1ms
>> 2009-10-15 07:41:53,993 WARN org.apache.zookeeper.ClientCnxn: Exception
>> closing session 0x2244f6194a5000d to sun.nio.ch.SelectionKeyImpl@139f5ca
>> <mailto:sun.nio.ch.SelectionKeyImpl@139f5ca>
>> java.io.IOException: TIMED OUT
>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> 2009-10-15 07:41:54,344 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>> state: Disconnected, type: None, path: null
>> 2009-10-15 07:41:55,084 INFO org.apache.zookeeper.ClientCnxn: Attempting
>> connection to server hfs-029014/10.1.29.14:2181
>> 2009-10-15 07:41:55,085 INFO org.apache.zookeeper.ClientCnxn: Priming
>> connection to java.nio.channels.SocketChannel[connected
>> local=/10.1.29.19:39421 remote=hfs-029014/10.1.29.14:2181]
>> ------------------------------
>> 2009-10-15 08:37:40,982 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server handler 6 on 60020 caught: java.nio.channels.ClosedChannelException
>>    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
>>    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>>    at
>>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>>    at
>>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>> ------------------------------
>> 2009-10-15 07:57:28,025 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>> Total=4.652237MB (4878224), Free=987.6603MB (1035636848), Max=992.3125MB
>> (1040515072), Counts: Blocks=0, Access=51003, Hit=0, Miss=51003,
>> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%,
>> Evicted/Run=NaN
>> 2009-10-15 07:57:28,189 INFO org.apache.hadoop.hdfs.DFSClient: Exception
>> in createBlockOutputStream java.io.EOFException
>> 2009-10-15 07:57:28,190 INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_7166000434569842394_332632
>> 2009-10-15 07:57:28,306 WARN org.apache.hadoop.hbase.regionserver.HLog:
>> IPC Server handler 5 on 60020 took 2297ms appending an edit to hlog;
>> editcount=146
>> -----------------------------
>> 2009-10-15 09:10:00,559 INFO org.apache.zookeeper.ClientCnxn: Exception
>> while closing send thread for session 0x1244f6194780008 : Read error rc
>> = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
>> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
>> Disconnecting ClientCnxn for session: 0x1244f6194780008
>> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
>> EventThread shut down
>> ------------------------------
>> 2009-10-13 12:30:58,207 WARN
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to read
>> master address from ZooKeeper. Retrying. Error was:
>> java.io.IOException:
>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>> NoNode for /hbase/master
>>    at
>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:331)
>>    at
>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readMasterAddressOrThrow(ZooKeeperWrapper.java:240)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1338)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1370)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:427)
>>    at java.lang.Thread.run(Unknown Source)
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for /hbase/master
>>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:892)
>>    at
>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:327)
>> ------------------------------
>> 2009-10-14 15:12:42,664 WARN
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
>> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>>    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>>    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
>> Source)
>>    at java.lang.reflect.Constructor.newInstance(Unknown Source)
>>    at
>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>>    at
>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>>    at
>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>>    at java.lang.Thread.run(Unknown Source)
>> ------------------------------
>> 2009-10-14 15:12:43,566 WARN org.apache.zookeeper.ClientCnxn: Exception
>> closing session 0x1244f619478000d to sun.nio.ch.SelectionKeyImpl@15e32c4
>> <mailto:sun.nio.ch.SelectionKeyImpl@15e32c4>
>> java.io.IOException: Session Expired
>>    at
>>
>> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>>    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>>    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>> 2009-10-14 15:12:43,567 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>> state: Expired, type: None, path: null
>> 2009-10-14 15:12:43,568 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
>> expired
>> ------------------------------
>> 2009-10-14 15:12:51,563 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> java.io.IOException: Cannot append; log is closed
>>    at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:584)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1445)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1244)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1831)
>>    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> ------------------------------
>> 2009-10-14 15:13:51,852 INFO org.apache.hadoop.fs.FileSystem: Could not
>> cancel cleanup thread, though no FileSystems are open
>> 2009-10-14 15:13:51,864 FATAL
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
>> HRegionServer: file system not available
>> java.io.IOException: File system is not available
>>    at
>>
>> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:125)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkFileSystem(HRegionServer.java:901)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:849)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:832)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1839)
>>    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> Caused by: java.io.IOException: Call to hfs-029010/10.1.29.10:8020
>> failed on local exception: java.nio.channels.ClosedByInterruptException
>>    at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
>>    at org.apache.hadoop.ipc.Client.call(Client.java:742)
>>    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>>    at $Proxy1.getFileInfo(Unknown Source)
>>    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at
>>
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>>    at
>>
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>>    at $Proxy1.getFileInfo(Unknown Source)
>>    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
>>    at
>>
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>>    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
>>    at
>>
>> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:114)
>>    ... 9 more
>> Caused by: java.nio.channels.ClosedByInterruptException
>>    at java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown
>> Source)
>>    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>>    at
>>
>> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>>    at
>>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>    at
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>    at
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>>    at java.io.BufferedOutputStream.flush(Unknown Source)
>>    at java.io.DataOutputStream.flush(Unknown Source)
>>    at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:480)
>>    at org.apache.hadoop.ipc.Client.call(Client.java:720)
>> ------------------------------
>>
>> PREVIOUS EXCEPTIONS FROM OTHER TESTS INCLUDE
>>
>> 2009-10-13 00:00:00,508 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row
>> out of range for ...
>>    ....
>>    at
>> org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1527)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1559)
>>    at
>> org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1627)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2277)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1768)
>>    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> ------------------------------
>> 2009-10-12 10:06:57,809 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server handler 7 on 60020, call put([B@1f5b92a,
>> [Lorg.apache.hadoop.hbase.client.Put;@14f8e05) from 10.1.29.10:36017:
>> error: java.io.IOException: Server not running, aborting
>> java.io.IOException: Server not running, aborting
>>  at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>>   at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>  at java.lang.reflect.Method.invoke(Unknown Source)
>>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>>  at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> 2009-10-12 10:06:57,812 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>> request=0.0, regions=12, stores=24, storefiles=15, storefileIndexSize=8,
>> memstoreSize=378, usedHeap=441, maxHeap=992, blockCacheSize=4878224,
>> blockCacheFree=1035636848, blockCacheCount=0, blockCacheHitRatio=0
>> 2009-10-12 10:06:57,813 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server handler 3 on 60020, call put([B@1f52982,
>> [Lorg.apache.hadoop.hbase.client.Put;@78283f) from 10.1.29.10:36017:
>> error: java.io.IOException: Server not running, aborting
>> java.io.IOException: Server not running, aborting
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>>   at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>  at java.lang.reflect.Method.invoke(Unknown Source)
>>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>>  at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> ------------------------------
>>
>>
>>
>>
>>
>> .
>>
>>
>>    
>
>