|
|
regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Hi,
I am very interested in the solution that Joey proposed and would like to give it a try. Does anyone have any ideas on how to deploy this zk_wrapper with the JNI integration?
I would appreciate it very much.
thanks
zhong |
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
There isn't any working code yet. Just an idea, and a prototype.
There is some sense that if we can get the G1 GC, we could get rid of all long pauses and avoid the need for this.
-ryan
On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong <zhongresearch@...> wrote:
> Does anyone have any ideas on how to deploy this zk_wrapper in JNI
> integration? |
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Ryan,
Thank you very much. May I ask whether there are any ways to get around this problem to make HBase more stable?
best,
zhong
On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <ryanobjc@...> wrote:
> There is some sense that if we can get the G1 GC that we could get rid
> of all long pauses, and avoid the need for this. |
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Set the ZK timeout to something like 40ms, and give the GC enough Xmx so you never risk entering the much dreaded concurrent-mode-failure whereby the entire heap must be GCed.
Consider testing Java 7 and the G1 GC.
We could get a JNI thread to do this, but no one has done so yet. I am personally hoping for G1 and in the meantime overprovision our Xmx to avoid the concurrent mode failures.
-ryan
On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong <zhongresearch@...> wrote:
> May I ask whether there are any ways to get around this problem to make
> HBase more stable? |
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Ryan,
Thank you for your feedback. I have set zookeeper.session.timeout to a value on the order of seconds, which is way higher than 40ms. At the same time, -Xms is set to 4GB, which should be sufficient. I also tried GC options like
-XX:ParallelGCThreads=8
-XX:+UseConcMarkSweepGC
I even set vm.swappiness=0.
However, I still came across the problem that a RegionServer shut itself down.
Best,
zhong
On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <ryanobjc@...> wrote:
> Set the ZK timeout to something like 40ms, and give the GC enough Xmx
> so you never risk entering the much dreaded concurrent-mode-failure
> whereby the entire heap must be GCed. |
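One way to confirm that heap and GC options like the ones in this message actually reached a JVM is to ask the JVM itself. The sketch below is illustrative only and is not part of HBase; the class name `JvmSettingsCheck` is made up. It relies on the standard `Runtime` and `java.lang.management` APIs. Note that `-Xms` sets the initial heap size, while `-Xmx` sets the maximum heap size.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: reports what the running JVM was given.
public class JvmSettingsCheck {

    // Maximum heap the JVM will attempt to use, in megabytes (controlled by -Xmx).
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Names of the garbage collectors the JVM reports as active.
    public static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
        System.out.println("Collectors:    " + collectorNames());
    }
}
```

Running it with the same options as the RegionServer (for example `java -Xmx4g -XX:+UseConcMarkSweepGC JvmSettingsCheck`, on a JVM that supports that collector) should show whether the maximum heap and the collector match what was intended.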
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Sorry I must have mistyped, I meant to say "40 seconds". You can still see multi-second pauses at times, so you need to give yourself a bigger buffer.
The parallel threads argument should not be necessary, but you do need the UseConcMarkSweepGC flag as well.
Let us know how it goes!
-ryan
On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong <zhongresearch@...> wrote:
> I have set the zookeeper.session.timeout to seconds which is way higher than
> 40ms. |
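The multi-second pauses mentioned in this thread can be observed from inside a JVM with a small probe: sleep for a fixed interval and check how much longer than requested the sleep took. A large overshoot suggests the whole process was stalled (by GC, swapping, or an overloaded host). This is only a rough sketch and not anything from HBase; the class name `PauseProbe`, the 500 ms interval, and the 1000 ms warning threshold are assumptions chosen for illustration.

```java
// Hypothetical probe, not part of HBase: detects stalls by measuring sleep overshoot.
public class PauseProbe {

    // Sleeps for sleepMillis and returns how many extra milliseconds elapsed.
    public static long measureOvershootMillis(long sleepMillis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            // Preserve the interrupt status and report whatever time has elapsed.
            Thread.currentThread().interrupt();
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return elapsedMillis - sleepMillis;
    }

    public static void main(String[] args) {
        long warnThresholdMillis = 1000; // assumed threshold, tune as needed
        for (int i = 0; i < 10; i++) {
            long overshoot = measureOvershootMillis(500);
            if (overshoot > warnThresholdMillis) {
                System.out.println("Possible stall of about " + overshoot + " ms");
            }
        }
    }
}
```

If the reported stalls approach the ZooKeeper session timeout, the session is at risk of expiring, which is the reason for leaving a generous buffer.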
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Hi Zhenyu,
Sorry for the delay. I started working on this a while back, before I left my job for another company. Since then I haven't had much time to work on HBase unfortunately :(. I'll try to dig up what I had and see what shape it's in and update you.
Cheers,
-n
On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote:
> Sorry I must have mistyped, I meant to say "40 seconds". You can
> still see multi-second pauses at times, so you need to give yourself a
> bigger buffer. |
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Nitay,
Thank you, I appreciate it.
As Ryan suggested, I increased the ZooKeeper session timeout to 40 seconds, with the GC options -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC in place. I set the heap size to 4GB. I also set vm.swappiness=0.
However, it still ran into problems. Please find the errors below.
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server x.x.x.x:60021 for region YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10 attempts.
Exceptions:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413)
The input file is about 10GB, around 200 million rows of data. This load doesn't seem too large. However, this kind of error keeps popping up.
Does the RegionServer need to be deployed on dedicated machines? Does ZooKeeper need to be deployed on dedicated machines as well?
Best,
zhenyu
On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@...> wrote:
> I'll try to dig up what I had and see what shape
> it's in and update you. |
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
What's your cluster topology? How many nodes are involved? When you see the message below, how many regions are in your table? How are you loading your table?
Thanks,
St.Ack
On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <zhongresearch@...> wrote:
> However it still ran into problem. Please find the following errors.
>
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact region server x.x.x.x:60021 for region
> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10
> attempts. |
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Stack,
Thank you very much for your comments.
I am running a cluster with 20 nodes. I set up 19 of them as both regionservers and ZooKeeper quorum members. The versions I am using are Hadoop 0.20.1 and HBase 0.20.1.
I started with an empty table and am trying to load 200 million records into it. There is a key in each record. In my MR program, I open an HTable during setup. In my mapper, I fetch the row from the HTable using the key in the record, make some changes to the columns, and write the row back to the HTable through TableOutputFormat by passing a put. There are no reduce tasks involved. (Fetching rows from an empty table is unnecessary, but I did that intentionally.)
Additionally, when I reduced the number of regionservers and the number of ZooKeeper quorum members, I got different errors:
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929)
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580)
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589)
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593)
at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:127)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:105)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Many thanks in advance.
zhenyu
On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@...> wrote:
> Whats your cluster topology? How many nodes involved? When you see the
> below message, how many regions in your table? How are you loading your
> table? |
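For a sense of what a 19-member ZooKeeper quorum means in practice: ZooKeeper needs a majority of its servers to be available, and that majority has to acknowledge each update. The arithmetic can be sketched as follows. The class and method names are made up for illustration and are not part of ZooKeeper or HBase.

```java
// Hypothetical helper for illustration; not part of ZooKeeper or HBase.
public class QuorumMath {

    // Smallest number of servers that forms a strict majority of the ensemble.
    public static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of servers that can fail while a majority is still available.
    public static int toleratedFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 19}) {
            System.out.println(n + " servers: majority=" + majority(n)
                    + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```

With 19 servers the majority is 10, so every update waits on at least 10 acknowledgements; with 5 servers it waits on 3 and still survives 2 failures.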
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
These client error messages are not particularly descriptive as to the root cause (they are fatal errors, or close to it).
What is going on in your regionservers when these errors happen? Check the master and RS logs.
Also, you definitely do not want 19 zookeeper nodes. Reduce that to 3 or 5 max.
What is the hardware you are using for these nodes, and what settings do you have for heap/GC?
JG
Zhenyu Zhong wrote:
> I am running a cluster with 20 nodes. I set 19 as both regionserver and
> zookeeper quorums.
>>>>>> >>>>>> Best, >>>>>> zhong >>>>>> >>>>>> >>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <ryanobjc@...> >>> wrote: >>>>>> Set the ZK timeout to something like 40ms, and give the GC enough >> Xmx >>>>>>> so you never risk entering the much dreaded concurrent-mode-failure >>>>>>> whereby the entire heap must be GCed. >>>>>>> >>>>>>> Consider testing Java 7 and the G1 GC. >>>>>>> >>>>>>> We could get a JNI thread to do this, but no one has done so yet. I >> am >>>>>>> personally hoping for G1 and in the meantime overprovision our Xmx >> to >>>>>>> avoid the concurrent mode failures. >>>>>>> >>>>>>> -ryan >>>>>>> >>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong < >>> zhongresearch@...> >>>>>>> wrote: >>>>>>> >>>>>>>> Ryan, >>>>>>>> >>>>>>>> Thank you very much. >>>>>>>> May I ask whether there are any ways to get around this problem to >>> make >>>>>>>> HBase more stable? >>>>>>>> >>>>>>>> best, >>>>>>>> zhong >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <ryanobjc@...> >>>>>>>> wrote: >>>>>>>> >>>>>>>> There isnt any working code yet. Just an idea, and a prototype. >>>>>>>>> There is some sense that if we can get the G1 GC that we could get >>> rid >>>>>>>>> of all long pauses, and avoid the need for this. >>>>>>>>> >>>>>>>>> -ryan >>>>>>>>> >>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong < >>>>>>>>> zhongresearch@...> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I am very interesting to the solution that Joey proposed and >> would >>>>>>>>> like >>>>>>>> to >>>>>>>>>> have a try. >>>>>>>>>> Does anyone have any ideas on how to deploy this zk_wrapper in >> JNI >>>>>>>>>> integration? >>>>>>>>>> >>>>>>>>>> I would be very appreciated. >>>>>>>>>> >>>>>>>>>> thanks >>>>>>>>>> zhong >>>>>>>>>> >>>>>>>>>> > |
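JG's advice above to cut the ZooKeeper ensemble from 19 nodes to 3 or 5 can be illustrated with majority-quorum arithmetic. The sketch below is only an illustration and is not HBase or ZooKeeper code: the class name `QuorumMath` and its methods are hypothetical. It assumes the usual majority rule, where an ensemble stays available as long as more than half of its servers are reachable.

```java
public class QuorumMath {
    // Smallest number of servers that forms a strict majority of the ensemble.
    public static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of servers that can be lost while a majority is still reachable.
    public static int tolerableFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        // 3 and 5 are the sizes suggested in the thread; 19 is the size reported.
        for (int n : new int[] {3, 5, 19}) {
            System.out.printf("ensemble=%d majority=%d tolerates=%d failures%n",
                    n, majority(n), tolerableFailures(n));
        }
    }
}
```

Under that rule a 5-node ensemble already survives 2 failures, while a 19-node ensemble needs 10 servers to agree on every update. That extra coordination is a likely reason a small ensemble is preferred, though the thread itself does not spell out the reasoning.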
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
JG,
Thanks a lot for the tips.
I set the heap to 4GB and the GC options to -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC.

I checked the logs on my master and RS and found the following errors. Basically, the master got an exception while scanning ROOT, and then the ROOT region was offlined and unset. As a result, the regionserver reports NotServingRegionException errors.

In the master:
2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: x.x.x.x:60021, regionname: -ROOT-,,0, startKey: <>}
2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan ROOT region
java.io.IOException: Call to /x.x.x.x:60021 failed on local exception: java.io.EOFException
    at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
    at $Proxy1.openScanner(Unknown Source)
    at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
    at org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54)
    at org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79)
    at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
    at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448)
2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: x.x.x.x:60021, regionname: .META.,,1, startKey: <>}
2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan one META region: {server: x.x.x.x:60021, regionname: .META.,,1, startKey: <>}
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
    at $Proxy1.openScanner(Unknown Source)
    at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
    at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
    at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
    at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
    at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.ServerManager: Removing server's info YYYY,60021,1256755470570
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: Offlined ROOT server: x.x.x.x:60021
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: Offlining META region: {server: x.x.x.x:60021, regionname: .META.,,1, startKey: <>}
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: META region removed from onlineMetaRegions

On the regionserver:
2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test,,1256755871065
2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test,,1256755871065
2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion: region test,,1256755871065/796855017 available; sequence id is 10013291
2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region test,,1256755871065
2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x249c76021d0001 after 0ms
2009-10-28 18:51:19,341 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60021, call get([B@21fefd80, row=1053508149, maxVersions=1, timeRange=[0,9223372036854775807), families={(family=email_ip_activity, columns=ALL}) from x.x.x.x:54669: error: org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503

On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <jlist@...> wrote: > These client error messages are not particular descriptive as to the root > cause (they are fatal errors, or close to it). > > What is going on in your regionservers when these errors happen? Check the > master and RS logs. > > Also, you definitely do not want 19 zookeeper nodes. Reduce that to 3 or 5 > max. > > What is the hardware you are using for these nodes, and what settings do > you have for heap/GC? > > JG > > > Zhenyu Zhong wrote: > >> Stack, >> >> Thank you very much for your comments. >> I am running a cluster with 20 nodes. I set 19 as both regionserver and >> zookeeper quorums. >> The versions I am using are Hadoop0.20.1 and HBase0.20.1. >> I started with an empty table and try to load 200 million records into >> that >> empty table. >> There is a key in each record. Logically, in my MR program, during the >> setup, I opened an HTable, in my mapper, I fetch the record from HTable >> via >> key in the record, then make some changes to the columns and update that >> row >> back to HTable through TableOutputFormat by passing a put. There is no >> reduce tasks involved here. (Though it is unnecessary to fetch row from >> an >> empty table, I just intended to do that) >> >> Additionally, when I reduced the number of regionservers and number of >> zookeeper quorums. 
>> I had different errors: >> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out >> trying >> to locate root region at >> >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929) >> at >> >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580) >> at >> >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) >> at >> >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) >> at >> >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589) >> at >> >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) >> at >> >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) >> at >> >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593) >> at >> >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556) >> at org.apache.hadoop.hbase.client.HTable.(HTable.java:127) at >> org.apache.hadoop.hbase.client.HTable.(HTable.java:105) at >> >> org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116) >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573) at >> org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at >> org.apache.hadoop.mapred.Child.main(Child.java:170) >> >> Many thanks in advance. >> zhenyu >> >> >> >> >> On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@...> wrote: >> >> Whats your cluster topology? How many nodes involved? When you see the >>> below message, how many regions in your table? How are you loading your >>> table? >>> Thanks, >>> St.Ack >>> >>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <zhongresearch@... 
>>> >>>> wrote: >>>> Nitay, >>>> >>>> I am very appreciated. >>>> >>>> As Ryan suggested, I increased the zookeeper session timeout to >>>> 40seconds >>>> along with the GC options -XX:ParallelGCThreads=8 >>>> >>> -XX:+UseConcMarkSweepGC >>> >>>> in place. I set the Heapsize to 4GB. I also set the vm.swappiness=0. >>>> >>>> However it still ran into problem. Please find the following errors. >>>> >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to >>>> contact region server x.x.x.x:60021 for region >>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10 >>>> attempts. >>>> Exceptions: >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>> setting up proxy to /x.x.x.:60021 after attempts=1 >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>> >>>> at >>>> >>>> 
org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001) >>> >>>> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413) >>>> >>>> >>>> The input file is about 10GB around 200million rows of data. >>>> This load doesn't seem too large. However this kind of errors keep >>>> >>> popping >>> >>>> up. >>>> >>>> Does Regionserver need to be deployed to dedicated machines? >>>> Does Zookeeper need to be deployed to dedicated machines as well? >>>> >>>> Best, >>>> zhenyu >>>> >>>> >>>> >>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@...> wrote: >>>> >>>> Hi Zhenyu, >>>>> >>>>> Sorry for the delay. I started working on this a while back, before I >>>>> >>>> left >>>> >>>>> my job for another company. Since then I haven't had much time to work >>>>> >>>> on >>> >>>> HBase unfortunately :(. I'll try to dig up what I had and see what >>>>> >>>> shape >>> >>>> it's in and update you. >>>>> >>>>> Cheers, >>>>> -n >>>>> >>>>> >>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote: >>>>> >>>>> Sorry I must have mistyped, I meant to say "40 seconds". You can >>>>> >>>>>> still see multi-second pauses at times, so you need to give yourself a >>>>>> bigger buffer. >>>>>> >>>>>> The parallel threads argument should not be necessary, but you do need >>>>>> the UseConcMarkSweepGC flag as well. >>>>>> >>>>>> Let us know how it goes! >>>>>> -ryan >>>>>> >>>>>> >>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong < >>>>>> >>>>> zhongresearch@...> >>> >>>> wrote: >>>>>> >>>>>> Ryan, >>>>>>> I am very appreciated for your feedbacks. >>>>>>> I have set the zookeeper.session.timeout to seconds which is way >>>>>>> >>>>>> higher >>> >>>> than >>>>>>> 40ms. >>>>>>> In the same time, the -Xms is set to 4GB, which should be sufficient. 
>>>>>>> I also tried GC options like >>>>>>> >>>>>>> -XX:ParallelGCThreads=8 >>>>>>> -XX:+UseConcMarkSweepGC >>>>>>> >>>>>>> I even set the vm.swappiness=0 >>>>>>> >>>>>>> However, I still came across the problem that a RegionServer shutdown >>>>>>> itself. >>>>>>> >>>>>>> Best, >>>>>>> zhong >>>>>>> >>>>>>> >>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <ryanobjc@...> >>>>>>> >>>>>> wrote: >>>> >>>>> Set the ZK timeout to something like 40ms, and give the GC enough >>>>>>> >>>>>> Xmx >>> >>>> so you never risk entering the much dreaded concurrent-mode-failure >>>>>>>> whereby the entire heap must be GCed. >>>>>>>> >>>>>>>> Consider testing Java 7 and the G1 GC. >>>>>>>> >>>>>>>> We could get a JNI thread to do this, but no one has done so yet. I >>>>>>>> >>>>>>> am >>> >>>> personally hoping for G1 and in the meantime overprovision our Xmx >>>>>>>> >>>>>>> to >>> >>>> avoid the concurrent mode failures. >>>>>>>> >>>>>>>> -ryan >>>>>>>> >>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong < >>>>>>>> >>>>>>> zhongresearch@...> >>>> >>>>> wrote: >>>>>>>> >>>>>>>> Ryan, >>>>>>>>> >>>>>>>>> Thank you very much. >>>>>>>>> May I ask whether there are any ways to get around this problem to >>>>>>>>> >>>>>>>> make >>>> >>>>> HBase more stable? >>>>>>>>> >>>>>>>>> best, >>>>>>>>> zhong >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <ryanobjc@...> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> There isnt any working code yet. Just an idea, and a prototype. >>>>>>>>> >>>>>>>>>> There is some sense that if we can get the G1 GC that we could get >>>>>>>>>> >>>>>>>>> rid >>>> >>>>> of all long pauses, and avoid the need for this. 
>>>>>>>>>> >>>>>>>>>> -ryan >>>>>>>>>> >>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong < >>>>>>>>>> zhongresearch@...> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> I am very interesting to the solution that Joey proposed and >>>>>>>>>>> >>>>>>>>>> would >>> >>>> like >>>>>>>>>> >>>>>>>>> to >>>>>>>>> >>>>>>>>>> have a try. >>>>>>>>>>> Does anyone have any ideas on how to deploy this zk_wrapper in >>>>>>>>>>> >>>>>>>>>> JNI >>> >>>> integration? >>>>>>>>>>> >>>>>>>>>>> I would be very appreciated. >>>>>>>>>>> >>>>>>>>>>> thanks >>>>>>>>>>> zhong >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >> |
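The "40ms" versus "40 seconds" mix-up earlier in this thread is easy to repeat, because HBase's `zookeeper.session.timeout` setting is expressed in milliseconds. The following is a minimal sketch of that unit check, not HBase code: the class `SessionTimeoutCheck`, its methods, and the 3-second pause are hypothetical, and comparing one pause against the timeout is a simplification of how ZooKeeper actually expires sessions.

```java
import java.util.concurrent.TimeUnit;

public class SessionTimeoutCheck {
    // Converts a timeout expressed in seconds into the millisecond value
    // that a millisecond-based setting expects.
    public static long secondsToMillis(long seconds) {
        return TimeUnit.SECONDS.toMillis(seconds);
    }

    // Simplified model: the session is treated as surviving if a single
    // stop-the-world pause is shorter than the session timeout.
    public static boolean survivesPause(long sessionTimeoutMs, long pauseMs) {
        return pauseMs < sessionTimeoutMs;
    }

    public static void main(String[] args) {
        long intendedMs = secondsToMillis(40); // the corrected "40 seconds"
        long mistypedMs = 40;                  // the mistyped "40ms"
        long pauseMs = 3_000;                  // hypothetical multi-second GC pause

        System.out.println("intended timeout (ms): " + intendedMs);
        System.out.println("survives pause with intended value: "
                + survivesPause(intendedMs, pauseMs));
        System.out.println("survives pause with mistyped value: "
                + survivesPause(mistypedMs, pauseMs));
    }
}
```

The point of the margin is the one Ryan makes: multi-second pauses can still occur with CMS, so the timeout needs to be comfortably larger than the longest pause you expect to see.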
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
FYI
It looks like increasing the number of ZooKeeper quorum members resolves the following error:
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region at org.apache.hadoop.hbase.

I am now running a ZooKeeper quorum member on every node I have. However, I am still losing regionservers.

Is there a way to browse the znodes in ZooKeeper?

thanks
zhenyu

On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong <zhongresearch@...> wrote: > JG, > > > Thanks a lot for the tips. > I set the HEAP to 4GB and GC options as -XX:ParallelGCThreads=8 > -XX:+UseConcMarkSweepGC. > > I checked the logs in my Master an RS and found the following errors. > Basically, master got exception error while scanning ROOT, then the ROOT > region was offline and unset. Thus the regionserver can't get > NotservingRegion errors. > > In the master: > 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scanning meta region {server: x.x.x. 
> x:60021, regionname: -ROOT-,,0, startKey: <>} > 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner: > Scan ROOT region > java.io.IOException: Call to /x.x.x.x:60021 failed on local exception: > java.io.EOFException > at > org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757) > at > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727) > at > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > at $Proxy1.openScanner(Unknown Source) > at > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > at > org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54) > at > org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79) > at > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > Caused by: java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:375) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448) > 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.metaScanner scanning meta region {server: x.x.x. 
> x:60021, regionname: .META.,,1, startKey: <>} > 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner: > Scan one META region: {server: x.x.x.x:60021, regionname: .M > ETA.,,1, startKey: <>} > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308) > at > org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831) > at > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712) > at > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > at $Proxy1.openScanner(Unknown Source) > at > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > at > org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73) > at > org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129) > at > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: > All 1 .META. 
region(s) scanned > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.ServerManager: > Removing server's info YYYY,60021,125675547057 > 0 > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: > Offlined ROOT server: x.x.x.x:60021 > > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: > -ROOT- region unset (but not set to be reassigned) > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: > ROOT inserted into regionsInTransition > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: > Offlining META region: {server: x.x.x.x:60021, regionname: .META.,,1, > startKey: <>} > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: > META region removed from onlineMetaRegions > > > > On the regionserver: > 2009-10-28 18:51:14,578 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: > test,,1256755871065 > 2009-10-28 18:51:14,578 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: > test,,1256755871065 > 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion: > region test,,1256755871065/796855017 available; sequence id is 10013291 > 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Starting compaction on region test,,1256755871065 > 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got ping > response for sessionid:0x249c76021d0001 after 0ms > 2009-10-28 18:51:19,341 ERROR > org.apache.hadoop.hbase.regionserver.HRegionServer: > org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784) > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at 
java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) > 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 0 on 60021, call get([B@21fefd80, row=1053508149, maxVersions=1, > timeRange=[0,9223372036854775807), families={(family=email_ip_activity, > columns=ALL}) from x.x.x.x:54669: error: > org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503 > > > > > > > On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <jlist@...> wrote: > >> These client error messages are not particular descriptive as to the root >> cause (they are fatal errors, or close to it). >> >> What is going on in your regionservers when these errors happen? Check >> the master and RS logs. >> >> Also, you definitely do not want 19 zookeeper nodes. Reduce that to 3 or >> 5 max. >> >> What is the hardware you are using for these nodes, and what settings do >> you have for heap/GC? >> >> JG >> >> >> Zhenyu Zhong wrote: >> >>> Stack, >>> >>> Thank you very much for your comments. >>> I am running a cluster with 20 nodes. I set 19 as both regionserver and >>> zookeeper quorums. >>> The versions I am using are Hadoop0.20.1 and HBase0.20.1. >>> I started with an empty table and try to load 200 million records into >>> that >>> empty table. >>> There is a key in each record. Logically, in my MR program, during the >>> setup, I opened an HTable, in my mapper, I fetch the record from HTable >>> via >>> key in the record, then make some changes to the columns and update that >>> row >>> back to HTable through TableOutputFormat by passing a put. There is no >>> reduce tasks involved here. (Though it is unnecessary to fetch row from >>> an >>> empty table, I just intended to do that) >>> >>> Additionally, when I reduced the number of regionservers and number of >>> zookeeper quorums. 
>>> I had different errors: >>> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out >>> trying >>> to locate root region at >>> >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929) >>> at >>> >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580) >>> at >>> >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) >>> at >>> >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) >>> at >>> >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589) >>> at >>> >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) >>> at >>> >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) >>> at >>> >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593) >>> at >>> >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556) >>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:127) at >>> org.apache.hadoop.hbase.client.HTable.(HTable.java:105) at >>> >>> org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116) >>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573) at >>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at >>> org.apache.hadoop.mapred.Child.main(Child.java:170) >>> >>> Many thanks in advance. >>> zhenyu >>> >>> >>> >>> >>> On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@...> wrote: >>> >>> Whats your cluster topology? How many nodes involved? When you see the >>>> below message, how many regions in your table? How are you loading your >>>> table? 
>>>> Thanks, >>>> St.Ack >>>> >>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <zhongresearch@... >>>> >>>>> wrote: >>>>> Nitay, >>>>> >>>>> I am very appreciated. >>>>> >>>>> As Ryan suggested, I increased the zookeeper session timeout to >>>>> 40seconds >>>>> along with the GC options -XX:ParallelGCThreads=8 >>>>> >>>> -XX:+UseConcMarkSweepGC >>>> >>>>> in place. I set the Heapsize to 4GB. I also set the vm.swappiness=0. >>>>> >>>>> However it still ran into problem. Please find the following errors. >>>>> >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to >>>>> contact region server x.x.x.x:60021 for region >>>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10 >>>>> attempts. >>>>> Exceptions: >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>>> setting up proxy to /x.x.x.:60021 after attempts=1 >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed >>>>> setting 
up proxy to /x.x.x.x:60021 after attempts=1 >>>>> >>>>> at >>>>> >>>>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001) >>>> >>>>> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413) >>>>> >>>>> >>>>> The input file is about 10GB around 200million rows of data. >>>>> This load doesn't seem too large. However this kind of errors keep >>>>> >>>> popping >>>> >>>>> up. >>>>> >>>>> Does Regionserver need to be deployed to dedicated machines? >>>>> Does Zookeeper need to be deployed to dedicated machines as well? >>>>> >>>>> Best, >>>>> zhenyu >>>>> >>>>> >>>>> >>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@...> wrote: >>>>> >>>>> Hi Zhenyu, >>>>>> >>>>>> Sorry for the delay. I started working on this a while back, before I >>>>>> >>>>> left >>>>> >>>>>> my job for another company. Since then I haven't had much time to work >>>>>> >>>>> on >>>> >>>>> HBase unfortunately :(. I'll try to dig up what I had and see what >>>>>> >>>>> shape >>>> >>>>> it's in and update you. >>>>>> >>>>>> Cheers, >>>>>> -n >>>>>> >>>>>> >>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote: >>>>>> >>>>>> Sorry I must have mistyped, I meant to say "40 seconds". You can >>>>>> >>>>>>> still see multi-second pauses at times, so you need to give yourself >>>>>>> a >>>>>>> bigger buffer. >>>>>>> >>>>>>> The parallel threads argument should not be necessary, but you do >>>>>>> need >>>>>>> the UseConcMarkSweepGC flag as well. >>>>>>> >>>>>>> Let us know how it goes! >>>>>>> -ryan >>>>>>> >>>>>>> >>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong < >>>>>>> >>>>>> zhongresearch@...> >>>> >>>>> wrote: >>>>>>> >>>>>>> Ryan, >>>>>>>> I am very appreciated for your feedbacks. >>>>>>>> I have set the zookeeper.session.timeout to seconds which is way >>>>>>>> >>>>>>> higher >>>> >>>>> than >>>>>>>> 40ms. >>>>>>>> In the same time, the -Xms is set to 4GB, which should be >>>>>>>> sufficient. 
>>>>>>>> I also tried GC options like >>>>>>>> >>>>>>>> -XX:ParallelGCThreads=8 >>>>>>>> -XX:+UseConcMarkSweepGC >>>>>>>> >>>>>>>> I even set the vm.swappiness=0 >>>>>>>> >>>>>>>> However, I still came across the problem that a RegionServer >>>>>>>> shutdown >>>>>>>> itself. >>>>>>>> >>>>>>>> Best, >>>>>>>> zhong >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <ryanobjc@...> >>>>>>>> >>>>>>> wrote: >>>>> >>>>>> Set the ZK timeout to something like 40ms, and give the GC enough >>>>>>>> >>>>>>> Xmx >>>> >>>>> so you never risk entering the much dreaded concurrent-mode-failure >>>>>>>>> whereby the entire heap must be GCed. >>>>>>>>> >>>>>>>>> Consider testing Java 7 and the G1 GC. >>>>>>>>> >>>>>>>>> We could get a JNI thread to do this, but no one has done so yet. I >>>>>>>>> >>>>>>>> am >>>> >>>>> personally hoping for G1 and in the meantime overprovision our Xmx >>>>>>>>> >>>>>>>> to >>>> >>>>> avoid the concurrent mode failures. >>>>>>>>> >>>>>>>>> -ryan >>>>>>>>> >>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong < >>>>>>>>> >>>>>>>> zhongresearch@...> >>>>> >>>>>> wrote: >>>>>>>>> >>>>>>>>> Ryan, >>>>>>>>>> >>>>>>>>>> Thank you very much. >>>>>>>>>> May I ask whether there are any ways to get around this problem to >>>>>>>>>> >>>>>>>>> make >>>>> >>>>>> HBase more stable? >>>>>>>>>> >>>>>>>>>> best, >>>>>>>>>> zhong >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <ryanobjc@...> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> There isnt any working code yet. Just an idea, and a prototype. >>>>>>>>>> >>>>>>>>>>> There is some sense that if we can get the G1 GC that we could >>>>>>>>>>> get >>>>>>>>>>> >>>>>>>>>> rid >>>>> >>>>>> of all long pauses, and avoid the need for this. 
>>>>>>>>>>> >>>>>>>>>>> -ryan >>>>>>>>>>> >>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong < >>>>>>>>>>> zhongresearch@...> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> I am very interesting to the solution that Joey proposed and >>>>>>>>>>>> >>>>>>>>>>> would >>>> >>>>> like >>>>>>>>>>> >>>>>>>>>> to >>>>>>>>>> >>>>>>>>>>> have a try. >>>>>>>>>>>> Does anyone have any ideas on how to deploy this zk_wrapper in >>>>>>>>>>>> >>>>>>>>>>> JNI >>>> >>>>> integration? >>>>>>>>>>>> >>>>>>>>>>>> I would be very appreciated. >>>>>>>>>>>> >>>>>>>>>>>> thanks >>>>>>>>>>>> zhong >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>> > |
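Since the thread keeps coming back to heap size and collector flags, it may help to look at how much time the JVM actually spends collecting. The sketch below uses the standard `java.lang.management` API to print cumulative collection counts and times for the JVM it runs in. The class name `GcReport` and the method `totalGcMillis` are illustrative assumptions and are not part of HBase. The numbers are cumulative totals, so they do not show the length of any single pause; GC logging on the region server JVM is the more direct way to see individual stalls.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Illustrative helper (hypothetical name): reports cumulative GC activity for this JVM. */
final class GcReport {

    /** Sums collection time over all collectors, ignoring collectors that report -1 (undefined). */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // One line per collector: its name, how often it ran, and its cumulative time.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total GC time (ms): " + totalGcMillis());
    }
}
```

To be useful for this problem the same calls would have to run inside the region server process or be read remotely over JMX; run standalone, the program only describes its own short-lived JVM.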
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong <zhongresearch@...> wrote:

> FYI
> It looks like increasing the number of Zookeeper Quorums can solve the
> following error message: org.apache.hadoop.hbase.client.NoServerForRegionException:
> Timed out trying to locate root region at org.apache.hadoop.hbase.

You mean quorum members? How many do you have now?

> Now I am running Zookeeper quorum on each node I have.
> However, I am still having issues about losing regionserver.

What's in the logs?

> Is there a way to browse the Znode in zookeeper?

Type 'zk' in the hbase shell. You can get to the zk shell from the hbase shell. You can do things like:

  zk "ls /"

(Yes, the quotes are needed.)

St.Ack

> thanks
> zhenyu
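On the quorum-member question: a ZooKeeper ensemble stays available only while a strict majority of its servers is up, which is the arithmetic behind the advice elsewhere in this thread to run 3 or 5 servers rather than 19. The following is a minimal sketch of that arithmetic; the class and method names are hypothetical and are not part of ZooKeeper or HBase.

```java
/** Illustrative quorum arithmetic; class and method names are hypothetical. */
final class QuorumMath {

    /** Smallest number of servers that forms a strict majority of the ensemble. */
    static int majority(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensemble needs at least one server");
        }
        return ensembleSize / 2 + 1;
    }

    /** Number of servers that can be down while a majority is still up. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        // Sizes mentioned in the thread: the suggested 3 or 5, and the 19 in use.
        for (int n : new int[] {3, 5, 19}) {
            System.out.println(n + " servers: majority=" + majority(n)
                    + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```

Three servers tolerate one failure and five tolerate two. Nineteen tolerate nine, but every write then needs acknowledgement from ten servers, so the extra members mostly add coordination cost rather than stability.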
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
I have 19 quorum members now.

When I tested loading data into two column families of one table in HBase using two separate MR jobs, I lost my regionserver and the test failed. Does HBase allow this kind of table update? The errors I got when I lost my regionserver are:

2009-10-29 21:09:34,705 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256849620029, entries=271911, calcsize=63754142, filesize=33975611. New hlog /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256850574705
2009-10-29 21:09:50,322 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
        at java.lang.Thread.run(Thread.java:619)
2009-10-29 21:09:50,773 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x1124a2128bcf0001 to sun.nio.ch.SelectionKeyImpl@663257b8
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2009-10-29 21:09:50,873 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
2009-10-29 21:09:51,423 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server YYYY:2181
2009-10-29 21:09:51,423 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.100.118:54789 remote=superpyxis0005/192.168.100.119:2181]
2009-10-29 21:09:51,423 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
2009-10-29 21:09:51,423 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Expired, type: None, path: null
2009-10-29 21:09:51,423 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x1124a2128bcf0001 to sun.nio.ch.SelectionKeyImpl@1829ae5e
java.io.IOException: Session Expired
        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2009-10-29 21:09:51,423 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
2009-10-29 21:09:51,423 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=254.97333, regions=36, stores=119, storefiles=131, storefileIndexSize=8, memstoreSize=39, usedHeap=85, maxHeap=4079, blockCacheSize=7019112, blockCacheFree=848487832, blockCacheCount=0, blockCacheHitRatio=0
2009-10-29 21:09:53,327 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60021

On Thu, Oct 29, 2009 at 2:51 PM, stack <stack@...> wrote:
> On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong <zhongresearch@...> wrote:
>
> > FYI
> > It looks like increasing the number of Zookeeper Quorums can solve the
> > following error message: org.apache.hadoop.hbase.client.NoServerForRegionException:
> > Timed out trying to locate root region at org.apache.hadoop.hbase.
>
> You mean quorum members? How many do you have now?
>
> > Now I am running Zookeeper quorum on each node I have.
> > However, I am still having issues about losing regionserver.
>
> Whats in the logs?
>
> > Is there a way to browse the Znode in zookeeper?
>
> Type 'zk' in the hbase shell.
> > You can get to the zk shell from hbase shell. You so things like: > > > zk "ls /" > > (Yes, quotes needed). > > St.Ack > > > > > thanks > > zhenyu > > > > > > > > > > > > > > On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong <zhongresearch@... > > >wrote: > > > > > JG, > > > > > > > > > Thanks a lot for the tips. > > > I set the HEAP to 4GB and GC options as -XX:ParallelGCThreads=8 > > > -XX:+UseConcMarkSweepGC. > > > > > > I checked the logs in my Master an RS and found the following errors. > > > Basically, master got exception error while scanning ROOT, then the > ROOT > > > region was offline and unset. Thus the regionserver can't get > > > NotservingRegion errors. > > > > > > In the master: > > > 2009-10-28 19:00:30,591 INFO > org.apache.hadoop.hbase.master.BaseScanner: > > > RegionManager.rootScanner scanning meta region {server: x.x.x. > > > x:60021, regionname: -ROOT-,,0, startKey: <>} > > > 2009-10-28 19:00:30,591 WARN > org.apache.hadoop.hbase.master.BaseScanner: > > > Scan ROOT region > > > java.io.IOException: Call to /x.x.x.x:60021 failed on local exception: > > > java.io.EOFException > > > at > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757) > > > at > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727) > > > at > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > > > at $Proxy1.openScanner(Unknown Source) > > > at > > > > > > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > > > at > > > > org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54) > > > at > > > > > > org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79) > > > at > > > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > > > Caused by: java.io.EOFException > > > at java.io.DataInputStream.readInt(DataInputStream.java:375) > > > at > > > > > > 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504) > > > at > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448) > > > 2009-10-28 19:00:30,591 INFO > org.apache.hadoop.hbase.master.BaseScanner: > > > RegionManager.metaScanner scanning meta region {server: x.x.x. > > > x:60021, regionname: .META.,,1, startKey: <>} > > > 2009-10-28 19:00:30,591 WARN > org.apache.hadoop.hbase.master.BaseScanner: > > > Scan one META region: {server: x.x.x.x:60021, regionname: .M > > > ETA.,,1, startKey: <>} > > > java.net.ConnectException: Connection refused > > > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > > > at > > > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) > > > at > > > > > > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > > > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404) > > > at > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308) > > > at > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831) > > > at > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712) > > > at > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > > > at $Proxy1.openScanner(Unknown Source) > > > at > > > > > > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > > > at > > > > > > org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73) > > > at > > > > > > org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129) > > > at > > > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > > > 2009-10-28 19:00:30,591 INFO > org.apache.hadoop.hbase.master.BaseScanner: > > > All 1 .META. 
region(s) scanned > > > 2009-10-28 19:00:31,395 INFO > > org.apache.hadoop.hbase.master.ServerManager: > > > Removing server's info YYYY,60021,125675547057 > > > 0 > > > 2009-10-28 19:00:31,395 INFO > > org.apache.hadoop.hbase.master.RegionManager: > > > Offlined ROOT server: x.x.x.x:60021 > > > > > > 2009-10-28 19:00:31,395 INFO > > org.apache.hadoop.hbase.master.RegionManager: > > > -ROOT- region unset (but not set to be reassigned) > > > 2009-10-28 19:00:31,395 INFO > > org.apache.hadoop.hbase.master.RegionManager: > > > ROOT inserted into regionsInTransition > > > 2009-10-28 19:00:31,395 INFO > > org.apache.hadoop.hbase.master.RegionManager: > > > Offlining META region: {server: x.x.x.x:60021, regionname: .META.,,1, > > > startKey: <>} > > > 2009-10-28 19:00:31,395 INFO > > org.apache.hadoop.hbase.master.RegionManager: > > > META region removed from onlineMetaRegions > > > > > > > > > > > > On the regionserver: > > > 2009-10-28 18:51:14,578 INFO > > > org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: > > > test,,1256755871065 > > > 2009-10-28 18:51:14,578 INFO > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: > > MSG_REGION_OPEN: > > > test,,1256755871065 > > > 2009-10-28 18:51:14,578 INFO > > org.apache.hadoop.hbase.regionserver.HRegion: > > > region test,,1256755871065/796855017 available; sequence id is 10013291 > > > 2009-10-28 18:51:14,578 INFO > > org.apache.hadoop.hbase.regionserver.HRegion: > > > Starting compaction on region test,,1256755871065 > > > 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got ping > > > response for sessionid:0x249c76021d0001 after 0ms > > > 2009-10-28 18:51:19,341 ERROR > > > org.apache.hadoop.hbase.regionserver.HRegionServer: > > > org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503 > > > at > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307) > > > at > > > > > > 
org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784) > > > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) > > > at > > > > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > > > at java.lang.reflect.Method.invoke(Method.java:597) > > > at > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648) > > > at > > > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) > > > 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC > > Server > > > handler 0 on 60021, call get([B@21fefd80, row=1053508149, > maxVersions=1, > > > timeRange=[0,9223372036854775807), families={(family=email_ip_activity, > > > columns=ALL}) from x.x.x.x:54669: error: > > > org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503 > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <jlist@...> > > wrote: > > > > > >> These client error messages are not particular descriptive as to the > > root > > >> cause (they are fatal errors, or close to it). > > >> > > >> What is going on in your regionservers when these errors happen? > Check > > >> the master and RS logs. > > >> > > >> Also, you definitely do not want 19 zookeeper nodes. Reduce that to 3 > > or > > >> 5 max. > > >> > > >> What is the hardware you are using for these nodes, and what settings > do > > >> you have for heap/GC? > > >> > > >> JG > > >> > > >> > > >> Zhenyu Zhong wrote: > > >> > > >>> Stack, > > >>> > > >>> Thank you very much for your comments. > > >>> I am running a cluster with 20 nodes. I set 19 as both regionserver > and > > >>> zookeeper quorums. > > >>> The versions I am using are Hadoop0.20.1 and HBase0.20.1. > > >>> I started with an empty table and try to load 200 million records > into > > >>> that > > >>> empty table. > > >>> There is a key in each record. 
Logically, in my MR program, during > the > > >>> setup, I opened an HTable, in my mapper, I fetch the record from > HTable > > >>> via > > >>> key in the record, then make some changes to the columns and update > > that > > >>> row > > >>> back to HTable through TableOutputFormat by passing a put. There is > no > > >>> reduce tasks involved here. (Though it is unnecessary to fetch row > > from > > >>> an > > >>> empty table, I just intended to do that) > > >>> > > >>> Additionally, when I reduced the number of regionservers and number > of > > >>> zookeeper quorums. > > >>> I had different errors: > > >>> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out > > >>> trying > > >>> to locate root region at > > >>> > > >>> > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929) > > >>> at > > >>> > > >>> > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580) > > >>> at > > >>> > > >>> > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > > >>> at > > >>> > > >>> > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > > >>> at > > >>> > > >>> > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589) > > >>> at > > >>> > > >>> > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > > >>> at > > >>> > > >>> > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > > >>> at > > >>> > > >>> > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593) > > >>> at > > >>> > > >>> > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556) > > >>> at 
org.apache.hadoop.hbase.client.HTable.(HTable.java:127) at > > >>> org.apache.hadoop.hbase.client.HTable.(HTable.java:105) at > > >>> > > >>> > > > org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116) > > >>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573) at > > >>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at > > >>> org.apache.hadoop.mapred.Child.main(Child.java:170) > > >>> > > >>> Many thanks in advance. > > >>> zhenyu > > >>> > > >>> > > >>> > > >>> > > >>> On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@...> wrote: > > >>> > > >>> Whats your cluster topology? How many nodes involved? When you see > > the > > >>>> below message, how many regions in your table? How are you loading > > your > > >>>> table? > > >>>> Thanks, > > >>>> St.Ack > > >>>> > > >>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong < > > zhongresearch@... > > >>>> > > >>>>> wrote: > > >>>>> Nitay, > > >>>>> > > >>>>> I am very appreciated. > > >>>>> > > >>>>> As Ryan suggested, I increased the zookeeper session timeout to > > >>>>> 40seconds > > >>>>> along with the GC options -XX:ParallelGCThreads=8 > > >>>>> > > >>>> -XX:+UseConcMarkSweepGC > > >>>> > > >>>>> in place. I set the Heapsize to 4GB. I also set the > vm.swappiness=0. > > >>>>> > > >>>>> However it still ran into problem. Please find the following > errors. > > >>>>> > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to > > >>>>> contact region server x.x.x.x:60021 for region > > >>>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after > > 10 > > >>>>> attempts. 
> > >>>>> Exceptions: > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > >>>>> setting up proxy to /x.x.x.:60021 after attempts=1 > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > >>>>> > > >>>>> at > > >>>>> > > >>>>> > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001) > > >>>> > > >>>>> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413) > > >>>>> > > >>>>> > > >>>>> The input file is about 10GB around 200million rows of data. > > >>>>> This load doesn't seem too large. However this kind of errors keep > > >>>>> > > >>>> popping > > >>>> > > >>>>> up. > > >>>>> > > >>>>> Does Regionserver need to be deployed to dedicated machines? 
> > >>>>> Does Zookeeper need to be deployed to dedicated machines as well? > > >>>>> > > >>>>> Best, > > >>>>> zhenyu > > >>>>> > > >>>>> > > >>>>> > > >>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@...> wrote: > > >>>>> > > >>>>> Hi Zhenyu, > > >>>>>> > > >>>>>> Sorry for the delay. I started working on this a while back, > before > > I > > >>>>>> > > >>>>> left > > >>>>> > > >>>>>> my job for another company. Since then I haven't had much time to > > work > > >>>>>> > > >>>>> on > > >>>> > > >>>>> HBase unfortunately :(. I'll try to dig up what I had and see what > > >>>>>> > > >>>>> shape > > >>>> > > >>>>> it's in and update you. > > >>>>>> > > >>>>>> Cheers, > > >>>>>> -n > > >>>>>> > > >>>>>> > > >>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote: > > >>>>>> > > >>>>>> Sorry I must have mistyped, I meant to say "40 seconds". You can > > >>>>>> > > >>>>>>> still see multi-second pauses at times, so you need to give > > yourself > > >>>>>>> a > > >>>>>>> bigger buffer. > > >>>>>>> > > >>>>>>> The parallel threads argument should not be necessary, but you do > > >>>>>>> need > > >>>>>>> the UseConcMarkSweepGC flag as well. > > >>>>>>> > > >>>>>>> Let us know how it goes! > > >>>>>>> -ryan > > >>>>>>> > > >>>>>>> > > >>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong < > > >>>>>>> > > >>>>>> zhongresearch@...> > > >>>> > > >>>>> wrote: > > >>>>>>> > > >>>>>>> Ryan, > > >>>>>>>> I am very appreciated for your feedbacks. > > >>>>>>>> I have set the zookeeper.session.timeout to seconds which is way > > >>>>>>>> > > >>>>>>> higher > > >>>> > > >>>>> than > > >>>>>>>> 40ms. > > >>>>>>>> In the same time, the -Xms is set to 4GB, which should be > > >>>>>>>> sufficient. 
> > >>>>>>>> I also tried GC options like > > >>>>>>>> > > >>>>>>>> -XX:ParallelGCThreads=8 > > >>>>>>>> -XX:+UseConcMarkSweepGC > > >>>>>>>> > > >>>>>>>> I even set the vm.swappiness=0 > > >>>>>>>> > > >>>>>>>> However, I still came across the problem that a RegionServer > > >>>>>>>> shutdown > > >>>>>>>> itself. > > >>>>>>>> > > >>>>>>>> Best, > > >>>>>>>> zhong > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson < > ryanobjc@...> > > >>>>>>>> > > >>>>>>> wrote: > > >>>>> > > >>>>>> Set the ZK timeout to something like 40ms, and give the GC > enough > > >>>>>>>> > > >>>>>>> Xmx > > >>>> > > >>>>> so you never risk entering the much dreaded > concurrent-mode-failure > > >>>>>>>>> whereby the entire heap must be GCed. > > >>>>>>>>> > > >>>>>>>>> Consider testing Java 7 and the G1 GC. > > >>>>>>>>> > > >>>>>>>>> We could get a JNI thread to do this, but no one has done so > yet. > > I > > >>>>>>>>> > > >>>>>>>> am > > >>>> > > >>>>> personally hoping for G1 and in the meantime overprovision our Xmx > > >>>>>>>>> > > >>>>>>>> to > > >>>> > > >>>>> avoid the concurrent mode failures. > > >>>>>>>>> > > >>>>>>>>> -ryan > > >>>>>>>>> > > >>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong < > > >>>>>>>>> > > >>>>>>>> zhongresearch@...> > > >>>>> > > >>>>>> wrote: > > >>>>>>>>> > > >>>>>>>>> Ryan, > > >>>>>>>>>> > > >>>>>>>>>> Thank you very much. > > >>>>>>>>>> May I ask whether there are any ways to get around this > problem > > to > > >>>>>>>>>> > > >>>>>>>>> make > > >>>>> > > >>>>>> HBase more stable? > > >>>>>>>>>> > > >>>>>>>>>> best, > > >>>>>>>>>> zhong > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson < > > ryanobjc@...> > > >>>>>>>>>> wrote: > > >>>>>>>>>> > > >>>>>>>>>> There isnt any working code yet. Just an idea, and a > prototype. 
> > >>>>>>>>>> > > >>>>>>>>>>> There is some sense that if we can get the G1 GC that we > could > > >>>>>>>>>>> get > > >>>>>>>>>>> > > >>>>>>>>>> rid > > >>>>> > > >>>>>> of all long pauses, and avoid the need for this. > > >>>>>>>>>>> > > >>>>>>>>>>> -ryan > > >>>>>>>>>>> > > >>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong < > > >>>>>>>>>>> zhongresearch@...> > > >>>>>>>>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>> Hi, > > >>>>>>>>>>>> > > >>>>>>>>>>>> I am very interesting to the solution that Joey proposed and > > >>>>>>>>>>>> > > >>>>>>>>>>> would > > >>>> > > >>>>> like > > >>>>>>>>>>> > > >>>>>>>>>> to > > >>>>>>>>>> > > >>>>>>>>>>> have a try. > > >>>>>>>>>>>> Does anyone have any ideas on how to deploy this zk_wrapper > in > > >>>>>>>>>>>> > > >>>>>>>>>>> JNI > > >>>> > > >>>>> integration? > > >>>>>>>>>>>> > > >>>>>>>>>>>> I would be very appreciated. > > >>>>>>>>>>>> > > >>>>>>>>>>>> thanks > > >>>>>>>>>>>> zhong > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>> > > > > > > |
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)

On Thu, Oct 29, 2009 at 2:23 PM, Zhenyu Zhong <zhongresearch@...> wrote:

> I have 19 quorum members now.

That's too many. Have 3 or maybe 5. See the ZooKeeper site for the rationale.

> When I did test on loading data to two columnfamilies of one table in HBase
> using two separate MR jobs, I lost my regionserver and the test failed.
>
> Does HBase allow such table update operation?
>
> The errors I got when I lost my regionserver are:
> 2009-10-29 21:09:34,705 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256849620029, entries=271911, calcsize=63754142, filesize=33975611. New hlog /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256850574705
> 2009-10-29 21:09:50,322 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException

Have you read the 'Getting Started' section and made the necessary changes to the file descriptor limit and the xceiver count?

You will see the above message after a regionserver has restarted and tries to go back to the master (what HBase version is this? I think you said it is 0.20.x).

> java.io.IOException: TIMED OUT
>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2009-10-29 21:09:50,873 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null

This is a timeout against ZK. You've lost your session, so the RS will go down. The connection to ZK is fundamental to HBase. Something is up. In the past others have reported things like incorrect BIOS settings on disks that made the disks run slow, or just something up with the networking. Can you check that all is healthy? You seem to be having too many issues for such a small load on such a large cluster.

St.Ack

> On Thu, Oct 29, 2009 at 2:51 PM, stack <stack@...> wrote:
>
> > On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong <zhongresearch@...
> > >wrote: > > > > > FYI > > > It looks like increasing the number of Zookeeper Quorums can solve the > > > following error message : org.apache.hadoop.hbase. > > > client.NoServerForRegionException: Timed out trying to locate root > region > > > at > > > org.apache.hadoop.hbase. > > > > > > You mean quorum members? How many do you have now? > > > > > > > > > Now I am running Zookeeper quorum on each node I have. > > > However, I am still having issues about losing regionserver. > > > > > > Whats in the logs? > > > > > > > > > > > Is there a way to browse the Znode in zookeeper? > > > > > > > > Type 'zk' in the hbase shell. > > > > You can get to the zk shell from hbase shell. You so things like: > > > > > zk "ls /" > > > > (Yes, quotes needed). > > > > St.Ack > > > > > > > > > thanks > > > zhenyu > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong <zhongresearch@... > > > >wrote: > > > > > > > JG, > > > > > > > > > > > > Thanks a lot for the tips. > > > > I set the HEAP to 4GB and GC options as -XX:ParallelGCThreads=8 > > > > -XX:+UseConcMarkSweepGC. > > > > > > > > I checked the logs in my Master an RS and found the following errors. > > > > Basically, master got exception error while scanning ROOT, then the > > ROOT > > > > region was offline and unset. Thus the regionserver can't get > > > > NotservingRegion errors. > > > > > > > > In the master: > > > > 2009-10-28 19:00:30,591 INFO > > org.apache.hadoop.hbase.master.BaseScanner: > > > > RegionManager.rootScanner scanning meta region {server: x.x.x. 
> > > > x:60021, regionname: -ROOT-,,0, startKey: <>} > > > > 2009-10-28 19:00:30,591 WARN > > org.apache.hadoop.hbase.master.BaseScanner: > > > > Scan ROOT region > > > > java.io.IOException: Call to /x.x.x.x:60021 failed on local > exception: > > > > java.io.EOFException > > > > at > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757) > > > > at > > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727) > > > > at > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > > > > at $Proxy1.openScanner(Unknown Source) > > > > at > > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > > > > at > > > > > > org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54) > > > > at > > > > > > > > > > org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79) > > > > at > > > > > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > > > > Caused by: java.io.EOFException > > > > at java.io.DataInputStream.readInt(DataInputStream.java:375) > > > > at > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504) > > > > at > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448) > > > > 2009-10-28 19:00:30,591 INFO > > org.apache.hadoop.hbase.master.BaseScanner: > > > > RegionManager.metaScanner scanning meta region {server: x.x.x. 
> > > > x:60021, regionname: .META.,,1, startKey: <>} > > > > 2009-10-28 19:00:30,591 WARN > > org.apache.hadoop.hbase.master.BaseScanner: > > > > Scan one META region: {server: x.x.x.x:60021, regionname: .M > > > > ETA.,,1, startKey: <>} > > > > java.net.ConnectException: Connection refused > > > > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > > > > at > > > > > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) > > > > at > > > > > > > > > > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > > > > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404) > > > > at > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308) > > > > at > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831) > > > > at > > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712) > > > > at > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > > > > at $Proxy1.openScanner(Unknown Source) > > > > at > > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > > > > at > > > > > > > > > > org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73) > > > > at > > > > > > > > > > org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129) > > > > at > > > > > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > > > > 2009-10-28 19:00:30,591 INFO > > org.apache.hadoop.hbase.master.BaseScanner: > > > > All 1 .META. 
region(s) scanned > > > > 2009-10-28 19:00:31,395 INFO > > > org.apache.hadoop.hbase.master.ServerManager: > > > > Removing server's info YYYY,60021,125675547057 > > > > 0 > > > > 2009-10-28 19:00:31,395 INFO > > > org.apache.hadoop.hbase.master.RegionManager: > > > > Offlined ROOT server: x.x.x.x:60021 > > > > > > > > 2009-10-28 19:00:31,395 INFO > > > org.apache.hadoop.hbase.master.RegionManager: > > > > -ROOT- region unset (but not set to be reassigned) > > > > 2009-10-28 19:00:31,395 INFO > > > org.apache.hadoop.hbase.master.RegionManager: > > > > ROOT inserted into regionsInTransition > > > > 2009-10-28 19:00:31,395 INFO > > > org.apache.hadoop.hbase.master.RegionManager: > > > > Offlining META region: {server: x.x.x.x:60021, regionname: .META.,,1, > > > > startKey: <>} > > > > 2009-10-28 19:00:31,395 INFO > > > org.apache.hadoop.hbase.master.RegionManager: > > > > META region removed from onlineMetaRegions > > > > > > > > > > > > > > > > On the regionserver: > > > > 2009-10-28 18:51:14,578 INFO > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: > > > > test,,1256755871065 > > > > 2009-10-28 18:51:14,578 INFO > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: > > > MSG_REGION_OPEN: > > > > test,,1256755871065 > > > > 2009-10-28 18:51:14,578 INFO > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > region test,,1256755871065/796855017 available; sequence id is > 10013291 > > > > 2009-10-28 18:51:14,578 INFO > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > Starting compaction on region test,,1256755871065 > > > > 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got > ping > > > > response for sessionid:0x249c76021d0001 after 0ms > > > > 2009-10-28 18:51:19,341 ERROR > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: > > > > org.apache.hadoop.hbase.NotServingRegionException: > test,,1256754924503 > > > > at > > > > > > > > > > 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307) > > > > at > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784) > > > > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown > Source) > > > > at > > > > > > > > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > > > > at java.lang.reflect.Method.invoke(Method.java:597) > > > > at > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648) > > > > at > > > > > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) > > > > 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC > > > Server > > > > handler 0 on 60021, call get([B@21fefd80, row=1053508149, > > maxVersions=1, > > > > timeRange=[0,9223372036854775807), > families={(family=email_ip_activity, > > > > columns=ALL}) from x.x.x.x:54669: error: > > > > org.apache.hadoop.hbase.NotServingRegionException: > test,,1256754924503 > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <jlist@...> > > > wrote: > > > > > > > >> These client error messages are not particular descriptive as to the > > > root > > > >> cause (they are fatal errors, or close to it). > > > >> > > > >> What is going on in your regionservers when these errors happen? > > Check > > > >> the master and RS logs. > > > >> > > > >> Also, you definitely do not want 19 zookeeper nodes. Reduce that to > 3 > > > or > > > >> 5 max. > > > >> > > > >> What is the hardware you are using for these nodes, and what > settings > > do > > > >> you have for heap/GC? > > > >> > > > >> JG > > > >> > > > >> > > > >> Zhenyu Zhong wrote: > > > >> > > > >>> Stack, > > > >>> > > > >>> Thank you very much for your comments. > > > >>> I am running a cluster with 20 nodes. I set 19 as both regionserver > > and > > > >>> zookeeper quorums. 
> > > >>> The versions I am using are Hadoop0.20.1 and HBase0.20.1. > > > >>> I started with an empty table and try to load 200 million records > > into > > > >>> that > > > >>> empty table. > > > >>> There is a key in each record. Logically, in my MR program, during > > the > > > >>> setup, I opened an HTable, in my mapper, I fetch the record from > > HTable > > > >>> via > > > >>> key in the record, then make some changes to the columns and update > > > that > > > >>> row > > > >>> back to HTable through TableOutputFormat by passing a put. There is > > no > > > >>> reduce tasks involved here. (Though it is unnecessary to fetch row > > > from > > > >>> an > > > >>> empty table, I just intended to do that) > > > >>> > > > >>> Additionally, when I reduced the number of regionservers and number > > of > > > >>> zookeeper quorums. > > > >>> I had different errors: > > > >>> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed > out > > > >>> trying > > > >>> to locate root region at > > > >>> > > > >>> > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929) > > > >>> at > > > >>> > > > >>> > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580) > > > >>> at > > > >>> > > > >>> > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > > > >>> at > > > >>> > > > >>> > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > > > >>> at > > > >>> > > > >>> > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589) > > > >>> at > > > >>> > > > >>> > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > > > >>> at > > > >>> > > > >>> > > > > > > 
org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > > > >>> at > > > >>> > > > >>> > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593) > > > >>> at > > > >>> > > > >>> > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556) > > > >>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:127) at > > > >>> org.apache.hadoop.hbase.client.HTable.(HTable.java:105) at > > > >>> > > > >>> > > > > > > org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116) > > > >>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573) > at > > > >>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at > > > >>> org.apache.hadoop.mapred.Child.main(Child.java:170) > > > >>> > > > >>> Many thanks in advance. > > > >>> zhenyu > > > >>> > > > >>> > > > >>> > > > >>> > > > >>> On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@...> wrote: > > > >>> > > > >>> Whats your cluster topology? How many nodes involved? When you > see > > > the > > > >>>> below message, how many regions in your table? How are you > loading > > > your > > > >>>> table? > > > >>>> Thanks, > > > >>>> St.Ack > > > >>>> > > > >>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong < > > > zhongresearch@... > > > >>>> > > > >>>>> wrote: > > > >>>>> Nitay, > > > >>>>> > > > >>>>> I am very appreciated. > > > >>>>> > > > >>>>> As Ryan suggested, I increased the zookeeper session timeout to > > > >>>>> 40seconds > > > >>>>> along with the GC options -XX:ParallelGCThreads=8 > > > >>>>> > > > >>>> -XX:+UseConcMarkSweepGC > > > >>>> > > > >>>>> in place. I set the Heapsize to 4GB. I also set the > > vm.swappiness=0. > > > >>>>> > > > >>>>> However it still ran into problem. Please find the following > > errors. 
> > > >>>>> > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying > to > > > >>>>> contact region server x.x.x.x:60021 for region > > > >>>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed > after > > > 10 > > > >>>>> attempts. > > > >>>>> Exceptions: > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > > >>>>> setting up proxy to /x.x.x.:60021 after attempts=1 > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > >>>>> > > > >>>>> at > > > >>>>> > > > >>>>> > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001) > > > >>>> > > > >>>>> at > org.apache.hadoop.hbase.client.HTable.get(HTable.java:413) > > > >>>>> > > > >>>>> > > > 
>>>>> The input file is about 10GB around 200million rows of data. > > > >>>>> This load doesn't seem too large. However this kind of errors > keep > > > >>>>> > > > >>>> popping > > > >>>> > > > >>>>> up. > > > >>>>> > > > >>>>> Does Regionserver need to be deployed to dedicated machines? > > > >>>>> Does Zookeeper need to be deployed to dedicated machines as well? > > > >>>>> > > > >>>>> Best, > > > >>>>> zhenyu > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@...> wrote: > > > >>>>> > > > >>>>> Hi Zhenyu, > > > >>>>>> > > > >>>>>> Sorry for the delay. I started working on this a while back, > > before > > > I > > > >>>>>> > > > >>>>> left > > > >>>>> > > > >>>>>> my job for another company. Since then I haven't had much time > to > > > work > > > >>>>>> > > > >>>>> on > > > >>>> > > > >>>>> HBase unfortunately :(. I'll try to dig up what I had and see > what > > > >>>>>> > > > >>>>> shape > > > >>>> > > > >>>>> it's in and update you. > > > >>>>>> > > > >>>>>> Cheers, > > > >>>>>> -n > > > >>>>>> > > > >>>>>> > > > >>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote: > > > >>>>>> > > > >>>>>> Sorry I must have mistyped, I meant to say "40 seconds". You > can > > > >>>>>> > > > >>>>>>> still see multi-second pauses at times, so you need to give > > > yourself > > > >>>>>>> a > > > >>>>>>> bigger buffer. > > > >>>>>>> > > > >>>>>>> The parallel threads argument should not be necessary, but you > do > > > >>>>>>> need > > > >>>>>>> the UseConcMarkSweepGC flag as well. > > > >>>>>>> > > > >>>>>>> Let us know how it goes! > > > >>>>>>> -ryan > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong < > > > >>>>>>> > > > >>>>>> zhongresearch@...> > > > >>>> > > > >>>>> wrote: > > > >>>>>>> > > > >>>>>>> Ryan, > > > >>>>>>>> I am very appreciated for your feedbacks. 
> > > >>>>>>>> I have set the zookeeper.session.timeout to seconds which is > way > > > >>>>>>>> > > > >>>>>>> higher > > > >>>> > > > >>>>> than > > > >>>>>>>> 40ms. > > > >>>>>>>> In the same time, the -Xms is set to 4GB, which should be > > > >>>>>>>> sufficient. > > > >>>>>>>> I also tried GC options like > > > >>>>>>>> > > > >>>>>>>> -XX:ParallelGCThreads=8 > > > >>>>>>>> -XX:+UseConcMarkSweepGC > > > >>>>>>>> > > > >>>>>>>> I even set the vm.swappiness=0 > > > >>>>>>>> > > > >>>>>>>> However, I still came across the problem that a RegionServer > > > >>>>>>>> shutdown > > > >>>>>>>> itself. > > > >>>>>>>> > > > >>>>>>>> Best, > > > >>>>>>>> zhong > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson < > > ryanobjc@...> > > > >>>>>>>> > > > >>>>>>> wrote: > > > >>>>> > > > >>>>>> Set the ZK timeout to something like 40ms, and give the GC > > enough > > > >>>>>>>> > > > >>>>>>> Xmx > > > >>>> > > > >>>>> so you never risk entering the much dreaded > > concurrent-mode-failure > > > >>>>>>>>> whereby the entire heap must be GCed. > > > >>>>>>>>> > > > >>>>>>>>> Consider testing Java 7 and the G1 GC. > > > >>>>>>>>> > > > >>>>>>>>> We could get a JNI thread to do this, but no one has done so > > yet. > > > I > > > >>>>>>>>> > > > >>>>>>>> am > > > >>>> > > > >>>>> personally hoping for G1 and in the meantime overprovision our > Xmx > > > >>>>>>>>> > > > >>>>>>>> to > > > >>>> > > > >>>>> avoid the concurrent mode failures. > > > >>>>>>>>> > > > >>>>>>>>> -ryan > > > >>>>>>>>> > > > >>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong < > > > >>>>>>>>> > > > >>>>>>>> zhongresearch@...> > > > >>>>> > > > >>>>>> wrote: > > > >>>>>>>>> > > > >>>>>>>>> Ryan, > > > >>>>>>>>>> > > > >>>>>>>>>> Thank you very much. > > > >>>>>>>>>> May I ask whether there are any ways to get around this > > problem > > > to > > > >>>>>>>>>> > > > >>>>>>>>> make > > > >>>>> > > > >>>>>> HBase more stable? 
> > > >>>>>>>>>> > > > >>>>>>>>>> best, > > > >>>>>>>>>> zhong > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson < > > > ryanobjc@...> > > > >>>>>>>>>> wrote: > > > >>>>>>>>>> > > > >>>>>>>>>> There isnt any working code yet. Just an idea, and a > > prototype. > > > >>>>>>>>>> > > > >>>>>>>>>>> There is some sense that if we can get the G1 GC that we > > could > > > >>>>>>>>>>> get > > > >>>>>>>>>>> > > > >>>>>>>>>> rid > > > >>>>> > > > >>>>>> of all long pauses, and avoid the need for this. > > > >>>>>>>>>>> > > > >>>>>>>>>>> -ryan > > > >>>>>>>>>>> > > > >>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong < > > > >>>>>>>>>>> zhongresearch@...> > > > >>>>>>>>>>> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>> Hi, > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> I am very interesting to the solution that Joey proposed > and > > > >>>>>>>>>>>> > > > >>>>>>>>>>> would > > > >>>> > > > >>>>> like > > > >>>>>>>>>>> > > > >>>>>>>>>> to > > > >>>>>>>>>> > > > >>>>>>>>>>> have a try. > > > >>>>>>>>>>>> Does anyone have any ideas on how to deploy this > zk_wrapper > > in > > > >>>>>>>>>>>> > > > >>>>>>>>>>> JNI > > > >>>> > > > >>>>> integration? > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> I would be very appreciated. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> thanks > > > >>>>>>>>>>>> zhong > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>> > > > > > > > > > > |
|
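A note on the advice in this thread to run 3 or 5 ZooKeeper quorum members rather than 19: ZooKeeper commits an update only after a strict majority of the ensemble has acknowledged it, so a larger ensemble raises the number of acknowledgements every write waits for. The sketch below is plain arithmetic to illustrate that trade-off. The class and method names are hypothetical and nothing here calls into ZooKeeper or HBase.

```java
// Illustrative arithmetic only; QuorumMath is a hypothetical name, not a ZooKeeper class.
final class QuorumMath {
    /** Smallest number of members that forms a strict majority of the ensemble. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** Number of members that can be lost while a majority can still be formed. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        // Compare the sizes discussed in the thread.
        for (int n : new int[] {3, 5, 19}) {
            System.out.printf("members=%d majority=%d toleratedFailures=%d%n",
                    n, majority(n), toleratedFailures(n));
        }
    }
}
```

With 19 members each write needs 10 acknowledgements, against 2 for a 3-member ensemble and 3 for a 5-member one, which is one reason small odd-sized ensembles are the usual recommendation.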
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
The only other thing that could have been running is another MR job working on a different dataset at the same time as this test. So some nodes might have been under heavy load. I am wondering whether that could cause the connection timeout.

thanks
zhenyu

On Thu, Oct 29, 2009 at 5:32 PM, stack <stack@...> wrote: > On Thu, Oct 29, 2009 at 2:23 PM, Zhenyu Zhong <zhongresearch@... > >wrote: > > > I have 19 quorum members now. > > > > Thats too many. Have 3 or maybe 5. See zk site for rationale. > > > > > When I did test on loading data to two columnfamilies of one table in > HBase > > using two seperate MR jobs, I lost my regionserver and the test failed. > > > > Does HBase allow such table update operation? > > > > The errors I got while I lost my regionserver is: > > 2009-10-29 21:09:34,705 INFO org.apache.hadoop.hbase.regionserver.HLog: > > Roll > > /hbase/.logs/YYYY,60021,1256849619429/hlog.d > > at.1256849620029, entries=271911, calcsize=63754142, filesize=33975611. > New > > hlog /hbase/.logs/YYYY,60021,1256849619429/hl > > og.dat.1256850574705 > > 2009-10-29 21:09:50,322 WARN > > org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1 > > org.apache.hadoop.hbase.Leases$LeaseStillHeldException > > > > > You have read the 'Getting Started' and made the necessary changes to > filedescriptors and xceiver count? > > You will see above message after a regionserver has restarted and tries to > go back to the master (what hbase is this? I think you said it 0.20.x). > > > > > > java.io.IOException: TIMED OUT > > at > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) > > 2009-10-29 21:09:50,873 INFO > > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, > > state: Disconnected, type: None, path: > > null > > > > This is timeout against zk. You've lost your session. The RS will go > down. The connection to zk is basic to hbase. Something is up. In the > past others have reported things like incorrect bios settings on disks that > have made the disks run slow or just something up with the networking.
Can > you check all is healthy? You seem to be having too many issues for such a > small loading with such a large cluster. > > St.Ack > > > > > > > > > > > > > On Thu, Oct 29, 2009 at 2:51 PM, stack <stack@...> wrote: > > > > > On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong < > zhongresearch@... > > > >wrote: > > > > > > > FYI > > > > It looks like increasing the number of Zookeeper Quorums can solve > the > > > > following error message : org.apache.hadoop.hbase. > > > > client.NoServerForRegionException: Timed out trying to locate root > > region > > > > at > > > > org.apache.hadoop.hbase. > > > > > > > > You mean quorum members? How many do you have now? > > > > > > > > > > > > > Now I am running Zookeeper quorum on each node I have. > > > > However, I am still having issues about losing regionserver. > > > > > > > > Whats in the logs? > > > > > > > > > > > > > > > > Is there a way to browse the Znode in zookeeper? > > > > > > > > > > > Type 'zk' in the hbase shell. > > > > > > You can get to the zk shell from hbase shell. You so things like: > > > > > > > zk "ls /" > > > > > > (Yes, quotes needed). > > > > > > St.Ack > > > > > > > > > > > > > thanks > > > > zhenyu > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong < > zhongresearch@... > > > > >wrote: > > > > > > > > > JG, > > > > > > > > > > > > > > > Thanks a lot for the tips. > > > > > I set the HEAP to 4GB and GC options as -XX:ParallelGCThreads=8 > > > > > -XX:+UseConcMarkSweepGC. > > > > > > > > > > I checked the logs in my Master an RS and found the following > errors. > > > > > Basically, master got exception error while scanning ROOT, then the > > > ROOT > > > > > region was offline and unset. Thus the regionserver can't get > > > > > NotservingRegion errors. 
> > > > > > > > > > In the master: > > > > > 2009-10-28 19:00:30,591 INFO > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > RegionManager.rootScanner scanning meta region {server: x.x.x. > > > > > x:60021, regionname: -ROOT-,,0, startKey: <>} > > > > > 2009-10-28 19:00:30,591 WARN > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > Scan ROOT region > > > > > java.io.IOException: Call to /x.x.x.x:60021 failed on local > > exception: > > > > > java.io.EOFException > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757) > > > > > at > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727) > > > > > at > > > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > > > > > at $Proxy1.openScanner(Unknown Source) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > > > > > at > > > > > > > > > org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79) > > > > > at > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > > > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > > > > > Caused by: java.io.EOFException > > > > > at > java.io.DataInputStream.readInt(DataInputStream.java:375) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448) > > > > > 2009-10-28 19:00:30,591 INFO > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > RegionManager.metaScanner scanning meta region {server: x.x.x. 
> > > > > x:60021, regionname: .META.,,1, startKey: <>} > > > > > 2009-10-28 19:00:30,591 WARN > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > Scan one META region: {server: x.x.x.x:60021, regionname: .M > > > > > ETA.,,1, startKey: <>} > > > > > java.net.ConnectException: Connection refused > > > > > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > > > > > at > > > > > > > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > > > > > at > org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831) > > > > > at > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712) > > > > > at > > > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > > > > > at $Proxy1.openScanner(Unknown Source) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129) > > > > > at > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > > > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > > > > > 2009-10-28 19:00:30,591 INFO > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > All 1 .META. 
region(s) scanned > > > > > 2009-10-28 19:00:31,395 INFO > > > > org.apache.hadoop.hbase.master.ServerManager: > > > > > Removing server's info YYYY,60021,125675547057 > > > > > 0 > > > > > 2009-10-28 19:00:31,395 INFO > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > Offlined ROOT server: x.x.x.x:60021 > > > > > > > > > > 2009-10-28 19:00:31,395 INFO > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > -ROOT- region unset (but not set to be reassigned) > > > > > 2009-10-28 19:00:31,395 INFO > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > ROOT inserted into regionsInTransition > > > > > 2009-10-28 19:00:31,395 INFO > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > Offlining META region: {server: x.x.x.x:60021, regionname: > .META.,,1, > > > > > startKey: <>} > > > > > 2009-10-28 19:00:31,395 INFO > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > META region removed from onlineMetaRegions > > > > > > > > > > > > > > > > > > > > On the regionserver: > > > > > 2009-10-28 18:51:14,578 INFO > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: > MSG_REGION_OPEN: > > > > > test,,1256755871065 > > > > > 2009-10-28 18:51:14,578 INFO > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: > > > > MSG_REGION_OPEN: > > > > > test,,1256755871065 > > > > > 2009-10-28 18:51:14,578 INFO > > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > > region test,,1256755871065/796855017 available; sequence id is > > 10013291 > > > > > 2009-10-28 18:51:14,578 INFO > > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > > Starting compaction on region test,,1256755871065 > > > > > 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got > > ping > > > > > response for sessionid:0x249c76021d0001 after 0ms > > > > > 2009-10-28 18:51:19,341 ERROR > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: > > > > > org.apache.hadoop.hbase.NotServingRegionException: > > 
test,,1256754924503 > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307) > > > > > at > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784) > > > > > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown > > Source) > > > > > at > > > > > > > > > > > > > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > > > > > at java.lang.reflect.Method.invoke(Method.java:597) > > > > > at > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648) > > > > > at > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) > > > > > 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC > > > > Server > > > > > handler 0 on 60021, call get([B@21fefd80, row=1053508149, > > > maxVersions=1, > > > > > timeRange=[0,9223372036854775807), > > families={(family=email_ip_activity, > > > > > columns=ALL}) from x.x.x.x:54669: error: > > > > > org.apache.hadoop.hbase.NotServingRegionException: > > test,,1256754924503 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <jlist@...> > > > > wrote: > > > > > > > > > >> These client error messages are not particular descriptive as to > the > > > > root > > > > >> cause (they are fatal errors, or close to it). > > > > >> > > > > >> What is going on in your regionservers when these errors happen? > > > Check > > > > >> the master and RS logs. > > > > >> > > > > >> Also, you definitely do not want 19 zookeeper nodes. Reduce that > to > > 3 > > > > or > > > > >> 5 max. > > > > >> > > > > >> What is the hardware you are using for these nodes, and what > > settings > > > do > > > > >> you have for heap/GC? 
> > > > >> > > > > >> JG > > > > >> > > > > >> > > > > >> Zhenyu Zhong wrote: > > > > >> > > > > >>> Stack, > > > > >>> > > > > >>> Thank you very much for your comments. > > > > >>> I am running a cluster with 20 nodes. I set 19 as both > regionserver > > > and > > > > >>> zookeeper quorums. > > > > >>> The versions I am using are Hadoop0.20.1 and HBase0.20.1. > > > > >>> I started with an empty table and try to load 200 million records > > > into > > > > >>> that > > > > >>> empty table. > > > > >>> There is a key in each record. Logically, in my MR program, > during > > > the > > > > >>> setup, I opened an HTable, in my mapper, I fetch the record from > > > HTable > > > > >>> via > > > > >>> key in the record, then make some changes to the columns and > update > > > > that > > > > >>> row > > > > >>> back to HTable through TableOutputFormat by passing a put. There > is > > > no > > > > >>> reduce tasks involved here. (Though it is unnecessary to fetch > row > > > > from > > > > >>> an > > > > >>> empty table, I just intended to do that) > > > > >>> > > > > >>> Additionally, when I reduced the number of regionservers and > number > > > of > > > > >>> zookeeper quorums. 
> > > > >>> I had different errors: > > > > >>> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed > > out > > > > >>> trying > > > > >>> to locate root region at > > > > >>> > > > > >>> > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929) > > > > >>> at > > > > >>> > > > > >>> > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580) > > > > >>> at > > > > >>> > > > > >>> > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > > > > >>> at > > > > >>> > > > > >>> > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > > > > >>> at > > > > >>> > > > > >>> > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589) > > > > >>> at > > > > >>> > > > > >>> > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > > > > >>> at > > > > >>> > > > > >>> > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > > > > >>> at > > > > >>> > > > > >>> > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593) > > > > >>> at > > > > >>> > > > > >>> > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556) > > > > >>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:127) at > > > > >>> org.apache.hadoop.hbase.client.HTable.(HTable.java:105) at > > > > >>> > > > > >>> > > > > > > > > > > org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116) > > > > >>> at > 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573) > > at > > > > >>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at > > > > >>> org.apache.hadoop.mapred.Child.main(Child.java:170) > > > > >>> > > > > >>> Many thanks in advance. > > > > >>> zhenyu > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@...> > wrote: > > > > >>> > > > > >>> Whats your cluster topology? How many nodes involved? When you > > see > > > > the > > > > >>>> below message, how many regions in your table? How are you > > loading > > > > your > > > > >>>> table? > > > > >>>> Thanks, > > > > >>>> St.Ack > > > > >>>> > > > > >>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong < > > > > zhongresearch@... > > > > >>>> > > > > >>>>> wrote: > > > > >>>>> Nitay, > > > > >>>>> > > > > >>>>> I am very appreciated. > > > > >>>>> > > > > >>>>> As Ryan suggested, I increased the zookeeper session timeout to > > > > >>>>> 40seconds > > > > >>>>> along with the GC options -XX:ParallelGCThreads=8 > > > > >>>>> > > > > >>>> -XX:+UseConcMarkSweepGC > > > > >>>> > > > > >>>>> in place. I set the Heapsize to 4GB. I also set the > > > vm.swappiness=0. > > > > >>>>> > > > > >>>>> However it still ran into problem. Please find the following > > > errors. > > > > >>>>> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Trying > > to > > > > >>>>> contact region server x.x.x.x:60021 for region > > > > >>>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed > > after > > > > 10 > > > > >>>>> attempts. 
> > > > >>>>> Exceptions: > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed > > > > >>>>> setting up proxy to /x.x.x.:60021 after attempts=1 > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > >>>>> > > > > >>>>> at > > > > >>>>> > > > > >>>>> > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001) > > > > >>>> > > > > >>>>> at > > org.apache.hadoop.hbase.client.HTable.get(HTable.java:413) > > > > >>>>> > > > > >>>>> > > > > >>>>> The input file is about 10GB around 200million rows of data. > > > > >>>>> This load doesn't seem too large. 
However this kind of errors > > keep > > > > >>>>> > > > > >>>> popping > > > > >>>> > > > > >>>>> up. > > > > >>>>> > > > > >>>>> Does Regionserver need to be deployed to dedicated machines? > > > > >>>>> Does Zookeeper need to be deployed to dedicated machines as > well? > > > > >>>>> > > > > >>>>> Best, > > > > >>>>> zhenyu > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@...> > wrote: > > > > >>>>> > > > > >>>>> Hi Zhenyu, > > > > >>>>>> > > > > >>>>>> Sorry for the delay. I started working on this a while back, > > > before > > > > I > > > > >>>>>> > > > > >>>>> left > > > > >>>>> > > > > >>>>>> my job for another company. Since then I haven't had much time > > to > > > > work > > > > >>>>>> > > > > >>>>> on > > > > >>>> > > > > >>>>> HBase unfortunately :(. I'll try to dig up what I had and see > > what > > > > >>>>>> > > > > >>>>> shape > > > > >>>> > > > > >>>>> it's in and update you. > > > > >>>>>> > > > > >>>>>> Cheers, > > > > >>>>>> -n > > > > >>>>>> > > > > >>>>>> > > > > >>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote: > > > > >>>>>> > > > > >>>>>> Sorry I must have mistyped, I meant to say "40 seconds". You > > can > > > > >>>>>> > > > > >>>>>>> still see multi-second pauses at times, so you need to give > > > > yourself > > > > >>>>>>> a > > > > >>>>>>> bigger buffer. > > > > >>>>>>> > > > > >>>>>>> The parallel threads argument should not be necessary, but > you > > do > > > > >>>>>>> need > > > > >>>>>>> the UseConcMarkSweepGC flag as well. > > > > >>>>>>> > > > > >>>>>>> Let us know how it goes! > > > > >>>>>>> -ryan > > > > >>>>>>> > > > > >>>>>>> > > > > >>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong < > > > > >>>>>>> > > > > >>>>>> zhongresearch@...> > > > > >>>> > > > > >>>>> wrote: > > > > >>>>>>> > > > > >>>>>>> Ryan, > > > > >>>>>>>> I am very appreciated for your feedbacks. 
> > > > >>>>>>>> I have set the zookeeper.session.timeout to seconds which is > > way > > > > >>>>>>>> > > > > >>>>>>> higher > > > > >>>> > > > > >>>>> than > > > > >>>>>>>> 40ms. > > > > >>>>>>>> In the same time, the -Xms is set to 4GB, which should be > > > > >>>>>>>> sufficient. > > > > >>>>>>>> I also tried GC options like > > > > >>>>>>>> > > > > >>>>>>>> -XX:ParallelGCThreads=8 > > > > >>>>>>>> -XX:+UseConcMarkSweepGC > > > > >>>>>>>> > > > > >>>>>>>> I even set the vm.swappiness=0 > > > > >>>>>>>> > > > > >>>>>>>> However, I still came across the problem that a RegionServer > > > > >>>>>>>> shutdown > > > > >>>>>>>> itself. > > > > >>>>>>>> > > > > >>>>>>>> Best, > > > > >>>>>>>> zhong > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson < > > > ryanobjc@...> > > > > >>>>>>>> > > > > >>>>>>> wrote: > > > > >>>>> > > > > >>>>>> Set the ZK timeout to something like 40ms, and give the GC > > > enough > > > > >>>>>>>> > > > > >>>>>>> Xmx > > > > >>>> > > > > >>>>> so you never risk entering the much dreaded > > > concurrent-mode-failure > > > > >>>>>>>>> whereby the entire heap must be GCed. > > > > >>>>>>>>> > > > > >>>>>>>>> Consider testing Java 7 and the G1 GC. > > > > >>>>>>>>> > > > > >>>>>>>>> We could get a JNI thread to do this, but no one has done > so > > > yet. > > > > I > > > > >>>>>>>>> > > > > >>>>>>>> am > > > > >>>> > > > > >>>>> personally hoping for G1 and in the meantime overprovision our > > Xmx > > > > >>>>>>>>> > > > > >>>>>>>> to > > > > >>>> > > > > >>>>> avoid the concurrent mode failures. > > > > >>>>>>>>> > > > > >>>>>>>>> -ryan > > > > >>>>>>>>> > > > > >>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong < > > > > >>>>>>>>> > > > > >>>>>>>> zhongresearch@...> > > > > >>>>> > > > > >>>>>> wrote: > > > > >>>>>>>>> > > > > >>>>>>>>> Ryan, > > > > >>>>>>>>>> > > > > >>>>>>>>>> Thank you very much. 
> > > > >>>>>>>>>> May I ask whether there are any ways to get around this > > > problem > > > > to > > > > >>>>>>>>>> > > > > >>>>>>>>> make > > > > >>>>> > > > > >>>>>> HBase more stable? > > > > >>>>>>>>>> > > > > >>>>>>>>>> best, > > > > >>>>>>>>>> zhong > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson < > > > > ryanobjc@...> > > > > >>>>>>>>>> wrote: > > > > >>>>>>>>>> > > > > >>>>>>>>>> There isnt any working code yet. Just an idea, and a > > > prototype. > > > > >>>>>>>>>> > > > > >>>>>>>>>>> There is some sense that if we can get the G1 GC that we > > > could > > > > >>>>>>>>>>> get > > > > >>>>>>>>>>> > > > > >>>>>>>>>> rid > > > > >>>>> > > > > >>>>>> of all long pauses, and avoid the need for this. > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> -ryan > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong < > > > > >>>>>>>>>>> zhongresearch@...> > > > > >>>>>>>>>>> wrote: > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> Hi, > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> I am very interesting to the solution that Joey proposed > > and > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>> would > > > > >>>> > > > > >>>>> like > > > > >>>>>>>>>>> > > > > >>>>>>>>>> to > > > > >>>>>>>>>> > > > > >>>>>>>>>>> have a try. > > > > >>>>>>>>>>>> Does anyone have any ideas on how to deploy this > > zk_wrapper > > > in > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>> JNI > > > > >>>> > > > > >>>>> integration? > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> I would be very appreciated. > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> thanks > > > > >>>>>>>>>>>> zhong > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> > > > > >>> > > > > > > > > > > > > > > > |
|
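On the question of whether a heavily loaded node could cause the ZooKeeper timeout: the session is lost when the client cannot heartbeat within the session timeout, so anything that stalls the process for that long (GC, swapping, CPU starvation) is a candidate. One rough way to look for such stalls is to sleep for a fixed interval and measure how much longer the sleep actually took. The following is a minimal sketch of that idea, not code from HBase or ZooKeeper. The class name and both thresholds are assumptions chosen for illustration.

```java
// Hypothetical stall detector; PauseMonitor and its thresholds are illustrative assumptions.
final class PauseMonitor {
    /** How much longer than requested a sleep took, in milliseconds (never negative). */
    static long overshootMs(long requestedSleepMs, long startNanos, long endNanos) {
        long elapsedMs = (endNanos - startNanos) / 1_000_000L;
        return Math.max(0L, elapsedMs - requestedSleepMs);
    }

    public static void main(String[] args) throws InterruptedException {
        final long sleepMs = 500L;   // assumption: sampling interval
        final long warnMs = 1_000L;  // assumption: report stalls longer than this
        for (int i = 0; i < 20; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMs);
            long stall = overshootMs(sleepMs, start, System.nanoTime());
            if (stall > warnMs) {
                System.out.println("process stalled for about " + stall + " ms");
            }
        }
    }
}
```

Run as a separate process on a suspect node while the other MR job is active, this would only reveal machine-level stalls such as swapping or CPU starvation. It cannot see GC pauses inside the regionserver's own JVM, for which the regionserver's GC log is the better source.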
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
If it stole machine resources, yeah, it could. Do you have anything in place to watch your cluster with? Ganglia or some such, so you can watch the load? Is the machine with the RS that is going down swapping? You could try upping the zk session timeout in your hbase cluster.

St.Ack

On Thu, Oct 29, 2009 at 3:00 PM, Zhenyu Zhong <zhongresearch@...>wrote: > Anything that possibly gets started is another MR job working on other > dataset in the same time as this test was running. So some node might be > under heavy loads. > I am wondering whether that would cause the connection timeout. > > thanks > zhenyu > > > > On Thu, Oct 29, 2009 at 5:32 PM, stack <stack@...> wrote: > > > On Thu, Oct 29, 2009 at 2:23 PM, Zhenyu Zhong <zhongresearch@... > > >wrote: > > > > > I have 19 quorum members now. > > > > > > Thats too many. Have 3 or maybe 5. See zk site for rationale. > > > > > > > > > When I did test on loading data to two columnfamilies of one table in > > HBase > > > using two seperate MR jobs, I lost my regionserver and the test failed. > > > > > > Does HBase allow such table update operation? > > > > > > The errors I got while I lost my regionserver is: > > > 2009-10-29 21:09:34,705 INFO org.apache.hadoop.hbase.regionserver.HLog: > > > Roll > > > /hbase/.logs/YYYY,60021,1256849619429/hlog.d > > > at.1256849620029, entries=271911, calcsize=63754142, filesize=33975611. > > New > > > hlog /hbase/.logs/YYYY,60021,1256849619429/hl > > > og.dat.1256850574705 > > > 2009-10-29 21:09:50,322 WARN > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1 > > > org.apache.hadoop.hbase.Leases$LeaseStillHeldException > > > > > > > > > You have read the 'Getting Started' and made the necessary changes to > > filedescriptors and xceiver count? > > > > You will see above message after a regionserver has restarted and tries > to > > go back to the master (what hbase is this? I think you said it 0.20.x).
> > > > > > > > > > > java.io.IOException: TIMED OUT > > > at > > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) > > > 2009-10-29 21:09:50,873 INFO > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper > event, > > > state: Disconnected, type: None, path: > > > null > > > > > > > This is timeout against zk. You've lost your session. The RS will go > > down. The connection to zk is basic to hbase. Something is up. In the > > past others have reported things like incorrect bios settings on disks > that > > have made the disks run slow or just something up with the networking. > Can > > you check all is healthy? You seem to be having too many issues for such > a > > small loading with such a large cluster. > > > > St.Ack > > > > > > > > > > > > > > > > > > > > > On Thu, Oct 29, 2009 at 2:51 PM, stack <stack@...> wrote: > > > > > > > On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong < > > zhongresearch@... > > > > >wrote: > > > > > > > > > FYI > > > > > It looks like increasing the number of Zookeeper Quorums can solve > > the > > > > > following error message : org.apache.hadoop.hbase. > > > > > client.NoServerForRegionException: Timed out trying to locate root > > > region > > > > > at > > > > > org.apache.hadoop.hbase. > > > > > > > > > > You mean quorum members? How many do you have now? > > > > > > > > > > > > > > > > > Now I am running Zookeeper quorum on each node I have. > > > > > However, I am still having issues about losing regionserver. > > > > > > > > > > Whats in the logs? > > > > > > > > > > > > > > > > > > > > > Is there a way to browse the Znode in zookeeper? > > > > > > > > > > > > > > Type 'zk' in the hbase shell. > > > > > > > > You can get to the zk shell from hbase shell. You so things like: > > > > > > > > > zk "ls /" > > > > > > > > (Yes, quotes needed). 
> > > > > > > > St.Ack > > > > > > > > > > > > > > > > > thanks > > > > > zhenyu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong < > > zhongresearch@... > > > > > >wrote: > > > > > > > > > > > JG, > > > > > > > > > > > > > > > > > > Thanks a lot for the tips. > > > > > > I set the HEAP to 4GB and GC options as -XX:ParallelGCThreads=8 > > > > > > -XX:+UseConcMarkSweepGC. > > > > > > > > > > > > I checked the logs in my Master an RS and found the following > > errors. > > > > > > Basically, master got exception error while scanning ROOT, then > the > > > > ROOT > > > > > > region was offline and unset. Thus the regionserver can't get > > > > > > NotservingRegion errors. > > > > > > > > > > > > In the master: > > > > > > 2009-10-28 19:00:30,591 INFO > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > > RegionManager.rootScanner scanning meta region {server: x.x.x. > > > > > > x:60021, regionname: -ROOT-,,0, startKey: <>} > > > > > > 2009-10-28 19:00:30,591 WARN > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > > Scan ROOT region > > > > > > java.io.IOException: Call to /x.x.x.x:60021 failed on local > > > exception: > > > > > > java.io.EOFException > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757) > > > > > > at > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727) > > > > > > at > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > > > > > > at $Proxy1.openScanner(Unknown Source) > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > > > > > > at > > > > > > > > > > > > org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54) > > > > > > at > > > > > > > > > > > > > > > > > > > > > 
org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79) > > > > > > at > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > > > > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > > > > > > Caused by: java.io.EOFException > > > > > > at > > java.io.DataInputStream.readInt(DataInputStream.java:375) > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504) > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448) > > > > > > 2009-10-28 19:00:30,591 INFO > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > > RegionManager.metaScanner scanning meta region {server: x.x.x. > > > > > > x:60021, regionname: .META.,,1, startKey: <>} > > > > > > 2009-10-28 19:00:30,591 WARN > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > > Scan one META region: {server: x.x.x.x:60021, regionname: .M > > > > > > ETA.,,1, startKey: <>} > > > > > > java.net.ConnectException: Connection refused > > > > > > at sun.nio.ch.SocketChannelImpl.checkConnect(Native > Method) > > > > > > at > > > > > > > > > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > > > > > > at > > org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404) > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308) > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831) > > > > > > at > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712) > > > > > > at > > > > > > > > > 
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > > > > > > at $Proxy1.openScanner(Unknown Source) > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73) > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129) > > > > > > at > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > > > > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > > > > > > 2009-10-28 19:00:30,591 INFO > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > > All 1 .META. region(s) scanned > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > org.apache.hadoop.hbase.master.ServerManager: > > > > > > Removing server's info YYYY,60021,125675547057 > > > > > > 0 > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > > Offlined ROOT server: x.x.x.x:60021 > > > > > > > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > > -ROOT- region unset (but not set to be reassigned) > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > > ROOT inserted into regionsInTransition > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > > Offlining META region: {server: x.x.x.x:60021, regionname: > > .META.,,1, > > > > > > startKey: <>} > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > > META region removed from onlineMetaRegions > > > > > > > > > > > > > > > > > > > > > > > > On the regionserver: > > > > > > 2009-10-28 18:51:14,578 INFO > > > > > > 
org.apache.hadoop.hbase.regionserver.HRegionServer: > > MSG_REGION_OPEN: > > > > > > test,,1256755871065 > > > > > > 2009-10-28 18:51:14,578 INFO > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: > > > > > MSG_REGION_OPEN: > > > > > > test,,1256755871065 > > > > > > 2009-10-28 18:51:14,578 INFO > > > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > > > region test,,1256755871065/796855017 available; sequence id is > > > 10013291 > > > > > > 2009-10-28 18:51:14,578 INFO > > > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > > > Starting compaction on region test,,1256755871065 > > > > > > 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: > Got > > > ping > > > > > > response for sessionid:0x249c76021d0001 after 0ms > > > > > > 2009-10-28 18:51:19,341 ERROR > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: > > > > > > org.apache.hadoop.hbase.NotServingRegionException: > > > test,,1256754924503 > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307) > > > > > > at > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784) > > > > > > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown > > > Source) > > > > > > at > > > > > > > > > > > > > > > > > > > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > > > > > > at java.lang.reflect.Method.invoke(Method.java:597) > > > > > > at > > > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648) > > > > > > at > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) > > > > > > 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: > IPC > > > > > Server > > > > > > handler 0 on 60021, call get([B@21fefd80, row=1053508149, > > > > maxVersions=1, > > > > > > timeRange=[0,9223372036854775807), > > > 
families={(family=email_ip_activity, > > > > > > columns=ALL}) from x.x.x.x:54669: error: > > > > > > org.apache.hadoop.hbase.NotServingRegionException: > > > test,,1256754924503 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray < > jlist@...> > > > > > wrote: > > > > > > > > > > > >> These client error messages are not particular descriptive as to > > the > > > > > root > > > > > >> cause (they are fatal errors, or close to it). > > > > > >> > > > > > >> What is going on in your regionservers when these errors happen? > > > > Check > > > > > >> the master and RS logs. > > > > > >> > > > > > >> Also, you definitely do not want 19 zookeeper nodes. Reduce > that > > to > > > 3 > > > > > or > > > > > >> 5 max. > > > > > >> > > > > > >> What is the hardware you are using for these nodes, and what > > > settings > > > > do > > > > > >> you have for heap/GC? > > > > > >> > > > > > >> JG > > > > > >> > > > > > >> > > > > > >> Zhenyu Zhong wrote: > > > > > >> > > > > > >>> Stack, > > > > > >>> > > > > > >>> Thank you very much for your comments. > > > > > >>> I am running a cluster with 20 nodes. I set 19 as both > > regionserver > > > > and > > > > > >>> zookeeper quorums. > > > > > >>> The versions I am using are Hadoop0.20.1 and HBase0.20.1. > > > > > >>> I started with an empty table and try to load 200 million > records > > > > into > > > > > >>> that > > > > > >>> empty table. > > > > > >>> There is a key in each record. Logically, in my MR program, > > during > > > > the > > > > > >>> setup, I opened an HTable, in my mapper, I fetch the record > from > > > > HTable > > > > > >>> via > > > > > >>> key in the record, then make some changes to the columns and > > update > > > > > that > > > > > >>> row > > > > > >>> back to HTable through TableOutputFormat by passing a put. > There > > is > > > > no > > > > > >>> reduce tasks involved here. 
(Though it is unnecessary to fetch > > row > > > > > from > > > > > >>> an > > > > > >>> empty table, I just intended to do that) > > > > > >>> > > > > > >>> Additionally, when I reduced the number of regionservers and > > number > > > > of > > > > > >>> zookeeper quorums. > > > > > >>> I had different errors: > > > > > >>> org.apache.hadoop.hbase.client.NoServerForRegionException: > Timed > > > out > > > > > >>> trying > > > > > >>> to locate root region at > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929) > > > > > >>> at > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580) > > > > > >>> at > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > > > > > >>> at > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > > > > > >>> at > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589) > > > > > >>> at > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > > > > > >>> at > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > > > > > >>> at > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593) > > > > > >>> at > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > 
org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556) > > > > > >>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:127) at > > > > > >>> org.apache.hadoop.hbase.client.HTable.(HTable.java:105) at > > > > > >>> > > > > > >>> > > > > > > > > > > > > > > > org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116) > > > > > >>> at > > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573) > > > at > > > > > >>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at > > > > > >>> org.apache.hadoop.mapred.Child.main(Child.java:170) > > > > > >>> > > > > > >>> Many thanks in advance. > > > > > >>> zhenyu > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@...> > > wrote: > > > > > >>> > > > > > >>> Whats your cluster topology? How many nodes involved? When > you > > > see > > > > > the > > > > > >>>> below message, how many regions in your table? How are you > > > loading > > > > > your > > > > > >>>> table? > > > > > >>>> Thanks, > > > > > >>>> St.Ack > > > > > >>>> > > > > > >>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong < > > > > > zhongresearch@... > > > > > >>>> > > > > > >>>>> wrote: > > > > > >>>>> Nitay, > > > > > >>>>> > > > > > >>>>> I am very appreciated. > > > > > >>>>> > > > > > >>>>> As Ryan suggested, I increased the zookeeper session timeout > to > > > > > >>>>> 40seconds > > > > > >>>>> along with the GC options -XX:ParallelGCThreads=8 > > > > > >>>>> > > > > > >>>> -XX:+UseConcMarkSweepGC > > > > > >>>> > > > > > >>>>> in place. I set the Heapsize to 4GB. I also set the > > > > vm.swappiness=0. > > > > > >>>>> > > > > > >>>>> However it still ran into problem. Please find the following > > > > errors. 
> > > > > >>>>> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Trying > > > to > > > > > >>>>> contact region server x.x.x.x:60021 for region > > > > > >>>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed > > > after > > > > > 10 > > > > > >>>>> attempts. > > > > > >>>>> Exceptions: > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Failed > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Failed > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Failed > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Failed > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Failed > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Failed > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Failed > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Failed > > > > > >>>>> setting up proxy to /x.x.x.:60021 after attempts=1 > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Failed > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > Failed > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > >>>>> > > > > > >>>>> at > > > > > >>>>> > > > > > >>>>> > > > > > > > > > > > > > > > 
org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001) > > > > > >>>> > > > > > >>>>> at > > > org.apache.hadoop.hbase.client.HTable.get(HTable.java:413) > > > > > >>>>> > > > > > >>>>> > > > > > >>>>> The input file is about 10GB around 200million rows of data. > > > > > >>>>> This load doesn't seem too large. However this kind of errors > > > keep > > > > > >>>>> > > > > > >>>> popping > > > > > >>>> > > > > > >>>>> up. > > > > > >>>>> > > > > > >>>>> Does Regionserver need to be deployed to dedicated machines? > > > > > >>>>> Does Zookeeper need to be deployed to dedicated machines as > > well? > > > > > >>>>> > > > > > >>>>> Best, > > > > > >>>>> zhenyu > > > > > >>>>> > > > > > >>>>> > > > > > >>>>> > > > > > >>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@...> > > wrote: > > > > > >>>>> > > > > > >>>>> Hi Zhenyu, > > > > > >>>>>> > > > > > >>>>>> Sorry for the delay. I started working on this a while back, > > > > before > > > > > I > > > > > >>>>>> > > > > > >>>>> left > > > > > >>>>> > > > > > >>>>>> my job for another company. Since then I haven't had much > time > > > to > > > > > work > > > > > >>>>>> > > > > > >>>>> on > > > > > >>>> > > > > > >>>>> HBase unfortunately :(. I'll try to dig up what I had and see > > > what > > > > > >>>>>> > > > > > >>>>> shape > > > > > >>>> > > > > > >>>>> it's in and update you. > > > > > >>>>>> > > > > > >>>>>> Cheers, > > > > > >>>>>> -n > > > > > >>>>>> > > > > > >>>>>> > > > > > >>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote: > > > > > >>>>>> > > > > > >>>>>> Sorry I must have mistyped, I meant to say "40 seconds". > You > > > can > > > > > >>>>>> > > > > > >>>>>>> still see multi-second pauses at times, so you need to give > > > > > yourself > > > > > >>>>>>> a > > > > > >>>>>>> bigger buffer. 
> > > > > >>>>>>> > > > > > >>>>>>> The parallel threads argument should not be necessary, but > > you > > > do > > > > > >>>>>>> need > > > > > >>>>>>> the UseConcMarkSweepGC flag as well. > > > > > >>>>>>> > > > > > >>>>>>> Let us know how it goes! > > > > > >>>>>>> -ryan > > > > > >>>>>>> > > > > > >>>>>>> > > > > > >>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong < > > > > > >>>>>>> > > > > > >>>>>> zhongresearch@...> > > > > > >>>> > > > > > >>>>> wrote: > > > > > >>>>>>> > > > > > >>>>>>> Ryan, > > > > > >>>>>>>> I am very appreciated for your feedbacks. > > > > > >>>>>>>> I have set the zookeeper.session.timeout to seconds which > is > > > way > > > > > >>>>>>>> > > > > > >>>>>>> higher > > > > > >>>> > > > > > >>>>> than > > > > > >>>>>>>> 40ms. > > > > > >>>>>>>> In the same time, the -Xms is set to 4GB, which should be > > > > > >>>>>>>> sufficient. > > > > > >>>>>>>> I also tried GC options like > > > > > >>>>>>>> > > > > > >>>>>>>> -XX:ParallelGCThreads=8 > > > > > >>>>>>>> -XX:+UseConcMarkSweepGC > > > > > >>>>>>>> > > > > > >>>>>>>> I even set the vm.swappiness=0 > > > > > >>>>>>>> > > > > > >>>>>>>> However, I still came across the problem that a > RegionServer > > > > > >>>>>>>> shutdown > > > > > >>>>>>>> itself. > > > > > >>>>>>>> > > > > > >>>>>>>> Best, > > > > > >>>>>>>> zhong > > > > > >>>>>>>> > > > > > >>>>>>>> > > > > > >>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson < > > > > ryanobjc@...> > > > > > >>>>>>>> > > > > > >>>>>>> wrote: > > > > > >>>>> > > > > > >>>>>> Set the ZK timeout to something like 40ms, and give the GC > > > > enough > > > > > >>>>>>>> > > > > > >>>>>>> Xmx > > > > > >>>> > > > > > >>>>> so you never risk entering the much dreaded > > > > concurrent-mode-failure > > > > > >>>>>>>>> whereby the entire heap must be GCed. > > > > > >>>>>>>>> > > > > > >>>>>>>>> Consider testing Java 7 and the G1 GC. 
> > > > > >>>>>>>>> > > > > > >>>>>>>>> We could get a JNI thread to do this, but no one has done > > so > > > > yet. > > > > > I > > > > > >>>>>>>>> > > > > > >>>>>>>> am > > > > > >>>> > > > > > >>>>> personally hoping for G1 and in the meantime overprovision > our > > > Xmx > > > > > >>>>>>>>> > > > > > >>>>>>>> to > > > > > >>>> > > > > > >>>>> avoid the concurrent mode failures. > > > > > >>>>>>>>> > > > > > >>>>>>>>> -ryan > > > > > >>>>>>>>> > > > > > >>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong < > > > > > >>>>>>>>> > > > > > >>>>>>>> zhongresearch@...> > > > > > >>>>> > > > > > >>>>>> wrote: > > > > > >>>>>>>>> > > > > > >>>>>>>>> Ryan, > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> Thank you very much. > > > > > >>>>>>>>>> May I ask whether there are any ways to get around this > > > > problem > > > > > to > > > > > >>>>>>>>>> > > > > > >>>>>>>>> make > > > > > >>>>> > > > > > >>>>>> HBase more stable? > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> best, > > > > > >>>>>>>>>> zhong > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson < > > > > > ryanobjc@...> > > > > > >>>>>>>>>> wrote: > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> There isnt any working code yet. Just an idea, and a > > > > prototype. > > > > > >>>>>>>>>> > > > > > >>>>>>>>>>> There is some sense that if we can get the G1 GC that > we > > > > could > > > > > >>>>>>>>>>> get > > > > > >>>>>>>>>>> > > > > > >>>>>>>>>> rid > > > > > >>>>> > > > > > >>>>>> of all long pauses, and avoid the need for this. 
> > > > > >>>>>>>>>>> > > > > > >>>>>>>>>>> -ryan > > > > > >>>>>>>>>>> > > > > > >>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong < > > > > > >>>>>>>>>>> zhongresearch@...> > > > > > >>>>>>>>>>> wrote: > > > > > >>>>>>>>>>> > > > > > >>>>>>>>>>> Hi, > > > > > >>>>>>>>>>>> > > > > > >>>>>>>>>>>> I am very interesting to the solution that Joey > proposed > > > and > > > > > >>>>>>>>>>>> > > > > > >>>>>>>>>>> would > > > > > >>>> > > > > > >>>>> like > > > > > >>>>>>>>>>> > > > > > >>>>>>>>>> to > > > > > >>>>>>>>>> > > > > > >>>>>>>>>>> have a try. > > > > > >>>>>>>>>>>> Does anyone have any ideas on how to deploy this > > > zk_wrapper > > > > in > > > > > >>>>>>>>>>>> > > > > > >>>>>>>>>>> JNI > > > > > >>>> > > > > > >>>>> integration? > > > > > >>>>>>>>>>>> > > > > > >>>>>>>>>>>> I would be very appreciated. > > > > > >>>>>>>>>>>> > > > > > >>>>>>>>>>>> thanks > > > > > >>>>>>>>>>>> zhong > > > > > >>>>>>>>>>>> > > > > > >>>>>>>>>>>> > > > > > >>>>>>>>>>>> > > > > > >>> > > > > > > > > > > > > > > > > > > > > > |
|
|
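A note on the advice quoted above to cut the ZooKeeper ensemble from 19 members to 3 or 5: ZooKeeper needs a strict majority of ensemble members to agree before a write is committed, so a larger ensemble means more acknowledgements per write while fault tolerance grows only slowly. The sketch below just works through that arithmetic. The class and method names are made up for illustration and are not part of HBase or ZooKeeper.

```java
public class QuorumMath {
    // Smallest strict majority of an ensemble with n voting members.
    static int majority(int n) {
        return n / 2 + 1;
    }

    // Number of members that can fail while a majority is still reachable.
    static int tolerated(int n) {
        return n - majority(n);
    }

    public static void main(String[] args) {
        // Ensemble sizes mentioned in the thread: 3, 5, and 19.
        for (int n : new int[] {3, 5, 19}) {
            System.out.println(n + " members: majority=" + majority(n)
                    + ", tolerated failures=" + tolerated(n));
        }
    }
}
```

Under this majority rule, a 19-member ensemble needs 10 members to agree on every write, against 2 or 3 for the sizes suggested in the thread, which is presumably part of the rationale the posters point to on the ZooKeeper site.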
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Stack,
I really appreciate your help. Yes, I am using Cacti to monitor the loads etc. I also upped my zk session timeout to 600 sec. May I ask what the default connection timeout is for a ZooKeeper client connecting to a quorum?
thanks
zhenyu
On Thu, Oct 29, 2009 at 6:06 PM, stack <stack@...> wrote: > If it stole machine resources, yeah, it could. Do you have anything to > watch your cluster with in place? Ganglia or some such so you can watch > the > loadings? Is the machine with the RS that is going down swapping? You > could try upping your zk session timeout in your hbase cluster. > St.Ack > > On Thu, Oct 29, 2009 at 3:00 PM, Zhenyu Zhong <zhongresearch@... > >wrote: > > > Anything that possibly gets started is another MR job working on other > > dataset in the same time as this test was running. So some node might be > > under heavy loads. > > I am wondering whether that would cause the connection timeout. > > > > thanks > > zhenyu > > > > > > > > On Thu, Oct 29, 2009 at 5:32 PM, stack <stack@...> wrote: > > > > > On Thu, Oct 29, 2009 at 2:23 PM, Zhenyu Zhong <zhongresearch@... > > > >wrote: > > > > > > > I have 19 quorum members now. > > > > > > > > Thats too many. Have 3 or maybe 5. See zk site for rationale. > > > > > > > > > > > > > When I did test on loading data to two columnfamilies of one table in > > > HBase > > > > using two seperate MR jobs, I lost my regionserver and the test > failed. > > > > > > > > Does HBase allow such table update operation? > > > > > > > > The errors I got while I lost my regionserver is: > > > > 2009-10-29 21:09:34,705 INFO > org.apache.hadoop.hbase.regionserver.HLog: > > > > Roll > > > > /hbase/.logs/YYYY,60021,1256849619429/hlog.d > > > > at.1256849620029, entries=271911, calcsize=63754142, > filesize=33975611.
> > > New > > > > hlog /hbase/.logs/YYYY,60021,1256849619429/hl > > > > og.dat.1256850574705 > > > > 2009-10-29 21:09:50,322 WARN > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1 > > > > org.apache.hadoop.hbase.Leases$LeaseStillHeldException > > > > > > > > > > > > > You have read the 'Getting Started' and made the necessary changes to > > > filedescriptors and xceiver count? > > > > > > You will see above message after a regionserver has restarted and tries > > to > > > go back to the master (what hbase is this? I think you said it 0.20.x). > > > > > > > > > > > > > > > > java.io.IOException: TIMED OUT > > > > at > > > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) > > > > 2009-10-29 21:09:50,873 INFO > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper > > event, > > > > state: Disconnected, type: None, path: > > > > null > > > > > > > > > > This is timeout against zk. You've lost your session. The RS will go > > > down. The connection to zk is basic to hbase. Something is up. In > the > > > past others have reported things like incorrect bios settings on disks > > that > > > have made the disks run slow or just something up with the networking. > > Can > > > you check all is healthy? You seem to be having too many issues for > such > > a > > > small loading with such a large cluster. > > > > > > St.Ack > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Oct 29, 2009 at 2:51 PM, stack <stack@...> wrote: > > > > > > > > > On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong < > > > zhongresearch@... > > > > > >wrote: > > > > > > > > > > > FYI > > > > > > It looks like increasing the number of Zookeeper Quorums can > solve > > > the > > > > > > following error message : org.apache.hadoop.hbase. > > > > > > client.NoServerForRegionException: Timed out trying to locate > root > > > > region > > > > > > at > > > > > > org.apache.hadoop.hbase. > > > > > > > > > > > > You mean quorum members? 
How many do you have now? > > > > > > > > > > > > > > > > > > > > > Now I am running Zookeeper quorum on each node I have. > > > > > > However, I am still having issues about losing regionserver. > > > > > > > > > > > > Whats in the logs? > > > > > > > > > > > > > > > > > > > > > > > > > > Is there a way to browse the Znode in zookeeper? > > > > > > > > > > > > > > > > > Type 'zk' in the hbase shell. > > > > > > > > > > You can get to the zk shell from hbase shell. You so things like: > > > > > > > > > > > zk "ls /" > > > > > > > > > > (Yes, quotes needed). > > > > > > > > > > St.Ack > > > > > > > > > > > > > > > > > > > > > thanks > > > > > > zhenyu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong < > > > zhongresearch@... > > > > > > >wrote: > > > > > > > > > > > > > JG, > > > > > > > > > > > > > > > > > > > > > Thanks a lot for the tips. > > > > > > > I set the HEAP to 4GB and GC options as -XX:ParallelGCThreads=8 > > > > > > > -XX:+UseConcMarkSweepGC. > > > > > > > > > > > > > > I checked the logs in my Master an RS and found the following > > > errors. > > > > > > > Basically, master got exception error while scanning ROOT, then > > the > > > > > ROOT > > > > > > > region was offline and unset. Thus the regionserver can't get > > > > > > > NotservingRegion errors. > > > > > > > > > > > > > > In the master: > > > > > > > 2009-10-28 19:00:30,591 INFO > > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > > > RegionManager.rootScanner scanning meta region {server: x.x.x. 
> > > > > > > x:60021, regionname: -ROOT-,,0, startKey: <>} > > > > > > > 2009-10-28 19:00:30,591 WARN > > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > > > Scan ROOT region > > > > > > > java.io.IOException: Call to /x.x.x.x:60021 failed on local > > > > exception: > > > > > > > java.io.EOFException > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757) > > > > > > > at > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727) > > > > > > > at > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > > > > > > > at $Proxy1.openScanner(Unknown Source) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > > > > > > > at > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79) > > > > > > > at > > > > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > > > > > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > > > > > > > Caused by: java.io.EOFException > > > > > > > at > > > java.io.DataInputStream.readInt(DataInputStream.java:375) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448) > > > > > > > 2009-10-28 19:00:30,591 INFO > > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > > > RegionManager.metaScanner scanning meta region {server: x.x.x. 
> > > > > > > x:60021, regionname: .META.,,1, startKey: <>} > > > > > > > 2009-10-28 19:00:30,591 WARN > > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > > > Scan one META region: {server: x.x.x.x:60021, regionname: .M > > > > > > > ETA.,,1, startKey: <>} > > > > > > > java.net.ConnectException: Connection refused > > > > > > > at sun.nio.ch.SocketChannelImpl.checkConnect(Native > > Method) > > > > > > > at > > > > > > > > > > > > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > > > > > > > at > > > org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831) > > > > > > > at > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712) > > > > > > > at > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) > > > > > > > at $Proxy1.openScanner(Unknown Source) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129) > > > > > > > at > > > > > > > > > > > > org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) > > > > > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) > > > > > > > 2009-10-28 
19:00:30,591 INFO > > > > > org.apache.hadoop.hbase.master.BaseScanner: > > > > > > > All 1 .META. region(s) scanned > > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > > org.apache.hadoop.hbase.master.ServerManager: > > > > > > > Removing server's info YYYY,60021,125675547057 > > > > > > > 0 > > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > > > Offlined ROOT server: x.x.x.x:60021 > > > > > > > > > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > > > -ROOT- region unset (but not set to be reassigned) > > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > > > ROOT inserted into regionsInTransition > > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > > > Offlining META region: {server: x.x.x.x:60021, regionname: > > > .META.,,1, > > > > > > > startKey: <>} > > > > > > > 2009-10-28 19:00:31,395 INFO > > > > > > org.apache.hadoop.hbase.master.RegionManager: > > > > > > > META region removed from onlineMetaRegions > > > > > > > > > > > > > > > > > > > > > > > > > > > > On the regionserver: > > > > > > > 2009-10-28 18:51:14,578 INFO > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: > > > MSG_REGION_OPEN: > > > > > > > test,,1256755871065 > > > > > > > 2009-10-28 18:51:14,578 INFO > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: > > > > > > MSG_REGION_OPEN: > > > > > > > test,,1256755871065 > > > > > > > 2009-10-28 18:51:14,578 INFO > > > > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > > > > region test,,1256755871065/796855017 available; sequence id is > > > > 10013291 > > > > > > > 2009-10-28 18:51:14,578 INFO > > > > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > > > > Starting compaction on region test,,1256755871065 > > > > > > > 2009-10-28 18:51:18,388 DEBUG 
org.apache.zookeeper.ClientCnxn: > > Got > > > > ping > > > > > > > response for sessionid:0x249c76021d0001 after 0ms > > > > > > > 2009-10-28 18:51:19,341 ERROR > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: > > > > > > > org.apache.hadoop.hbase.NotServingRegionException: > > > > test,,1256754924503 > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784) > > > > > > > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown > > > > Source) > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > > > > > > > at java.lang.reflect.Method.invoke(Method.java:597) > > > > > > > at > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648) > > > > > > > at > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) > > > > > > > 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: > > IPC > > > > > > Server > > > > > > > handler 0 on 60021, call get([B@21fefd80, row=1053508149, > > > > > maxVersions=1, > > > > > > > timeRange=[0,9223372036854775807), > > > > families={(family=email_ip_activity, > > > > > > > columns=ALL}) from x.x.x.x:54669: error: > > > > > > > org.apache.hadoop.hbase.NotServingRegionException: > > > > test,,1256754924503 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray < > > jlist@...> > > > > > > wrote: > > > > > > > > > > > > > >> These client error messages are not particular descriptive as > to > > > the > > > > > > root > > > > > > >> cause (they are fatal errors, or close to it). 
> > > > > > >> > > > > > > >> What is going on in your regionservers when these errors > happen? > > > > > Check > > > > > > >> the master and RS logs. > > > > > > >> > > > > > > >> Also, you definitely do not want 19 zookeeper nodes. Reduce > > that > > > to > > > > 3 > > > > > > or > > > > > > >> 5 max. > > > > > > >> > > > > > > >> What is the hardware you are using for these nodes, and what > > > > settings > > > > > do > > > > > > >> you have for heap/GC? > > > > > > >> > > > > > > >> JG > > > > > > >> > > > > > > >> > > > > > > >> Zhenyu Zhong wrote: > > > > > > >> > > > > > > >>> Stack, > > > > > > >>> > > > > > > >>> Thank you very much for your comments. > > > > > > >>> I am running a cluster with 20 nodes. I set 19 as both > > > regionserver > > > > > and > > > > > > >>> zookeeper quorums. > > > > > > >>> The versions I am using are Hadoop0.20.1 and HBase0.20.1. > > > > > > >>> I started with an empty table and try to load 200 million > > records > > > > > into > > > > > > >>> that > > > > > > >>> empty table. > > > > > > >>> There is a key in each record. Logically, in my MR program, > > > during > > > > > the > > > > > > >>> setup, I opened an HTable, in my mapper, I fetch the record > > from > > > > > HTable > > > > > > >>> via > > > > > > >>> key in the record, then make some changes to the columns and > > > update > > > > > > that > > > > > > >>> row > > > > > > >>> back to HTable through TableOutputFormat by passing a put. > > There > > > is > > > > > no > > > > > > >>> reduce tasks involved here. (Though it is unnecessary to > fetch > > > row > > > > > > from > > > > > > >>> an > > > > > > >>> empty table, I just intended to do that) > > > > > > >>> > > > > > > >>> Additionally, when I reduced the number of regionservers and > > > number > > > > > of > > > > > > >>> zookeeper quorums. 
> > > > > > >>> I had different errors: > > > > > > >>> org.apache.hadoop.hbase.client.NoServerForRegionException: > > Timed > > > > out > > > > > > >>> trying > > > > > > >>> to locate root region at > > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929) > > > > > > >>> at > > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580) > > > > > > >>> at > > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > > > > > > >>> at > > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > > > > > > >>> at > > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589) > > > > > > >>> at > > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) > > > > > > >>> at > > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) > > > > > > >>> at > > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593) > > > > > > >>> at > > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556) > > > > > > >>> 
at org.apache.hadoop.hbase.client.HTable.(HTable.java:127) at > > > > > > >>> org.apache.hadoop.hbase.client.HTable.(HTable.java:105) at > > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116) > > > > > > >>> at > > > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573) > > > > at > > > > > > >>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at > > > > > > >>> org.apache.hadoop.mapred.Child.main(Child.java:170) > > > > > > >>> > > > > > > >>> Many thanks in advance. > > > > > > >>> zhenyu > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@...> > > > wrote: > > > > > > >>> > > > > > > >>> Whats your cluster topology? How many nodes involved? When > > you > > > > see > > > > > > the > > > > > > >>>> below message, how many regions in your table? How are you > > > > loading > > > > > > your > > > > > > >>>> table? > > > > > > >>>> Thanks, > > > > > > >>>> St.Ack > > > > > > >>>> > > > > > > >>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong < > > > > > > zhongresearch@... > > > > > > >>>> > > > > > > >>>>> wrote: > > > > > > >>>>> Nitay, > > > > > > >>>>> > > > > > > >>>>> I am very appreciated. > > > > > > >>>>> > > > > > > >>>>> As Ryan suggested, I increased the zookeeper session > timeout > > to > > > > > > >>>>> 40seconds > > > > > > >>>>> along with the GC options -XX:ParallelGCThreads=8 > > > > > > >>>>> > > > > > > >>>> -XX:+UseConcMarkSweepGC > > > > > > >>>> > > > > > > >>>>> in place. I set the Heapsize to 4GB. I also set the > > > > > vm.swappiness=0. > > > > > > >>>>> > > > > > > >>>>> However it still ran into problem. Please find the > following > > > > > errors. 
> > > > > > >>>>> > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Trying > > > > to > > > > > > >>>>> contact region server x.x.x.x:60021 for region > > > > > > >>>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but > failed > > > > after > > > > > > 10 > > > > > > >>>>> attempts. > > > > > > >>>>> Exceptions: > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Failed > > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Failed > > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Failed > > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Failed > > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Failed > > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Failed > > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Failed > > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Failed > > > > > > >>>>> setting up proxy to /x.x.x.:60021 after attempts=1 > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Failed > > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: > > > Failed > > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 > > > > > > >>>>> > > > > > > >>>>> at > > > > > > >>>>> > 
> > > > > >>>>> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001) > > > > > > >>>> > > > > > > >>>>> at > > > > org.apache.hadoop.hbase.client.HTable.get(HTable.java:413) > > > > > > >>>>> > > > > > > >>>>> > > > > > > >>>>> The input file is about 10GB around 200million rows of > data. > > > > > > >>>>> This load doesn't seem too large. However this kind of > errors > > > > keep > > > > > > >>>>> > > > > > > >>>> popping > > > > > > >>>> > > > > > > >>>>> up. > > > > > > >>>>> > > > > > > >>>>> Does Regionserver need to be deployed to dedicated > machines? > > > > > > >>>>> Does Zookeeper need to be deployed to dedicated machines as > > > well? > > > > > > >>>>> > > > > > > >>>>> Best, > > > > > > >>>>> zhenyu > > > > > > >>>>> > > > > > > >>>>> > > > > > > >>>>> > > > > > > >>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@...> > > > wrote: > > > > > > >>>>> > > > > > > >>>>> Hi Zhenyu, > > > > > > >>>>>> > > > > > > >>>>>> Sorry for the delay. I started working on this a while > back, > > > > > before > > > > > > I > > > > > > >>>>>> > > > > > > >>>>> left > > > > > > >>>>> > > > > > > >>>>>> my job for another company. Since then I haven't had much > > time > > > > to > > > > > > work > > > > > > >>>>>> > > > > > > >>>>> on > > > > > > >>>> > > > > > > >>>>> HBase unfortunately :(. I'll try to dig up what I had and > see > > > > what > > > > > > >>>>>> > > > > > > >>>>> shape > > > > > > >>>> > > > > > > >>>>> it's in and update you. > > > > > > >>>>>> > > > > > > >>>>>> Cheers, > > > > > > >>>>>> -n > > > > > > >>>>>> > > > > > > >>>>>> > > > > > > >>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote: > > > > > > >>>>>> > > > > > > >>>>>> Sorry I must have mistyped, I meant to say "40 seconds". 
> > You > > > > can > > > > > > >>>>>> > > > > > > >>>>>>> still see multi-second pauses at times, so you need to > give > > > > > > yourself > > > > > > >>>>>>> a > > > > > > >>>>>>> bigger buffer. > > > > > > >>>>>>> > > > > > > >>>>>>> The parallel threads argument should not be necessary, > but > > > you > > > > do > > > > > > >>>>>>> need > > > > > > >>>>>>> the UseConcMarkSweepGC flag as well. > > > > > > >>>>>>> > > > > > > >>>>>>> Let us know how it goes! > > > > > > >>>>>>> -ryan > > > > > > >>>>>>> > > > > > > >>>>>>> > > > > > > >>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong < > > > > > > >>>>>>> > > > > > > >>>>>> zhongresearch@...> > > > > > > >>>> > > > > > > >>>>> wrote: > > > > > > >>>>>>> > > > > > > >>>>>>> Ryan, > > > > > > >>>>>>>> I am very appreciated for your feedbacks. > > > > > > >>>>>>>> I have set the zookeeper.session.timeout to seconds > which > > is > > > > way > > > > > > >>>>>>>> > > > > > > >>>>>>> higher > > > > > > >>>> > > > > > > >>>>> than > > > > > > >>>>>>>> 40ms. > > > > > > >>>>>>>> In the same time, the -Xms is set to 4GB, which should > be > > > > > > >>>>>>>> sufficient. > > > > > > >>>>>>>> I also tried GC options like > > > > > > >>>>>>>> > > > > > > >>>>>>>> -XX:ParallelGCThreads=8 > > > > > > >>>>>>>> -XX:+UseConcMarkSweepGC > > > > > > >>>>>>>> > > > > > > >>>>>>>> I even set the vm.swappiness=0 > > > > > > >>>>>>>> > > > > > > >>>>>>>> However, I still came across the problem that a > > RegionServer > > > > > > >>>>>>>> shutdown > > > > > > >>>>>>>> itself. 
> > > > > > >>>>>>>> > > > > > > >>>>>>>> Best, > > > > > > >>>>>>>> zhong > > > > > > >>>>>>>> > > > > > > >>>>>>>> > > > > > > >>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson < > > > > > ryanobjc@...> > > > > > > >>>>>>>> > > > > > > >>>>>>> wrote: > > > > > > >>>>> > > > > > > >>>>>> Set the ZK timeout to something like 40ms, and give the > GC > > > > > enough > > > > > > >>>>>>>> > > > > > > >>>>>>> Xmx > > > > > > >>>> > > > > > > >>>>> so you never risk entering the much dreaded > > > > > concurrent-mode-failure > > > > > > >>>>>>>>> whereby the entire heap must be GCed. > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> Consider testing Java 7 and the G1 GC. > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> We could get a JNI thread to do this, but no one has > done > > > so > > > > > yet. > > > > > > I > > > > > > >>>>>>>>> > > > > > > >>>>>>>> am > > > > > > >>>> > > > > > > >>>>> personally hoping for G1 and in the meantime overprovision > > our > > > > Xmx > > > > > > >>>>>>>>> > > > > > > >>>>>>>> to > > > > > > >>>> > > > > > > >>>>> avoid the concurrent mode failures. > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> -ryan > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong < > > > > > > >>>>>>>>> > > > > > > >>>>>>>> zhongresearch@...> > > > > > > >>>>> > > > > > > >>>>>> wrote: > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> Ryan, > > > > > > >>>>>>>>>> > > > > > > >>>>>>>>>> Thank you very much. > > > > > > >>>>>>>>>> May I ask whether there are any ways to get around > this > > > > > problem > > > > > > to > > > > > > >>>>>>>>>> > > > > > > >>>>>>>>> make > > > > > > >>>>> > > > > > > >>>>>> HBase more stable? 
> > > > > > >>>>>>>>>> > > > > > > >>>>>>>>>> best, > > > > > > >>>>>>>>>> zhong > > > > > > >>>>>>>>>> > > > > > > >>>>>>>>>> > > > > > > >>>>>>>>>> > > > > > > >>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson < > > > > > > ryanobjc@...> > > > > > > >>>>>>>>>> wrote: > > > > > > >>>>>>>>>> > > > > > > >>>>>>>>>> There isnt any working code yet. Just an idea, and a > > > > > prototype. > > > > > > >>>>>>>>>> > > > > > > >>>>>>>>>>> There is some sense that if we can get the G1 GC that > > we > > > > > could > > > > > > >>>>>>>>>>> get > > > > > > >>>>>>>>>>> > > > > > > >>>>>>>>>> rid > > > > > > >>>>> > > > > > > >>>>>> of all long pauses, and avoid the need for this. > > > > > > >>>>>>>>>>> > > > > > > >>>>>>>>>>> -ryan > > > > > > >>>>>>>>>>> > > > > > > >>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong < > > > > > > >>>>>>>>>>> zhongresearch@...> > > > > > > >>>>>>>>>>> wrote: > > > > > > >>>>>>>>>>> > > > > > > >>>>>>>>>>> Hi, > > > > > > >>>>>>>>>>>> > > > > > > >>>>>>>>>>>> I am very interesting to the solution that Joey > > proposed > > > > and > > > > > > >>>>>>>>>>>> > > > > > > >>>>>>>>>>> would > > > > > > >>>> > > > > > > >>>>> like > > > > > > >>>>>>>>>>> > > > > > > >>>>>>>>>> to > > > > > > >>>>>>>>>> > > > > > > >>>>>>>>>>> have a try. > > > > > > >>>>>>>>>>>> Does anyone have any ideas on how to deploy this > > > > zk_wrapper > > > > > in > > > > > > >>>>>>>>>>>> > > > > > > >>>>>>>>>>> JNI > > > > > > >>>> > > > > > > >>>>> integration? > > > > > > >>>>>>>>>>>> > > > > > > >>>>>>>>>>>> I would be very appreciated. > > > > > > >>>>>>>>>>>> > > > > > > >>>>>>>>>>>> thanks > > > > > > >>>>>>>>>>>> zhong > > > > > > >>>>>>>>>>>> > > > > > > >>>>>>>>>>>> > > > > > > >>>>>>>>>>>> > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
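For readers following the advice quoted above (raise the ZooKeeper session timeout to about 40 seconds), here is a minimal sketch of how that setting might look in hbase-site.xml. The property name zookeeper.session.timeout is the one mentioned in the thread; the value 40000 assumes the property is expressed in milliseconds, so treat it as an illustration rather than a tested configuration for any particular cluster.

```xml
<!-- hbase-site.xml (fragment): sketch only, not a tested configuration -->
<property>
  <name>zookeeper.session.timeout</name>
  <!-- assumed to be in milliseconds: 40000 ms = 40 seconds -->
  <value>40000</value>
</property>
```

The ZooKeeper server negotiates the effective session timeout and, by default, bounds it relative to its own tickTime (an upper limit of 20 ticks), so a very large client-side value may not take effect as written. The JVM options discussed in the thread (-XX:+UseConcMarkSweepGC and a 4 GB heap) are normally set separately in conf/hbase-env.sh.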
|
|
Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
BTW, if a regionserver disconnects, would manually restarting that particular regionserver help?

Best,
zhenyu

On Thu, Oct 29, 2009 at 6:26 PM, Zhenyu Zhong <zhongresearch@...> wrote:
> Stack,
>
> I very much appreciate your help.
> Yes, I am using Cacti to monitor the loads etc. I also upped my zk session timeout to 600 sec.
> May I ask what the default connection timeout is for a zookeeper client connecting to a quorum?
>
> thanks
> zhenyu
>
> On Thu, Oct 29, 2009 at 6:06 PM, stack <stack@...> wrote:
>
>> If it stole machine resources, yeah, it could. Do you have anything in place to watch your cluster with? Ganglia or some such so you can watch the loadings? Is the machine with the RS that is going down swapping? You could try upping your zk session timeout in your hbase cluster.
>> St.Ack
>>
>> On Thu, Oct 29, 2009 at 3:00 PM, Zhenyu Zhong <zhongresearch@...> wrote:
>>
>> > The only thing that could have been started is another MR job working on another dataset at the same time as this test was running. So some nodes might be under heavy load.
>> > I am wondering whether that would cause the connection timeout.
>> >
>> > thanks
>> > zhenyu
>> >
>> > On Thu, Oct 29, 2009 at 5:32 PM, stack <stack@...> wrote:
>> >
>> > > On Thu, Oct 29, 2009 at 2:23 PM, Zhenyu Zhong <zhongresearch@...> wrote:
>> > >
>> > > > I have 19 quorum members now.
>> > >
>> > > That's too many. Have 3 or maybe 5. See the zk site for rationale.
>> > >
>> > > > When I tested loading data into two column families of one table in HBase using two separate MR jobs, I lost my regionserver and the test failed.
>> > > >
>> > > > Does HBase allow such a table update operation?
>> > > >
>> > > > The errors I got when I lost my regionserver are:
>> > > > 2009-10-29 21:09:34,705 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256849620029, entries=271911, calcsize=63754142, filesize=33975611.
>> > > > New hlog /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256850574705
>> > > > 2009-10-29 21:09:50,322 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
>> > > > org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>> > >
>> > > Have you read the 'Getting Started' and made the necessary changes to file descriptors and xceiver count?
>> > >
>> > > You will see the above message after a regionserver has restarted and tries to go back to the master (what hbase is this? I think you said it was 0.20.x).
>> > >
>> > > > java.io.IOException: TIMED OUT
>> > > >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> > > > 2009-10-29 21:09:50,873 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
>> > >
>> > > This is a timeout against zk. You've lost your session. The RS will go down. The connection to zk is basic to hbase. Something is up. In the past others have reported things like incorrect BIOS settings on disks that have made the disks run slow, or just something up with the networking. Can you check all is healthy? You seem to be having too many issues for such a small loading with such a large cluster.
>> > >
>> > > St.Ack
>> > >
>> > > > On Thu, Oct 29, 2009 at 2:51 PM, stack <stack@...> wrote:
>> > > >
>> > > > > On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong <zhongresearch@...> wrote:
>> > > > >
>> > > > > > FYI
>> > > > > > It looks like increasing the number of Zookeeper Quorums can solve the
>> > > > > > following error message : org.apache.hadoop.hbase.
>> > > > > > client.NoServerForRegionException: Timed out trying to locate >> root >> > > > region >> > > > > > at >> > > > > > org.apache.hadoop.hbase. >> > > > > > >> > > > > > You mean quorum members? How many do you have now? >> > > > > >> > > > > >> > > > > >> > > > > > Now I am running Zookeeper quorum on each node I have. >> > > > > > However, I am still having issues about losing regionserver. >> > > > > > >> > > > > > Whats in the logs? >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > > Is there a way to browse the Znode in zookeeper? >> > > > > > >> > > > > > >> > > > > Type 'zk' in the hbase shell. >> > > > > >> > > > > You can get to the zk shell from hbase shell. You so things like: >> > > > > >> > > > > > zk "ls /" >> > > > > >> > > > > (Yes, quotes needed). >> > > > > >> > > > > St.Ack >> > > > > >> > > > > >> > > > > >> > > > > > thanks >> > > > > > zhenyu >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong < >> > > zhongresearch@... >> > > > > > >wrote: >> > > > > > >> > > > > > > JG, >> > > > > > > >> > > > > > > >> > > > > > > Thanks a lot for the tips. >> > > > > > > I set the HEAP to 4GB and GC options as >> -XX:ParallelGCThreads=8 >> > > > > > > -XX:+UseConcMarkSweepGC. >> > > > > > > >> > > > > > > I checked the logs in my Master an RS and found the following >> > > errors. >> > > > > > > Basically, master got exception error while scanning ROOT, >> then >> > the >> > > > > ROOT >> > > > > > > region was offline and unset. Thus the regionserver can't get >> > > > > > > NotservingRegion errors. >> > > > > > > >> > > > > > > In the master: >> > > > > > > 2009-10-28 19:00:30,591 INFO >> > > > > org.apache.hadoop.hbase.master.BaseScanner: >> > > > > > > RegionManager.rootScanner scanning meta region {server: x.x.x. 
>> > > > > > > x:60021, regionname: -ROOT-,,0, startKey: <>} >> > > > > > > 2009-10-28 19:00:30,591 WARN >> > > > > org.apache.hadoop.hbase.master.BaseScanner: >> > > > > > > Scan ROOT region >> > > > > > > java.io.IOException: Call to /x.x.x.x:60021 failed on local >> > > > exception: >> > > > > > > java.io.EOFException >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757) >> > > > > > > at >> > > > > > > >> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727) >> > > > > > > at >> > > > > > > >> > > > >> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) >> > > > > > > at $Proxy1.openScanner(Unknown Source) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) >> > > > > > > at >> > > > > > > >> > > > > >> > > >> org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79) >> > > > > > > at >> > > > > > > >> > > > >> org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) >> > > > > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) >> > > > > > > Caused by: java.io.EOFException >> > > > > > > at >> > > java.io.DataInputStream.readInt(DataInputStream.java:375) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448) >> > > > > > > 2009-10-28 19:00:30,591 INFO >> > > > > org.apache.hadoop.hbase.master.BaseScanner: >> > > > > > > RegionManager.metaScanner scanning meta 
region {server: x.x.x. >> > > > > > > x:60021, regionname: .META.,,1, startKey: <>} >> > > > > > > 2009-10-28 19:00:30,591 WARN >> > > > > org.apache.hadoop.hbase.master.BaseScanner: >> > > > > > > Scan one META region: {server: x.x.x.x:60021, regionname: .M >> > > > > > > ETA.,,1, startKey: <>} >> > > > > > > java.net.ConnectException: Connection refused >> > > > > > > at sun.nio.ch.SocketChannelImpl.checkConnect(Native >> > Method) >> > > > > > > at >> > > > > > > >> > > > >> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) >> > > > > > > at >> > > org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831) >> > > > > > > at >> > > > > > > >> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712) >> > > > > > > at >> > > > > > > >> > > > >> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328) >> > > > > > > at $Proxy1.openScanner(Unknown Source) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129) >> > > > > > > at >> > > > > > > >> > > > >> org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136) >> 
> > > > > > at org.apache.hadoop.hbase.Chore.run(Chore.java:68) >> > > > > > > 2009-10-28 19:00:30,591 INFO >> > > > > org.apache.hadoop.hbase.master.BaseScanner: >> > > > > > > All 1 .META. region(s) scanned >> > > > > > > 2009-10-28 19:00:31,395 INFO >> > > > > > org.apache.hadoop.hbase.master.ServerManager: >> > > > > > > Removing server's info YYYY,60021,125675547057 >> > > > > > > 0 >> > > > > > > 2009-10-28 19:00:31,395 INFO >> > > > > > org.apache.hadoop.hbase.master.RegionManager: >> > > > > > > Offlined ROOT server: x.x.x.x:60021 >> > > > > > > >> > > > > > > 2009-10-28 19:00:31,395 INFO >> > > > > > org.apache.hadoop.hbase.master.RegionManager: >> > > > > > > -ROOT- region unset (but not set to be reassigned) >> > > > > > > 2009-10-28 19:00:31,395 INFO >> > > > > > org.apache.hadoop.hbase.master.RegionManager: >> > > > > > > ROOT inserted into regionsInTransition >> > > > > > > 2009-10-28 19:00:31,395 INFO >> > > > > > org.apache.hadoop.hbase.master.RegionManager: >> > > > > > > Offlining META region: {server: x.x.x.x:60021, regionname: >> > > .META.,,1, >> > > > > > > startKey: <>} >> > > > > > > 2009-10-28 19:00:31,395 INFO >> > > > > > org.apache.hadoop.hbase.master.RegionManager: >> > > > > > > META region removed from onlineMetaRegions >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > On the regionserver: >> > > > > > > 2009-10-28 18:51:14,578 INFO >> > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: >> > > MSG_REGION_OPEN: >> > > > > > > test,,1256755871065 >> > > > > > > 2009-10-28 18:51:14,578 INFO >> > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: >> > > > > > MSG_REGION_OPEN: >> > > > > > > test,,1256755871065 >> > > > > > > 2009-10-28 18:51:14,578 INFO >> > > > > > org.apache.hadoop.hbase.regionserver.HRegion: >> > > > > > > region test,,1256755871065/796855017 available; sequence id is >> > > > 10013291 >> > > > > > > 2009-10-28 18:51:14,578 INFO >> > > > > > 
org.apache.hadoop.hbase.regionserver.HRegion: >> > > > > > > Starting compaction on region test,,1256755871065 >> > > > > > > 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: >> > Got >> > > > ping >> > > > > > > response for sessionid:0x249c76021d0001 after 0ms >> > > > > > > 2009-10-28 18:51:19,341 ERROR >> > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: >> > > > > > > org.apache.hadoop.hbase.NotServingRegionException: >> > > > test,,1256754924503 >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784) >> > > > > > > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown >> > > > Source) >> > > > > > > at >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >> > > > > > > at java.lang.reflect.Method.invoke(Method.java:597) >> > > > > > > at >> > > > > > > >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648) >> > > > > > > at >> > > > > > > >> > > > > >> > > >> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) >> > > > > > > 2009-10-28 18:51:19,341 INFO >> org.apache.hadoop.ipc.HBaseServer: >> > IPC >> > > > > > Server >> > > > > > > handler 0 on 60021, call get([B@21fefd80, row=1053508149, >> > > > > maxVersions=1, >> > > > > > > timeRange=[0,9223372036854775807), >> > > > families={(family=email_ip_activity, >> > > > > > > columns=ALL}) from x.x.x.x:54669: error: >> > > > > > > org.apache.hadoop.hbase.NotServingRegionException: >> > > > test,,1256754924503 >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray < >> > jlist@...> >> > 
> > > > wrote: >> > > > > > > >> > > > > > >> These client error messages are not particular descriptive as >> to >> > > the >> > > > > > root >> > > > > > >> cause (they are fatal errors, or close to it). >> > > > > > >> >> > > > > > >> What is going on in your regionservers when these errors >> happen? >> > > > > Check >> > > > > > >> the master and RS logs. >> > > > > > >> >> > > > > > >> Also, you definitely do not want 19 zookeeper nodes. Reduce >> > that >> > > to >> > > > 3 >> > > > > > or >> > > > > > >> 5 max. >> > > > > > >> >> > > > > > >> What is the hardware you are using for these nodes, and what >> > > > settings >> > > > > do >> > > > > > >> you have for heap/GC? >> > > > > > >> >> > > > > > >> JG >> > > > > > >> >> > > > > > >> >> > > > > > >> Zhenyu Zhong wrote: >> > > > > > >> >> > > > > > >>> Stack, >> > > > > > >>> >> > > > > > >>> Thank you very much for your comments. >> > > > > > >>> I am running a cluster with 20 nodes. I set 19 as both >> > > regionserver >> > > > > and >> > > > > > >>> zookeeper quorums. >> > > > > > >>> The versions I am using are Hadoop0.20.1 and HBase0.20.1. >> > > > > > >>> I started with an empty table and try to load 200 million >> > records >> > > > > into >> > > > > > >>> that >> > > > > > >>> empty table. >> > > > > > >>> There is a key in each record. Logically, in my MR program, >> > > during >> > > > > the >> > > > > > >>> setup, I opened an HTable, in my mapper, I fetch the record >> > from >> > > > > HTable >> > > > > > >>> via >> > > > > > >>> key in the record, then make some changes to the columns and >> > > update >> > > > > > that >> > > > > > >>> row >> > > > > > >>> back to HTable through TableOutputFormat by passing a put. >> > There >> > > is >> > > > > no >> > > > > > >>> reduce tasks involved here. 
(Though it is unnecessary to >> fetch >> > > row >> > > > > > from >> > > > > > >>> an >> > > > > > >>> empty table, I just intended to do that) >> > > > > > >>> >> > > > > > >>> Additionally, when I reduced the number of regionservers and >> > > number >> > > > > of >> > > > > > >>> zookeeper quorums. >> > > > > > >>> I had different errors: >> > > > > > >>> org.apache.hadoop.hbase.client.NoServerForRegionException: >> > Timed >> > > > out >> > > > > > >>> trying >> > > > > > >>> to locate root region at >> > > > > > >>> >> > > > > > >>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929) >> > > > > > >>> at >> > > > > > >>> >> > > > > > >>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580) >> > > > > > >>> at >> > > > > > >>> >> > > > > > >>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) >> > > > > > >>> at >> > > > > > >>> >> > > > > > >>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) >> > > > > > >>> at >> > > > > > >>> >> > > > > > >>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589) >> > > > > > >>> at >> > > > > > >>> >> > > > > > >>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562) >> > > > > > >>> at >> > > > > > >>> >> > > > > > >>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693) >> > > > > > >>> at >> > > > > > >>> >> 
> > > > > >>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593) >> > > > > > >>> at >> > > > > > >>> >> > > > > > >>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556) >> > > > > > >>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:127) >> at >> > > > > > >>> org.apache.hadoop.hbase.client.HTable.(HTable.java:105) at >> > > > > > >>> >> > > > > > >>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116) >> > > > > > >>> at >> > > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573) >> > > > at >> > > > > > >>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at >> > > > > > >>> org.apache.hadoop.mapred.Child.main(Child.java:170) >> > > > > > >>> >> > > > > > >>> Many thanks in advance. >> > > > > > >>> zhenyu >> > > > > > >>> >> > > > > > >>> >> > > > > > >>> >> > > > > > >>> >> > > > > > >>> On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@...> >> > > wrote: >> > > > > > >>> >> > > > > > >>> Whats your cluster topology? How many nodes involved? >> When >> > you >> > > > see >> > > > > > the >> > > > > > >>>> below message, how many regions in your table? How are you >> > > > loading >> > > > > > your >> > > > > > >>>> table? >> > > > > > >>>> Thanks, >> > > > > > >>>> St.Ack >> > > > > > >>>> >> > > > > > >>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong < >> > > > > > zhongresearch@... >> > > > > > >>>> >> > > > > > >>>>> wrote: >> > > > > > >>>>> Nitay, >> > > > > > >>>>> >> > > > > > >>>>> I am very appreciated. 
>> > > > > > >>>>> >> > > > > > >>>>> As Ryan suggested, I increased the zookeeper session >> timeout >> > to >> > > > > > >>>>> 40seconds >> > > > > > >>>>> along with the GC options -XX:ParallelGCThreads=8 >> > > > > > >>>>> >> > > > > > >>>> -XX:+UseConcMarkSweepGC >> > > > > > >>>> >> > > > > > >>>>> in place. I set the Heapsize to 4GB. I also set the >> > > > > vm.swappiness=0. >> > > > > > >>>>> >> > > > > > >>>>> However it still ran into problem. Please find the >> following >> > > > > errors. >> > > > > > >>>>> >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Trying >> > > > to >> > > > > > >>>>> contact region server x.x.x.x:60021 for region >> > > > > > >>>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but >> failed >> > > > after >> > > > > > 10 >> > > > > > >>>>> attempts. >> > > > > > >>>>> Exceptions: >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Failed >> > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Failed >> > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Failed >> > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Failed >> > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Failed >> > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Failed >> > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Failed >> > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after 
attempts=1 >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Failed >> > > > > > >>>>> setting up proxy to /x.x.x.:60021 after attempts=1 >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Failed >> > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >> > > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: >> > > Failed >> > > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1 >> > > > > > >>>>> >> > > > > > >>>>> at >> > > > > > >>>>> >> > > > > > >>>>> >> > > > > > >> > > > > >> > > > >> > > >> > >> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001) >> > > > > > >>>> >> > > > > > >>>>> at >> > > > org.apache.hadoop.hbase.client.HTable.get(HTable.java:413) >> > > > > > >>>>> >> > > > > > >>>>> >> > > > > > >>>>> The input file is about 10GB around 200million rows of >> data. >> > > > > > >>>>> This load doesn't seem too large. However this kind of >> errors >> > > > keep >> > > > > > >>>>> >> > > > > > >>>> popping >> > > > > > >>>> >> > > > > > >>>>> up. >> > > > > > >>>>> >> > > > > > >>>>> Does Regionserver need to be deployed to dedicated >> machines? >> > > > > > >>>>> Does Zookeeper need to be deployed to dedicated machines >> as >> > > well? >> > > > > > >>>>> >> > > > > > >>>>> Best, >> > > > > > >>>>> zhenyu >> > > > > > >>>>> >> > > > > > >>>>> >> > > > > > >>>>> >> > > > > > >>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@...> >> > > wrote: >> > > > > > >>>>> >> > > > > > >>>>> Hi Zhenyu, >> > > > > > >>>>>> >> > > > > > >>>>>> Sorry for the delay. I started working on this a while >> back, >> > > > > before >> > > > > > I >> > > > > > >>>>>> >> > > > > > >>>>> left >> > > > > > >>>>> >> > > > > > >>>>>> my job for another company. 
Since then I haven't had much >> > time >> > > > to >> > > > > > work >> > > > > > >>>>>> >> > > > > > >>>>> on >> > > > > > >>>> >> > > > > > >>>>> HBase unfortunately :(. I'll try to dig up what I had and >> see >> > > > what >> > > > > > >>>>>> >> > > > > > >>>>> shape >> > > > > > >>>> >> > > > > > >>>>> it's in and update you. >> > > > > > >>>>>> >> > > > > > >>>>>> Cheers, >> > > > > > >>>>>> -n >> > > > > > >>>>>> >> > > > > > >>>>>> >> > > > > > >>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote: >> > > > > > >>>>>> >> > > > > > >>>>>> Sorry I must have mistyped, I meant to say "40 seconds". >> > You >> > > > can >> > > > > > >>>>>> >> > > > > > >>>>>>> still see multi-second pauses at times, so you need to >> give >> > > > > > yourself >> > > > > > >>>>>>> a >> > > > > > >>>>>>> bigger buffer. >> > > > > > >>>>>>> >> > > > > > >>>>>>> The parallel threads argument should not be necessary, >> but >> > > you >> > > > do >> > > > > > >>>>>>> need >> > > > > > >>>>>>> the UseConcMarkSweepGC flag as well. >> > > > > > >>>>>>> >> > > > > > >>>>>>> Let us know how it goes! >> > > > > > >>>>>>> -ryan >> > > > > > >>>>>>> >> > > > > > >>>>>>> >> > > > > > >>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong < >> > > > > > >>>>>>> >> > > > > > >>>>>> zhongresearch@...> >> > > > > > >>>> >> > > > > > >>>>> wrote: >> > > > > > >>>>>>> >> > > > > > >>>>>>> Ryan, >> > > > > > >>>>>>>> I am very appreciated for your feedbacks. >> > > > > > >>>>>>>> I have set the zookeeper.session.timeout to seconds >> which >> > is >> > > > way >> > > > > > >>>>>>>> >> > > > > > >>>>>>> higher >> > > > > > >>>> >> > > > > > >>>>> than >> > > > > > >>>>>>>> 40ms. >> > > > > > >>>>>>>> In the same time, the -Xms is set to 4GB, which should >> be >> > > > > > >>>>>>>> sufficient. 
>> > > > > > >>>>>>>> I also tried GC options like >> > > > > > >>>>>>>> >> > > > > > >>>>>>>> -XX:ParallelGCThreads=8 >> > > > > > >>>>>>>> -XX:+UseConcMarkSweepGC >> > > > > > >>>>>>>> >> > > > > > >>>>>>>> I even set the vm.swappiness=0 >> > > > > > >>>>>>>> >> > > > > > >>>>>>>> However, I still came across the problem that a >> > RegionServer >> > > > > > >>>>>>>> shutdown >> > > > > > >>>>>>>> itself. >> > > > > > >>>>>>>> >> > > > > > >>>>>>>> Best, >> > > > > > >>>>>>>> zhong >> > > > > > >>>>>>>> >> > > > > > >>>>>>>> >> > > > > > >>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson < >> > > > > ryanobjc@...> >> > > > > > >>>>>>>> >> > > > > > >>>>>>> wrote: >> > > > > > >>>>> >> > > > > > >>>>>> Set the ZK timeout to something like 40ms, and give the >> GC >> > > > > enough >> > > > > > >>>>>>>> >> > > > > > >>>>>>> Xmx >> > > > > > >>>> >> > > > > > >>>>> so you never risk entering the much dreaded >> > > > > concurrent-mode-failure >> > > > > > >>>>>>>>> whereby the entire heap must be GCed. >> > > > > > >>>>>>>>> >> > > > > > >>>>>>>>> Consider testing Java 7 and the G1 GC. >> > > > > > >>>>>>>>> >> > > > > > >>>>>>>>> We could get a JNI thread to do this, but no one has >> done >> > > so >> > > > > yet. >> > > > > > I >> > > > > > >>>>>>>>> >> > > > > > >>>>>>>> am >> > > > > > >>>> >> > > > > > >>>>> personally hoping for G1 and in the meantime >> overprovision >> > our >> > > > Xmx >> > > > > > >>>>>>>>> >> > > > > > >>>>>>>> to >> > > > > > >>>> >> > > > > > >>>>> avoid the concurrent mode failures. >> > > > > > >>>>>>>>> >> > > > > > >>>>>>>>> -ryan >> > > > > > >>>>>>>>> >> > > > > > >>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong < >> > > > > > >>>>>>>>> >> > > > > > >>>>>>>> zhongresearch@...> >> > > > > > >>>>> >> > > > > > >>>>>> wrote: >> > > > > > >>>>>>>>> >> > > > > > >>>>>>>>> Ryan, >> > > > > > >>>>>>>>>> >> > > > > > >>>>>>>>>> Thank you very much. 
>> > > > > > >>>>>>>>>> May I ask whether there are any ways to get around >> this >> > > > > problem >> > > > > > to >> > > > > > >>>>>>>>>> >> > > > > > >>>>>>>>> make >> > > > > > >>>>> >> > > > > > >>>>>> HBase more stable? >> > > > > > >>>>>>>>>> >> > > > > > >>>>>>>>>> best, >> > > > > > >>>>>>>>>> zhong >> > > > > > >>>>>>>>>> >> > > > > > >>>>>>>>>> >> > > > > > >>>>>>>>>> >> > > > > > >>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson < >> > > > > > ryanobjc@...> >> > > > > > >>>>>>>>>> wrote: >> > > > > > >>>>>>>>>> >> > > > > > >>>>>>>>>> There isnt any working code yet. Just an idea, and a >> > > > > prototype. >> > > > > > >>>>>>>>>> >> > > > > > >>>>>>>>>>> There is some sense that if we can get the G1 GC >> that >> > we >> > > > > could >> > > > > > >>>>>>>>>>> get >> > > > > > >>>>>>>>>>> >> > > > > > >>>>>>>>>> rid >> > > > > > >>>>> >> > > > > > >>>>>> of all long pauses, and avoid the need for this. >> > > > > > >>>>>>>>>>> >> > > > > > >>>>>>>>>>> -ryan >> > > > > > >>>>>>>>>>> >> > > > > > >>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong < >> > > > > > >>>>>>>>>>> zhongresearch@...> >> > > > > > >>>>>>>>>>> wrote: >> > > > > > >>>>>>>>>>> >> > > > > > >>>>>>>>>>> Hi, >> > > > > > >>>>>>>>>>>> >> > > > > > >>>>>>>>>>>> I am very interesting to the solution that Joey >> > proposed >> > > > and >> > > > > > >>>>>>>>>>>> >> > > > > > >>>>>>>>>>> would >> > > > > > >>>> >> > > > > > >>>>> like >> > > > > > >>>>>>>>>>> >> > > > > > >>>>>>>>>> to >> > > > > > >>>>>>>>>> >> > > > > > >>>>>>>>>>> have a try. >> > > > > > >>>>>>>>>>>> Does anyone have any ideas on how to deploy this >> > > > zk_wrapper >> > > > > in >> > > > > > >>>>>>>>>>>> >> > > > > > >>>>>>>>>>> JNI >> > > > > > >>>> >> > > > > > >>>>> integration? >> > > > > > >>>>>>>>>>>> >> > > > > > >>>>>>>>>>>> I would be very appreciated. 
>> > > > > > >>>>>>>>>>>> >> > > > > > >>>>>>>>>>>> thanks >> > > > > > >>>>>>>>>>>> zhong >> > > > > > >>>>>>>>>>>> >> > > > > > >>>>>>>>>>>> >> > > > > > >>>>>>>>>>>> >> > > > > > >>> >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> > > |