|
View:
New views
2 Messages
—
Rating Filter:
Alert me
|
|
|
How would you use HBase in this scenario?I am evaluating replacing a homegrown file storage system with HBase.
Here are the stats on our current environment: - Our workload is going to all be single record reads and writes - We have 50TB of data, with each record being a10kb to 10mb in size, (average of 300kb), in a single column - Peak 60k reads per hour - Peak 20k writes per hour Here are the questions: What sort of hardware should I be looking at here? Will capacity scale linearly as we add more servers, and for how long? Will I be able to get at least a 250ms access time with a reasonable cluster size? From what I understand, we're looking at a theoretical 64mb block read from disk for any row. In practice, how significant is this, when taking in caching and other optimizations? We can do sequential writes, but at an expected total loss of an anticipated ~50% reduction in store size due to compression. I am also concerned that sequential writes at the end of the table will all end up writing to one disk - instead of distributing the load across all servers. Thanks in advance, -Jason |
|
|
Re: How would you use HBase in this scenario?Hey,
HBase should work for your scenario. It will require a more than just a few nodes, but should be doable. So some answers: - Generally as long as you dont have hot spots, adding nodes creates linear scaling. -- How long? Could be a while, there are not 1000 node HBase installations yet, but there is nothing in the model that prevents it. - Access time is a function of how long you have to wait for disk IO, so this is controllable via how many disks you put in a node, and how many nodes you have. Generally HBase can really kick out the data in the data sizes you describe. - 64 MB block size - reading a portion of a file doesnt trigger an entire 64MB block read. Hadoop IO uses a chunk checksum system, and the chunks are 64kb, thus you may need to read 2 * 64kb if your data spans a checksum block. This isn't nearly as a big deal as it seems. - HBase includes in-line LZO compression which is really great. You should use it. - Writes... The last question deserves a paragraph answer. You're concerned about writing to the 'end' of a table. In HBase once a table grows, the entire table is split into multiple segments that are assigned to servers. Each segment contains a continuous run of keys, eg: [A,B). So thus to prevent a write-hot spot, you will want to not use a sequence ID for a primary key, or anything that resembles it (ie: timestamps for keys). Instead if you can have a key that is more random, or at least more spread out, then you get parallel write options and you can achieve very high read/write speeds. Can hbase support 60k read ops/sec? This becomes a function of how much seeking and spindles you have, and how effective OS block caching is. Adding more spindles either per node, or total nodes, this helps. For pure-pure random reads, you are at the mercy of a 9ms seek, and really reduces the number of reads/sec you can do. This isn't a HBase problem, you'll run into it on any system. Solutions are caching, more spindles and possibly flash. Good luck! -ryan On Wed, Nov 4, 2009 at 3:42 PM, Jason Strutz <jason@...> wrote: > I am evaluating replacing a homegrown file storage system with HBase. Here > are the stats on our current environment: > - Our workload is going to all be single record reads and writes > - We have 50TB of data, with each record being a10kb to 10mb in size, > (average of 300kb), in a single column > - Peak 60k reads per hour > - Peak 20k writes per hour > > Here are the questions: > What sort of hardware should I be looking at here? Will capacity scale > linearly as we add more servers, and for how long? > > Will I be able to get at least a 250ms access time with a reasonable cluster > size? > > From what I understand, we're looking at a theoretical 64mb block read from > disk for any row. In practice, how significant is this, when taking in > caching and other optimizations? > > We can do sequential writes, but at an expected total loss of an anticipated > ~50% reduction in store size due to compression. I am also concerned that > sequential writes at the end of the table will all end up writing to one > disk - instead of distributing the load across all servers. > > Thanks in advance, > -Jason > |
| Free embeddable forum powered by Nabble | Forum Help |