Improving RAIDframe Parity Handling: The Diff

My Google Summer of Code project this past summer was to make raid(4) not need to check every single bit of parity on a mirror or RAID-[45] set after an unclean shutdown. The reason for the parity check is that a write operation might have been in progress, and might have been committed to some disks but not others, leaving inconsistent parity, at the time of the unclean shutdown; thus, the solution is to keep better track of which parts of the RAID might have been being written.

The approach I took, which is also used by other RAID implementations, was to divide the set into a certain number of regions and keep, on disk, a dirty bit per region; this bit is set before any write, but cleared only if some amount of time has passed with no intervening writes, in order to keep the I/O overhead low. (See also my mentor's summary at http://blog.netbsd.org/tnf/entry/summer_of_code_results_improving )

I have prepared a patch against HEAD as of a few days ago, available at http://www.NetBSD.org/~jld/gsoc09-1017.diff ; the changes to the raidctl(8) man page should explain things reasonably well, and if not then that can be fixed. I have done some testing, in particular to determine reasonable default parameters, but given the particular importance of the correctness of anything related to storage, it needs more testing and ideally more eyeballs.

Note in particular that, with the patch, a parity map will be used by default with any non-RAID-0 set. As far as compatibility with non-parity-map kernels, I spent a fair amount of time on this and it should Do The Right Thing -- the parity map is in addition to the existing global dirty bit which is maintained as before, and if the previous kernel to touch the RAID was not parity-map enabled, a parity-map kernel will detect this and disregard the parity map.
As mentioned above, I have done some benchmarking, mainly with the case of untarring pkgsrc on async FFS (wapbl is about the same[*]) on a RAID-1; thus, many small writes with mediocre locality, and not dominated by small-stripe parity-RAID overhead. I found that, with the default settings I arrived at, the I/O overhead was small enough that it couldn't be reliably measured, and the reduction in parity considered dirty was... if I read my notes correctly, one such test on a 136GB mirror had at most 1.2GB dirty at any given time. Another test of 10 parallel pkgsrc-untarrings went up to 9.8GB.

I was initially going to do another round of benchmarking and get some harder numbers before posting this, but then school started eating my life, and it's been delayed quite enough, I think.

Comments and questions (and testing) welcome.

[*] In particular, the journal should get hit enough to never be marked clean as long as there's any I/O, and thus incur no ongoing overhead.

--
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))
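The dirty-region scheme described in this first message (set a region's bit before any write, clear it only after a quiet period) might be sketched roughly as below. This is a minimal illustration under stated assumptions, not code from the patch: the names `paritymap_sketch`, `pm_begin_write` and `pm_tick` are made up for the example, the cooldown of 8 ticks is arbitrary, and 4096 is simply the region count reported later in the thread. It also ignores in-flight writes and the actual flushing of the map to disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PM_REGIONS  4096  /* assumed region count */
#define PM_COOLDOWN 8     /* assumed: idle ticks before a region is cleaned */

struct paritymap_sketch {
    uint8_t dirty[PM_REGIONS / 8];  /* one bit per region; the on-disk part */
    uint8_t cooldown[PM_REGIONS];   /* in-memory idle countdown per region */
};

static void pm_init(struct paritymap_sketch *pm)
{
    memset(pm, 0, sizeof(*pm));
}

static bool pm_is_dirty(const struct paritymap_sketch *pm, unsigned r)
{
    assert(r < PM_REGIONS);
    return (pm->dirty[r / 8] >> (r % 8)) & 1;
}

/*
 * Call before issuing a write to region r.  Returns true when the bit
 * was newly set, i.e. when the map would have to reach disk before the
 * data write may proceed.  Every write restarts the region's cooldown.
 */
static bool pm_begin_write(struct paritymap_sketch *pm, unsigned r)
{
    bool was_clean = !pm_is_dirty(pm, r);

    pm->dirty[r / 8] |= (uint8_t)(1u << (r % 8));
    pm->cooldown[r] = PM_COOLDOWN;
    return was_clean;
}

/*
 * Periodic timer tick.  A dirty region is cleaned only once it has seen
 * PM_COOLDOWN ticks with no intervening write.  Returns the number of
 * regions cleaned on this tick.
 */
static unsigned pm_tick(struct paritymap_sketch *pm)
{
    unsigned cleaned = 0;

    for (unsigned r = 0; r < PM_REGIONS; r++) {
        if (!pm_is_dirty(pm, r))
            continue;
        if (pm->cooldown[r] > 0 && --pm->cooldown[r] > 0)
            continue;
        pm->dirty[r / 8] &= (uint8_t)~(1u << (r % 8));
        cleaned++;
    }
    return cleaned;
}
```

The point of the cooldown is that a busy region pays for one map update when it first becomes dirty and then nothing more until it goes quiet, which is why the overhead can stay low under sustained writes.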
|
|
Re: Improving RAIDframe Parity Handling: The Diff

On Tue, Oct 20, 2009 at 04:53:59PM -0400, Jed Davis wrote:
> Note in particular that, with the patch, a parity map will be used by
> default with any non-RAID-0 set. As far as compatibility with
> non-parity-map kernels, I spent a fair amount of time on this and it
> should Do The Right Thing -- the parity map is in addition to the
> existing global dirty bit which is maintained as before, and if the
> previous kernel to touch the RAID was not parity-map enabled, a
> parity-map kernel will detect this and disregard the parity map.

That sounds like a reasonable approach. How does a kernel with parity map support detect that there is no parity map present on disk?

> Another test of 10 parallel pkgsrc-untarrings went up to 9.8GB.

My server should be able to synchronise that in about two minutes, which is a great improvement over the two and a half hours I need now.

> Comments and questions (and testing) welcome.

From your patch it looks like it would be feasible to pull this up into the "netbsd-5" branch. Can you see any reason why that would not be possible?

Kind regards

--
Matthias Scheler    http://zhadum.org.uk/
|
|
Re: Improving RAIDframe Parity Handling: The Diff

Matthias Scheler <tron@...> writes:
> On Tue, Oct 20, 2009 at 04:53:59PM -0400, Jed Davis wrote:
>> As far as compatibility with non-parity-map kernels, I spent a fair
>> amount of time on this and it should Do The Right Thing -- the parity
>> map is in addition to the existing global dirty bit which is
>> maintained as before, and if the previous kernel to touch the RAID
>> was not parity-map enabled, a parity-map kernel will detect this and
>> disregard the parity map.
>
> That sounds like a reasonable approach. How does a kernel with
> parity map support detect that there is no parity map present on disk?

The RAID component label has a certain number of unused fields, which are zeroed on label creation and left unchanged otherwise. Thus, allocating a bit from one of them to indicate that a parity-map kernel has ever touched the RAID suffices. For determining what the last kernel to touch the RAID was, I'm storing a copy of the label modification counter (incremented whenever the label is changed to avoid combining inconsistent components) in another of the formerly unused fields; a non-parity-map kernel will leave the old number in place while incrementing the modification counter itself, so they will not match.

> From your patch it looks like it would be feasible to pull this
> up into the "netbsd-5" branch. Can you see any reason why that
> would not be possible?

At one point I thought there was a reason, but looking over the diff I'm not sure what that might have been. I think I'll try merging it and see what happens.

--
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))
|
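The label-based detection described in the message above (an "ever touched by a parity-map kernel" bit plus a copy of the modification counter) can be illustrated with a small sketch. The struct layout, field names and flag value are all hypothetical; only the logic follows the description, and the real component label has many more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LABEL_PMAP_EVER 0x1u  /* hypothetical flag: a parity-map kernel has touched this set */

struct label_sketch {
    uint32_t mod_counter;    /* bumped by every kernel whenever the label changes */
    uint32_t pmap_flags;     /* formerly unused field; zero on labels from old kernels */
    uint32_t pmap_mod_copy;  /* formerly unused field; copy of mod_counter */
};

/* What a parity-map-aware kernel does on each label update. */
static void label_update_new_kernel(struct label_sketch *l)
{
    assert(l != NULL);
    l->mod_counter++;
    l->pmap_flags |= LABEL_PMAP_EVER;
    l->pmap_mod_copy = l->mod_counter;
}

/* What an older kernel does: it knows nothing about the extra fields. */
static void label_update_old_kernel(struct label_sketch *l)
{
    assert(l != NULL);
    l->mod_counter++;
}

/*
 * The parity map is trusted only if a parity-map kernel has ever touched
 * the set AND was also the last kernel to update the label.
 */
static bool label_paritymap_valid(const struct label_sketch *l)
{
    assert(l != NULL);
    if ((l->pmap_flags & LABEL_PMAP_EVER) == 0)
        return false;
    return l->pmap_mod_copy == l->mod_counter;
}
```

When the check fails, the kernel falls back to the existing global dirty bit, so the worst case is a full parity check as before.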
|
Re: Improving RAIDframe Parity Handling: The Diff

On Sun, 01 Nov 2009, Jed Davis wrote:
> The RAID component label has a certain number of unused fields, which
> are zeroed on label creation and left unchanged otherwise. Thus,
> allocating a bit from one of them to indicate that a parity-map kernel
> has ever touched the RAID suffices. For determining what the last
> kernel to touch the RAID was, I'm storing a copy of the label
> modification counter (incremented whenever the label is changed to avoid
> combining inconsistent components) in another of the formerly unused
> fields; a non-parity-map kernel will leave the old number in place while
> incrementing the modification counter itself, so they will not match.

It seems to me that the extra counter would be sufficient, and that you don't also need the extra status bit. Essentially, you get the same information from (extra copy of modification counter != 0) as you get from the status bit.

--apb (Alan Barrett)
|
|
re: Improving RAIDframe Parity Handling: The Diff

hi!

i've been running with your patch for a day now, and i've tried to break it pretty hard, and i haven't succeeded. my notes:

- overall, i'm very impressed. the patch looks clean and i've not observed problems i would consider shipstoppers. i didn't look really closely at the changes themselves.

- seems to deal fine with normal reboots and also with hard power failures

- newfs tends to dirty a huge portion of zones. for my 250GiB filesystem, newfs dirtied 1491 out of 4096 zones, which is a few more than the total number of cyl groups:

      using 1425 cylinder groups of 184.30MB, 11795 blks, 23296 inodes.

  these zones cleared up a few minutes later, without syncing 1491 * 64MB, so this will only be a problem with a crash in the minutes after a newfs

- with 10 extractors of pkgsrc and one 'cvs co src xsrc' (and rm -rf's for the both) running all in parallel, i ended up with about 250 dirty zones out of 4096, which seems pretty high. i haven't seen it go beyond 514 except for the newfs case.. but >1/8th seems a lot.

- (nit) "raidctl -s" output is confusing for parity reconstruction. the percentage done doesn't seem to make sense for me now. from a guess, it is not considering in-sync but beyond the current sync-point as being in-sync so that the percentage done number grows at strange speeds, slow while in a dirty zone, and rapidly while skipping clean zones.

- have not done any performance measurements

- might be nice to add a comment to the RAIDFRAME_SET_COMPONENT_LABEL ioctl that the new #if 0'ed code is not well tested?

- be nice to get answers from some one (hi greg!) from your XXXjld's

great work!

.mrg.
|
|
Re: Improving RAIDframe Parity Handling: The Diff

Jed Davis <jld@...> writes:
> Matthias Scheler <tron@...> writes:
>> From your patch it looks like it would be feasible to pull this
>> up into the "netbsd-5" branch. Can you see any reason why that
>> would not be possible?
>
> At one point I thought there was a reason, but looking over the diff I'm
> not sure what that might have been. I think I'll try merging it and see
> what happens.

What happens is that it builds cleanly, and minimal testing under QEMU hasn't broken it.

Looking at the diff between my diffs revealed a mistake in the one I originally posted -- I seem to have gotten a merge-with-HEAD wrong and accidentally reverted r1.265 of rf_netbsdkintf.c. So, here are new diffs:

http://www.NetBSD.org/~jld/gsoc09-1103.diff (-current)
http://www.NetBSD.org/~jld/gsoc09-1103-n5.diff (netbsd-5)

I've checked that they both describe the same difference from their respective branch heads.

--
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))
|
|
Re: Improving RAIDframe Parity Handling: The Diff

On Tue, Nov 03, 2009 at 06:02:05PM +1100, matthew green wrote:
> - newfs tends to dirty a huge portion of zones. for my 250GiB filesystem,
>   newfs dirtied 1491 out of 4096 zones, which is a few more than the total
>   number of cyl groups:
>
>   using 1425 cylinder groups of 184.30MB, 11795 blks, 23296 inodes.
>
>   these zones cleared up a few minutes later, without syncing 1491 * 64MB,
>   so this will only be a problem with a crash in the minutes after a newfs

I'm sorry but I don't understand why this is a problem. It should only mean that the parity rewrite takes longer. What am I missing?

Kind regards

--
Matthias Scheler    http://zhadum.org.uk/
|
|
Re: Improving RAIDframe Parity Handling: The Diff

tron@... (Matthias Scheler) writes:

>I'm sorry but I don't understand why this is a problem. It should only
>mean that the parity rewrite takes longer. What am I missing?

Maybe the zones could be optimized for this case, but I doubt that this is possible without degrading performance for the normal case (i.e. writing files sequentially).

--
Michael van Elst
Internet: mlelstv@...
"A potential Snark may lurk in every tree."
|
|
Re: Improving RAIDframe Parity Handling: The Diff

On Wed, 4 Nov 2009 19:40:03 +0000 (UTC)
mlelstv@... (Michael van Elst) wrote:

> tron@... (Matthias Scheler) writes:
>
> >I'm sorry but I don't understand why this is a problem. It should only
> >mean that the parity rewrite takes longer. What am I missing?
>
> Maybe the zones could be optimized for this case, but I doubt that
> this is possible without degrading performance for the normal case
> (i.e. writing files sequentially).

If your disk is "all busy, all the time", then parity zones aren't going to buy you much... Even if there are 500 of 4000 zones marked as 'dirty' the checking is still going to be nearly an order of magnitude faster than before.

One could keep the number of dirty zones down to a bare minimum by updating the zone status at the end of every write, but then performance would be abysmal. But really, one only intends on using this if the system happens to crash, and so it's all about finding the balance between performance now and performance after an event that one doesn't want nor expect to happen for a while...

Later...

Greg Oster
|
|
re: Improving RAIDframe Parity Handling: The Diff

   On Tue, Nov 03, 2009 at 06:02:05PM +1100, matthew green wrote:
   > - newfs tends to dirty a huge portion of zones. for my 250GiB filesystem,
   >   newfs dirtied 1491 out of 4096 zones, which is a few more than the total
   >   number of cyl groups:
   >
   >   using 1425 cylinder groups of 184.30MB, 11795 blks, 23296 inodes.
   >
   >   these zones cleared up a few minutes later, without syncing 1491 * 64MB,
   >   so this will only be a problem with a crash in the minutes after a newfs

   I'm sorry but I don't understand why this is a problem. It should only
   mean that the parity rewrite takes longer. What am I missing?

nothing. that's exactly what i was saying. if the system were to crash shortly after newfs, >1/3rd of my map was dirty. that's a lot. but that's all i was trying to say above...

.mrg.
|
|
Re: Improving RAIDframe Parity Handling: The Diff

On Wed, Nov 04, 2009 at 01:55:54PM -0600, Greg Oster wrote:
> If your disk is "all busy, all the time", then parity zones aren't going
> to buy you much...

Indeed. And I doubt this can or should be fixed in FFS. A log structured file-system would probably work better in this case.

> Even if there are 500 of 4000 zones marked as 'dirty' the checking
> is still going to be nearly an order of magnitude faster than before.

Even 2000 out of 4000 zones would speed up parity rebuild on my system by over an hour. On a modern (almost) 2TB hard disk each zone would be 0.5GB large. Not copying it around saves a lot of time.

Kind regards

--
Matthias Scheler    http://zhadum.org.uk/
|
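The arithmetic behind the figures in the last two messages can be checked with a short sketch. It assumes, as a simplification, that the cost of a parity check is proportional to the number of bytes in dirty regions; the function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes covered by one region when a set is split into `regions` pieces (rounded up). */
static uint64_t region_bytes(uint64_t set_bytes, uint32_t regions)
{
    assert(regions > 0);
    return (set_bytes + regions - 1) / regions;
}

/* Upper bound on the bytes that must be re-checked after a crash. */
static uint64_t resync_bytes(uint64_t set_bytes, uint32_t regions, uint32_t dirty)
{
    assert(dirty <= regions);
    return region_bytes(set_bytes, regions) * dirty;
}
```

With 4096 regions on a 2 * 10^12 byte disk, `region_bytes` gives 488281250 bytes, which is the "0.5GB" per zone mentioned above. With 500 of 4000 regions dirty, `resync_bytes` is one eighth of the set, matching the "nearly an order of magnitude" estimate; with half the regions dirty it is half the set.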
|
Re: Improving RAIDframe Parity Handling: The Diff

On Tue, Nov 03, 2009 at 01:35:10PM -0500, Jed Davis wrote:
> So, here are new diffs:
>
> http://www.NetBSD.org/~jld/gsoc09-1103.diff (-current)
> http://www.NetBSD.org/~jld/gsoc09-1103-n5.diff (netbsd-5)

I've looked at the diffs. Here are my comments:

1.) You use "u_int" and "int" inside a structure which defines the on-disk data. I wonder whether "uint32_t" and "int32_t" would be the better choice.

2.) rf_paritymap_test() should return a "bool".

3.) Can you please comment rf_paritymap_begin_region() and rf_paritymap_end_region()?

4.) Could "lk_flags" be removed if you use atomic_ops(3) to update the "flags" field of a parity map? Your locking looks safe (because you stick to the defined order). But I feel somehow uneasy about this.

Kind regards

--
Matthias Scheler    http://zhadum.org.uk/
|
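Comment 1.) above is about pinning down the size of on-disk fields. A minimal sketch of the idea follows; the struct and its field names are hypothetical and are not taken from the patch. It only shows how fixed-width types plus compile-time checks make the layout independent of the platform's "int".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk parameter block using fixed-width types. */
struct pmap_ondisk_sketch {
    uint32_t flags;           /* status bits */
    int32_t  cooldown_ticks;  /* idle ticks before a region is marked clean */
    uint32_t region_count;    /* number of regions the set is divided into */
};

/* Fail the build, rather than corrupt a label, if the layout ever drifts. */
static_assert(sizeof(struct pmap_ondisk_sketch) == 12,
    "on-disk parity map parameters must be exactly 12 bytes");
static_assert(offsetof(struct pmap_ondisk_sketch, region_count) == 8,
    "region_count must sit at byte offset 8");
```

Fixed-width types settle the size question only; byte order across architectures is a separate concern that this sketch does not address.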
|
Re: Improving RAIDframe Parity Handling: The Diff

matthew green <mrg@...> writes:
> - newfs tends to dirty a huge portion of zones. for my 250GiB filesystem,
>   newfs dirtied 1491 out of 4096 zones, which is a few more than the total
>   number of cyl groups:
>
>   using 1425 cylinder groups of 184.30MB, 11795 blks, 23296 inodes.
>
>   these zones cleared up a few minutes later, without syncing 1491 * 64MB,
>   so this will only be a problem with a crash in the minutes after a newfs

Indeed; they were written to, which usually means that there might be more writes nearby soon (but happens not to mean that in this case). Note that the "few minutes later" is adjustable via raidctl(8); see under "-M set"...

> - with 10 extractors of pkgsrc and one 'cvs co src xsrc' (and rm -rf's for
>   the both) running all in parallel, i ended up with about 250 dirty zones
>   out of 4096, which seems pretty high. i haven't seen it go beyond 514
>   except for the newfs case.. but >1/8th seems a lot.

...meaning that, if one wanted to have fewer dirty zones at the cost of a bit more overhead from parity map updates, that can be accomplished. At one point I'd had thoughts of making the parameters automagically adjust based on load, but that seemed... delicate, at best, and the fixed settings do still beat the status quo by quite a bit.

> - (nit) "raidctl -s" output is confusing for parity reconstruction. the
>   percentage done doesn't seem to make sense for me now. from a guess, it
>   is not considering in-sync but beyond the current sync-point as being
>   in-sync so that the percentage done number grows at strange speeds, slow
>   while in a dirty zone, and rapidly while skipping clean zones.

It's reporting how far through the RAID set it is, just like before; it's just that some parts of the disk are now "checked" without doing any I/O, so they should complete more or less instantaneously. It looks like this can actually be fixed without too much work, or doing anything to userland.

> - might be nice to add a comment to the RAIDFRAME_SET_COMPONENT_LABEL ioctl
>   that the new #if 0'ed code is not well tested?

Might be. I suspect that the old #if 0'ed code may not have been well tested, either.

--
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))
|
|
Re: Improving RAIDframe Parity Handling: The Diff

mlelstv@... (Michael van Elst) writes:
> tron@... (Matthias Scheler) writes:
[newfs]
>>I'm sorry but I don't understand why this is a problem. It should only
>>mean that the parity rewrite takes longer. What am I missing?
>
> Maybe the zones could be optimized for this case, but I doubt that
> this is possible without degrading performance for the normal case
> (i.e. writing files sequentially).

There are also alternate cooldown strategies that might help newfs-like access patterns and wouldn't harm sequential writes, but I don't know how they'd do with the many-small-files cases. This might be worth experimenting with, now that I think about it.

--
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))
|
|
Re: Improving RAIDframe Parity Handling: The Diff

Matthias Scheler <tron@...> writes:
> On Tue, Nov 03, 2009 at 01:35:10PM -0500, Jed Davis wrote:
>> So, here are new diffs:
>>
>> http://www.NetBSD.org/~jld/gsoc09-1103.diff (-current)
>> http://www.NetBSD.org/~jld/gsoc09-1103-n5.diff (netbsd-5)
>
> I've looked at the diffs. Here are my comments:
> 1.) You use "u_int" and "int" inside a structure which defines
>     the on-disk data. I wonder whether "uint32_t" and "int32_t"
>     would be the better choice.

I wondered that as well. Note that all the existing label fields are the same way (and, in particular, the int array that I took my fields out of). Do we have any platforms with non-32-bit "int" that would break from changing one to the other?

> 2.) rf_paritymap_test() should return a "bool".

I didn't realize at the time that we had a "bool"; that can be done (and there are probably some variables that can be altered likewise).

> 4.) Could "lk_flags" be removed if you use atomic_ops(3) to update
>     the "flags" field of a parity map? Your locking looks safe
>     (because you stick to the defined order). But I feel somehow
>     uneasy about this.

I think it could; at the time I might not have known I wasn't going to wind up putting more fields under it, or something along those lines, but at this point I think the change can be made.

--
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))
|
|
Re: Improving RAIDframe Parity Handling: The Diff

Alan Barrett <apb@...> writes:
> On Sun, 01 Nov 2009, Jed Davis wrote:
>> The RAID component label has a certain number of unused fields, which
>> are zeroed on label creation and left unchanged otherwise. Thus,
>> allocating a bit from one of them to indicate that a parity-map kernel
>> has ever touched the RAID suffices. For determining what the last
>> kernel to touch the RAID was, I'm storing a copy of the label
>> modification counter (incremented whenever the label is changed to avoid
>> combining inconsistent components) in another of the formerly unused
>> fields; a non-parity-map kernel will leave the old number in place while
>> incrementing the modification counter itself, so they will not match.
>
> It seems to me that the extra counter would be sufficient, and that you
> don't also need the extra status bit. Essentially, you get the same
> information from (extra copy of modification counter != 0) as you get
> from the status bit.

While it's quite unlikely that the mod counter would ever wrap, I thought it best to have an explicit indicator; also, there are other flags in that flag word, so there isn't much gained by removing one of them.

--
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))
|
|
re: Improving RAIDframe Parity Handling: The Diff

   > tron@... (Matthias Scheler) writes:
   [newfs]
   >>I'm sorry but I don't understand why this is a problem. It should only
   >>mean that the parity rewrite takes longer. What am I missing?
   >
   > Maybe the zones could be optimized for this case, but I doubt that
   > this is possible without degrading performance for the normal case
   > (i.e. writing files sequentially).

   There are also alternate cooldown strategies that might help newfs-like
   access patterns and wouldn't harm sequential writes, but I don't know
   how they'd do with the many-small-files cases. This might be worth
   experimenting with, now that I think about it.

i'd like to make it clear that i was not trying to say that the behaviour with newfs is a problem, just something i observed. if it can be mitigated that would be great, but i don't think this is a problem of any major sort...

.mrg.
|
|
Re: Improving RAIDframe Parity Handling: The Diff

On Thu, Nov 05, 2009 at 12:34:41PM -0500, Jed Davis wrote:
> > 4.) Could "lk_flags" be removed if you use atomic_ops(3) to update
> >     the "flags" field of a parity map? Your locking looks safe
> >     (because you stick to the defined order). But I feel somehow
> >     uneasy about this.
>
> I think it could; at the time I might not have known I wasn't going to
> wind up putting more fields under it, or something along those lines,
> but at this point I think the change can be made.

I think it might be worthwhile to do that because a single atomic operation will be cheaper.

Kind regards

--
Matthias Scheler    http://zhadum.org.uk/
|
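The suggestion in point 4.) is to replace a lock that protects only a flags word with atomic read-modify-write operations. The sketch below uses C11 `<stdatomic.h>` as a portable stand-in for the kernel's atomic_ops(3) family; the struct, flag names and flag values are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PM_FLAG_DISABLE 0x1u  /* hypothetical flag bits */
#define PM_FLAG_DIRTY   0x2u

struct pm_flags_sketch {
    _Atomic uint32_t flags;  /* updated without holding any lock */
};

/* Set bits with a single atomic OR; no lock/unlock pair is needed. */
static void pm_flag_set(struct pm_flags_sketch *pm, uint32_t bits)
{
    assert(bits != 0);
    atomic_fetch_or(&pm->flags, bits);
}

/* Clear bits with a single atomic AND of the complement. */
static void pm_flag_clear(struct pm_flags_sketch *pm, uint32_t bits)
{
    assert(bits != 0);
    atomic_fetch_and(&pm->flags, ~bits);
}

static bool pm_flag_test(struct pm_flags_sketch *pm, uint32_t bits)
{
    return (atomic_load(&pm->flags) & bits) != 0;
}
```

This only works while the flags word is the sole state the lock protected; if another field has to change together with the flags, a lock is still required.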
|
Re: Improving RAIDframe Parity Handling: The Diff

On Sat, Nov 07, 2009 at 11:23:25AM +0000, Matthias Scheler wrote:
> On Thu, Nov 05, 2009 at 12:34:41PM -0500, Jed Davis wrote:
> > > 4.) Could "lk_flags" be removed if you use atomic_ops(3) to update
> > >     the "flags" field of a parity map? Your locking looks safe
> > >     (because you stick to the defined order). But I feel somehow
> > >     uneasy about this.
> >
> > I think it could; at the time I might not have known I wasn't going to
> > wind up putting more fields under it, or something along those lines,
> > but at this point I think the change can be made.
>
> I think it might be worthwhile to do that because a single atomic
> operation will be cheaper.

BTW: this is only a suggestion for a minor optimisation, not a required change before you commit.

I have been running your "netbsd-5" patch on my server for a few hours now and the system works fine so far.

Kind regards

--
Matthias Scheler    http://zhadum.org.uk/
|
|
Re: Improving RAIDframe Parity Handling: The Diff

On Tue, Nov 03, 2009 at 01:35:10PM -0500, Jed Davis wrote:
> So, here are new diffs:
>
> http://www.NetBSD.org/~jld/gsoc09-1103.diff (-current)
> http://www.NetBSD.org/~jld/gsoc09-1103-n5.diff (netbsd-5)

I am running with your netbsd-5 diff and it seems to work fine. With the default parameters (4096 regions of 86MB), "normal" usage keeps between 1 and 2% dirty, with peaks up to ~10%.

Geert

--
Geert Hendrickx -=- ghen@... -=- PGP: 0xC4BB9E9F
This e-mail was composed using 100% recycled spam messages!