Some more questions, preallocation

Thanks for the replies to my questions of some time ago, and now
some more questions.

I have become interested in the subject of preallocation, which I suspect is particularly important for a filesystem with extents and a buddy system free list.

Having thought about it, the use of a buddy system for file space allocation seems a particularly good idea, because the one or two big problems of the buddy system for memory allocation are not relevant for files.

But it can be less than optimal if files are written piecemeal.

What I want to know is how a common case like extracting a number of files from a 'tar' archive works, in particular in terms of preallocation and how many extents result.

So some questions about JFS as it is now:

* How to see how many extents of which size have been allocated to an inode from the command line?

* How to list the free list from the command line?

* Suppose a filesystem is empty, and 'tar' extracts to it an 8MiB file, writing 32KiB blocks. How many extents will it span? After it has been written, what will the free list look like?

* How to preallocate space? As in, allocate to a file much more space than its size. For example, when writing to a file it may be known in advance that it will take 8MiB ('cp', 'tar', 'wget', ...), so create the file with a size of 0 but 8MiB allocated.

* Would 'ftruncate'(2) (or the mythical 'posix_fallocate'(2)) with an argument greater than the size of the file do a preallocation as per X/Open? It does not seem so now as in 'jfs_truncate_nolock()' the test looks like '(newsize > length)'.

  http://WWW.OpenGroup.org/onlinepubs/009695399/functions/ftruncate.html
  http://WWW.OpenGroup.org/onlinepubs/009695399/functions/posix_fallocate.html

* How hard would it be to add global and/or per-filesystem (permanent or at mount time) tunables to set:

  - a minimum extent size? For example to ensure that no extent smaller than 1MiB is allocated? The purpose is to ensure that files tend to be contiguous.

  - alternatively, a default minimum extent size? So that the extents are initially allocated of that size, but can be reduced by 'close'(2) or 'ftruncate'(2) to the actual size of the file. For example so that when extracting from 'tar' a minimum extent size of say 256KiB is used, but when the file is closed or truncated the last extent can get chopped to less than that.

  - a maximum extent size? For example to ensure that no extent larger than 256KiB is ever allocated? The purpose is to minimize internal fragmentation by allocating only at the lower levels of the buddy system.

I hope that the rationales are fairly clear; part of that is to short circuit when possible the somewhat hairy ''hint'' related logic in 'jfs_dmap.c' and that in 'jfs_open()' for example.

While these are common strategies, I suspect that preallocation of one form or another is better as the above may impair locality.

I have also noticed 'XAD_NOTRECORDED' that seems to indicate that preallocation is indeed being done or at least anticipated.

-------------------------------------------------------
This SF.Net email is sponsored by the JBoss Inc.
Get Certified Today * Register for a JBoss Training Course
Free Certification Exam for All Training Attendees Through End of 2005
Visit http://www.jboss.com/services/certification for more information
_______________________________________________
Jfs-discussion mailing list
Jfs-discussion@...
https://lists.sourceforge.net/lists/listinfo/jfs-discussion
|
|
Re: Some more questions, preallocation

On Mon, 2005-10-24 at 01:06 +0100, Peter Grandi wrote:
> Thanks for the replies to my questions of some time ago, and now
> some more questions.
>
> I have become interested in the subject of preallocation, which
> I suspect is particularly important for a filesystem with
> extents and a buddy system free list.
>
> Having thought about it, the use of a buddy system for file
> space allocation seems a particularly good idea, because the one
> or two big problems of the buddy system for memory allocation are
> not relevant for files.
>
> But it can be less than optimal if files are written piecemeal.
>
> What I want to know is how a common case like extracting a
> number of files from a 'tar' archive works, in particular
> in terms of preallocation and how many extents result.
>
> So some questions about JFS as it is now:
>
> * How to see how many extents of which size have been allocated
>   to an inode from the command line?

You can use jfs_debugfs to see the xtree of an inode.

> * How to list the free list from the command line?

There is no free list. The block map is a binary-buddy tree, where the leaves contain bitmaps. Again, jfs_debugfs can be used to look at the block map. Use the dmap subcommand. Hopefully, it's not too hard to navigate around.

> * Suppose a filesystem is empty, and 'tar' extracts to it an
>   8MiB file, writing 32KiB blocks. How many extents will it
>   span? After it has been written, what will the free list
>   look like?

I just tried creating a file with 'dd bs=32768 count=256' on a newly formatted jfs volume, and it went into one extent.

> * How to preallocate space? As in, allocate to a file much more
>   space than its size. For example, when writing to a file it
>   may be known in advance that it will take 8MiB ('cp', 'tar',
>   'wget', ...), so create the file with a size of 0 but 8MiB
>   allocated.

There's no code to do that now, but we could create an extent and mark it ABNR (allocated but not recorded). This is a holdover from OS/2, where the default behavior was to have dense files, rather than sparse ones.

Ideally, generic_file_write would use the get_blocks file system callback to allocate all of the blocks needed for a write at one time. Currently, space is allocated one page at a time. Of course, jfs will always try to allocate the next consecutive block and append it to an existing extent if possible, so if the free space is not too fragmented, we generally end up with large extents.

> * Would 'ftruncate'(2) (or the mythical 'posix_fallocate'(2))
>   with an argument greater than the size of the file do a
>   preallocation as per X/Open? It does not seem so now as in
>   'jfs_truncate_nolock()' the test looks like
>   '(newsize > length)'.

No space will be allocated.

> http://WWW.OpenGroup.org/onlinepubs/009695399/functions/ftruncate.html
> http://WWW.OpenGroup.org/onlinepubs/009695399/functions/posix_fallocate.html
>
> * How hard would it be to add global and/or per-filesystem
>   (permanent or at mount time) tunables to set:
>
>   - a minimum extent size? For example to ensure that no extent
>     smaller than 1MiB is allocated? The purpose is to ensure
>     that files tend to be contiguous.

Non-trivial, but it shouldn't be a show-stopper. jfs_fsck would have to be taught to not flag an error if blocks are allocated beyond the file's size. How would we want to handle low-free-space conditions? Allow shorter allocations if no 1 MB allocations are possible? Would we need a mechanism to free unused preallocations?

>   - alternatively, a default minimum extent size? So that the
>     extents are initially allocated of that size, but can be
>     reduced by 'close'(2) or 'ftruncate'(2) to the actual size
>     of the file. For example so that when extracting from 'tar'
>     a minimum extent size of say 256KiB is used, but when the
>     file is closed or truncated the last extent can get chopped
>     to less than that.

A jfs volume is logically divided into a number of allocation groups (AGs). While a file is open, jfs will always try to put allocations to other files in a separate AG. This generally works pretty well, in that sequential operations, like untarring archives, will close one file before opening the next, so groups of files are put near each other with no wasted space between them, whereas a file being created by another task will be allocated somewhere else on disk.

>   - a maximum extent size? For example to ensure that no extent
>     larger than 256KiB is ever allocated? The purpose is to
>     minimize internal fragmentation by allocating only at the
>     lower levels of the buddy system.

I'm not sure what that would buy us.

> I hope that the rationales are fairly clear; part of that is to
> short circuit when possible the somewhat hairy ''hint'' related
> logic in 'jfs_dmap.c' and that in 'jfs_open()' for example.

I don't understand the problem with the "hint". The hints are used to attempt to allocate file data near the inode, or to append onto existing extents when the following blocks are available.

> While these are common strategies, I suspect that preallocation
> of one form or another is better as the above may impair locality.

I think preallocation may be useful in some circumstances, i.e. when a file is created non-sequentially, but I am concerned that leaving preallocated, but unused, blocks between actual file data will result in more fragmentation, or just wasted space on the disk.

> I have also noticed 'XAD_NOTRECORDED' that seems to indicate
> that preallocation is indeed being done or at least anticipated.

This dates back to the roots of jfs in OS/2 where files were dense by default, so non-sequential writes would cause ABNR (as I mentioned above) extents to be created where data had not yet been written. It would be possible to add support for dense files in Linux, but it hasn't really been asked for.
--
David Kleikamp
IBM Linux Technology Center
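
A toy model may help picture the behaviour described in the reply above, where space is allocated a block at a time and each new block is appended to the current extent whenever the next consecutive block is free. This is only a sketch under stated assumptions: the list-of-booleans free map, the `write_file` name and the first-fit fallback are made up for illustration and are not the JFS block map or its hint logic.

```python
def write_file(free, nblocks, hint=0):
    """Allocate nblocks one block at a time from a toy free map.

    free    -- list of booleans, True meaning the block is free (hypothetical model)
    nblocks -- number of blocks the file needs
    hint    -- block to try first for a brand new file
    Returns a list of [start_block, length] extents.
    """
    extents = []
    for _ in range(nblocks):
        # Prefer the block immediately after the current extent.
        nxt = extents[-1][0] + extents[-1][1] if extents else hint
        if nxt < len(free) and free[nxt]:
            blk = nxt
        else:
            # Assumed fallback: first free block anywhere (ValueError if full).
            blk = free.index(True)
        free[blk] = False
        if extents and blk == extents[-1][0] + extents[-1][1]:
            extents[-1][1] += 1       # grow the existing extent in place
        else:
            extents.append([blk, 1])  # start a new extent
    return extents


if __name__ == "__main__":
    empty = [True] * 4096
    print(write_file(empty, 2048))    # 8MiB of 4KiB blocks on an empty map

    holed = [True] * 4096
    holed[100] = False                # one block already in use
    print(write_file(holed, 200))
```

In this model an 8MiB file on an empty map ends up as a single extent that simply grew, which is consistent with the 'dd' result reported above, and a single used block in the way is enough to split the file in two.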
|
|
Re: Some more questions, preallocation

>>> On Sun, 23 Oct 2005 22:53:02 -0500, Dave Kleikamp
>>> <shaggy@...> said:

shaggy> On Mon, 2005-10-24 at 01:06 +0100, Peter Grandi wrote:

[ ... ]

>> * How to see how many extents of which size have been
>>   allocated to an inode from the command line?
>> * How to list the free list from the command line?

shaggy> There is no free list. The block map is a binary-buddy
shaggy> tree, where the leaves contain bitmaps.

Ahhh, I mentioned that the free list is a buddy tree somewhere else. I have done _some_ homework :-).

shaggy> [ ... 'jfs_debugfs' and 'xtree' and 'dmap' ... ]
shaggy> Hopefully, it's not too hard to navigate around.

Well, I was angling for something neater than using 'jfs_debugfs', but perhaps I can wrap it up in some script...

>> * Suppose a filesystem is empty, and 'tar' extracts to it an
>>   8MiB file, writing 32KiB blocks. How many extents will it
>>   span? After it has been written, what will the free list
>>   look like?

shaggy> I just tried creating a file with 'dd bs=32768
shaggy> count=256' on a newly formatted jfs volume, and it went
shaggy> into one extent.

Yes, this is more or less what I was expecting (that is, it obviously first creates 32KiB extents and then coalesces them after writing them).

[ ... ]

shaggy> There's no code to do that now, but we could create an
shaggy> extent and mark it ABNR (allocated but not recorded).
shaggy> This is a holdover from OS/2, where the default behavior
shaggy> was to have dense files, rather than sparse ones.

Uhm, instead of having ABNR one could simply have the convention that the whole area between file size and length be zero-on-demand, whether allocated or not. ABNR seems meant for a hole ''in the middle'' of a file, and the difference between that and a preallocated area beyond its end is that while in the former case the preallocated area is accessible, in the latter it is not, so if the former is unwritten one must flag it specially or actually zero it; the latter needs neither (the flag is its position). However a 'seek' beyond the current file end might require turning some of that into ABNR extents or zeroing them... Bah!

shaggy> [ ... ] Currently, space is allocated one page at a
shaggy> time. [ ... appending ... ] we generally end up with
shaggy> large extents.

Yes, but this results in rather hairy code to handle after-the-fact coalescing of buddy extents (I guess the JFS terminology is the ''backsplits'' mentioned in the comments), and can result in other suboptimal behaviour, more later on this.

[ ... ]

>>   - a minimum extent size? For example to ensure that no extent
>>     smaller than 1MiB is allocated? The purpose is to ensure
>>     that files tend to be contiguous.

shaggy> Non-trivial, but it shouldn't be a show-stopper.
shaggy> jfs_fsck would have to be taught to not flag an error if
shaggy> blocks are allocated beyond the file's size.

I believe this would be a generally good idea, and modifying 'jfs_fsck' for this might be pretty easy. I might try...

shaggy> [ ... ] Would we need a mechanism to free unused
shaggy> preallocations?

For an ''unconditional'' minimum extent size that was a tunable, I would not handle low space conditions at all.

And actually in general, for three reasons:

* In any case it is a user tunable. The user can always leave the 1 block default unchanged.

* If you have a low space condition where one (or even more than one) 1MiB makes a difference on a say 40GiB filesystem, who cares...

* In any case several memory allocation studies (but we can sort of rely on them here too) show that in most cases it is pointless to handle low free memory conditions cleverly, because in such cases it just delays an almost inevitable overflow, and does not avoid it. Again, especially if the margin is of the order of 1MiB over a 40GiB arena.

Also, preallocations can be ephemeral (disappear on close) or persistent. The former are useful in their own right and have very little downside.

>>   - alternatively, a default minimum extent size? So that the
>>     extents are initially allocated of that size, but can be
>>     reduced by 'close'(2) or 'ftruncate'(2) to the actual size
>>     of the file. [ ... ]

shaggy> A jfs volume is logically divided into a number of
shaggy> allocation groups (AGs). While a file is opened, jfs
shaggy> will always try to put allocations to other files in a
shaggy> separate AG. This generally works pretty well, [ ... ]

Yes, this is like in BSD FFS and 'ext[23]' cylinder groups, where it is also used to ensure all ''groups'' have some free space.

As you say, it works well especially to prevent interleaving on parallel writes. But I surmise that a preallocate logic delivers the same result more reliably, with a few more advantages.

>>   - a maximum extent size? For example to ensure that no extent
>>     larger than 256KiB is ever allocated? The purpose is to
>>     minimize internal fragmentation by allocating only at the
>>     lower levels of the buddy system.

shaggy> I'm not sure what that would buy us.

Well, it would prevent really large extents from happening. In a buddy system very large allocations (with respect to the size of the whole) cause trouble.

>> I hope that the rationales are fairly clear;

Well, perhaps a bit more is needed, so bear with me if I repeat a bit of the obvious here, to illustrate the rest.

A buddy system has two big problems for _memory_ allocation: that it does not coalesce buddies that are not part of the same branch (''buddy system'' in the language of the comments in the JFS source) and that it can waste a lot of space in internal fragmentation because of the power-of-2 thing.

These problems are big for memory allocation, especially if allocations are not very small with respect to the size of the arena. But they are essentially irrelevant for file allocation, and thus the use of a buddy system for JFS is fairly brilliant, because files do not need to be wholly contiguous, unlike memory blocks. It is only for performance reasons that they should be mostly contiguous.
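
The internal-fragmentation point can be illustrated with a little arithmetic. A single buddy block has a power-of-two size, so a memory allocator has to round a request up, whereas a file can be stored as several smaller power-of-two extents. The helper names below are made up for this sketch and the sizes are in arbitrary units (say MiB); nothing here is taken from the JFS source.

```python
def round_up_pow2(n):
    """Smallest power of two that is >= n, for n >= 1 (one contiguous buddy block)."""
    return 1 << (n - 1).bit_length()


def split_pow2(n):
    """Split n units into distinct power-of-two pieces, largest first.

    This is just the binary expansion of n, so the pieces add up to n exactly.
    """
    return [1 << b for b in reversed(range(n.bit_length())) if (n >> b) & 1]


if __name__ == "__main__":
    size = 11
    one_block = round_up_pow2(size)
    pieces = split_pow2(size)
    print("single block:", one_block, "wasted:", one_block - size)
    print("split pieces:", pieces, "wasted:", sum(pieces) - size)
```

A memory block has to be the single rounded-up allocation, while a file can take the split form at the cost of a few extra extents, which is why the power-of-two waste matters far less for files.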

The size of an optimally contiguous extent should depend mostly on the speed characteristics of the disc, which are fairly constant. If one has discs with a seek time of 5-10ms, that do a full rotation in 5-10ms, and have a transfer rate of 50-100MiB/s, that is 50-100KiB/ms, it is not that essential to have 100MiB contiguous regions; a MiB at a time or so (that is a few ms at a time) is pretty reasonable.

So for example an 11MiB file should ideally be not in one 16MiB extent, but in three, 8+2+1MiB (or arguably 8+4MiB), and now I wonder what happens with JFS -- I shall try.

At the same time truly large extents ''tie up'' many higher levels of the buddy tree, and too much of an AG. So instead of having a say 1GiB file as a single extent, having it as several 64MiB (or whatever) extents may be perfectly reasonable and give some advantages as to performance. Even DM/LVM2 allocates logical volumes with that kind of granularity.

Conversely, when allocating smaller files, having a minimum advisory extent size of say 256KiB can help ensure contiguity even in massive multithreaded writes or on fragmented free lists.

Also, in general preallocation (that is the ability of having file length != size by semi-arbitrary amounts) can help considerably in several special cases.

I have in mind in particular these scenarios:

* In many important cases the end size of a file is well known in advance when the file is created. So one might as well preallocate the whole file in advance.

* In many important cases files are overwritten with contents of much the same size as they were (e.g. recompiling a '.c' into a '.o'). So one might as well, tentatively, preserve the previously allocated size on a 'ftruncate', and then do it for real on a 'close'.

The two above points (which would be greatly enhanced by trivial changes to 'libc' and some common utilities) are relevant to me because I care also about minimizing ''churn'' over the lifetime of a filesystem, not just how well it performs freshly laid out, where the free list is a single block to start with and remains totally contiguous.

  Note: indeed considering that many filesystems are created from a 'tar x' (resulting in a ''perfect'' layout) and then updated, overwrites would help preserve the initially ''perfect'' layout.

I have another scenario in mind:

* DM is basically a simple minded ''manual allocation of extents'' filesystem, and LVM2 is basically '-o loop' over it.

* Imagine a 2300GiB JFS filesystem, with a minimum extent size of 1GiB and a maximum extent size of say 16GiB (never mind the AG limits :->), mounted perhaps with '-o nointegrity'.

* Such a filesystem plus '-o loop' (built on 'md' if needed) looks to me like a ''for free'' LVM, and with essentially the same performance, and with no need for special utilities or configuration.

>> [ ... ] part of that is to short circuit when possible the
>> somewhat hairy ''hint'' related logic in 'jfs_dmap.c' and
>> that in 'jfs_open()' for example.

shaggy> I don't understand the problem with the "hint". The
shaggy> hints are used to attempt to allocate file data near
shaggy> the inode, or to append onto existing extents when the
shaggy> following blocks are available.

Ah yes sure, but they have a hairy logic and they have conceivable performance limitations.

My understanding of the current logic is to allocate small extents (the size of a 'write'(2) I guess), use the hint to put them near each other and then coalesce them if possible (if the hints worked that well).

But this is complex and has some limitations (which do not happen if one is doing an initial load of a filesystem, but otherwise may bite). It is designed to make the best of a bad lot, if one does not know in advance the end size of a file.

Allocating larger extents to start with and then cutting them down on 'close' or whatever seems to me like it could be rather more successful if the free list (the buddy tree) is already a bit fragmented, and in any case it directly handles several common cases where the end size is known in advance.

Another way: the difference between attempting to allocate directly a 1MiB extent and something like 64x16KiB ones is that the first does a top-down scan of the buddy tree, while the second does in effect a bottom-up one, trying to find a 1MiB free block ''after the fact''. Even with hints, the latter probably only works well if the free list is really almost unfragmented.

Also, there may be several 1MiB free blocks in a somewhat fragmented free list, but it is random whether the 16KiB extents get allocated inside one, so that after-the-fact coalescing them back into the original 1MiB one might not work that well.

[ ... ]

shaggy> I think preallocation may be useful in some
shaggy> circumstances, i.e. when a file is created
shaggy> non-sequentially,

I think that, apart from non-sequential or parallel writes (and the AG switch helps in the latter case), it is also useful when, by the time the appends happen, the free list (buddy tree) is already fragmented (there the AG switch does not help).

The general case for preallocation is made for example in this 'ext[23]' paper:

  http://WWW.USENIX.org/events/usenix02/tech/freenix/full_papers/tso/tso_html/

and later updates like this (even if some of the arguments do not apply that much to JFS):

  http://ext2.SourceForge.net/2005-ols/paper-html/node6.html

But I think preallocation in the context of a buddy/extent based allocator and free list manager makes even more sense.

shaggy> but I am concerned that leaving preallocated, but
shaggy> unused, blocks between actual file data

But the unused bits would not necessarily be left there, because they would disappear on close if any are left (that is, if the preallocation was overestimated, which for things like files written by 'tar' or 'gcc -c' would be impossible or rather rare), unless one marked the file as ''persistently preallocated'', in which case they would stay, e.g. for logs or virtual volume images ala DM/LVM2.

shaggy> will result in more fragmentation, or just wasted space
shaggy> on the disk.

As to fragmentation, I suspect less, because of the particular aspects of a buddy system, which strongly favours keeping large blocks together so when they are deallocated they are whole related (thus splittable/coalesceable) subtrees.

The wasted space can be minuscule if it is ''shrink-wrap on close'', or irrelevant if one sets something like a ''keep preallocation'' flag.

Then there is the ''DM/LVM2'' replacement story...

[ ... ]
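
The disc arithmetic in the message above can be turned into a rough estimate of how much of the time spent reading one extent goes to actually transferring data. This is a back-of-the-envelope sketch: it assumes one seek per extent and an average rotational latency of half a rotation, and the function name is invented for the example. The figures plugged in are the pessimistic seek and rotation times and the optimistic transfer rate from the ranges quoted in the message.

```python
def transfer_efficiency(extent_mib, seek_ms, rotation_ms, rate_mib_s):
    """Fraction of the time spent on one extent that is pure data transfer.

    Assumes one seek plus, on average, half a rotation of latency per extent.
    """
    positioning_ms = seek_ms + rotation_ms / 2
    transfer_ms = extent_mib * 1000 / rate_mib_s
    return transfer_ms / (positioning_ms + transfer_ms)


if __name__ == "__main__":
    for extent_mib in (0.25, 1, 8, 64):
        eff = transfer_efficiency(extent_mib, seek_ms=10, rotation_ms=10, rate_mib_s=100)
        print(f"{extent_mib:>6} MiB extent: {eff:.0%} of the time is transfer")
```

With those numbers a 1MiB extent spends roughly 40% of its time transferring and a 64MiB extent roughly 98%, so under this model the returns from ever larger extents flatten out somewhere in the range of a few MiB to a few tens of MiB.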
|
|
Re: Some more questions, preallocation

On Mon, 2005-10-24 at 15:47 +0100, Peter Grandi wrote:
> >>> On Sun, 23 Oct 2005 22:53:02 -0500, Dave Kleikamp
> >>> <shaggy@...> said:
>
> shaggy> On Mon, 2005-10-24 at 01:06 +0100, Peter Grandi wrote:
>
> [ ... ]
>
> >> * How to see how many extents of which size have been
> >>   allocated to an inode from the command line?
> >> * How to list the free list from the command line?
>
> shaggy> There is no free list. The block map is a binary-buddy
> shaggy> tree, where the leaves contain bitmaps.
>
> Ahhh, I mentioned that the free list is a buddy tree somewhere
> else. I have done _some_ homework :-).
>
> shaggy> [ ... 'jfs_debugfs' and 'xtree' and 'dmap' ... ]
> shaggy> Hopefully, it's not too hard to navigate around.
>
> Well, I was angling for something neater than using
> 'jfs_debugfs', but perhaps I can wrap it up in some script...

Or use the jfs_debugfs source as a base for a less-interactive command.

> >> * Suppose a filesystem is empty, and 'tar' extracts to it an
> >>   8MiB file, writing 32KiB blocks. How many extents will it
> >>   span? After it has been written, what will the free list
> >>   look like?
>
> shaggy> I just tried creating a file with 'dd bs=32768
> shaggy> count=256' on a newly formatted jfs volume, and it went
> shaggy> into one extent.
>
> Yes, this is more or less what I was expecting (that is, it
> obviously first creates 32KiB extents and then coalesces them
> after writing them).

It doesn't work that way today. The blocks are actually allocated one page at a time and the extent is grown with each new allocation. Suparna Bhattacharya was working on changes to make mpage_writepages use the get_blocks function to allocate space for several pages at one time, but I'm not sure what the current status of that is.

> [ ... ]
>
> shaggy> There's no code to do that now, but we could create an
> shaggy> extent and mark it ABNR (allocated but not recorded).
> shaggy> This is a holdover from OS/2, where the default behavior
> shaggy> was to have dense files, rather than sparse ones.
>
> Uhm, instead of having ABNR one could simply have the
> convention that the whole area between file size and length be
> zero-on-demand, whether allocated or not.

That's what happens today.

> ABNR seems meant for a
> hole ''in the middle'' of a file, and the difference between
> that and a preallocated area beyond its end is that while in the
> former case the preallocated area is accessible, in the latter
> it is not, so if the former is unwritten one must flag it specially
> or actually zero it; the latter needs neither (the flag is its
> position). However a 'seek' beyond the current file end might
> require turning some of that into ABNR extents or zeroing
> them... Bah!

Currently holes can exist either in the middle or at the end of a file. If there is no physical block mapped to a logical block of a file, it is read as zeros. I only suggested the ABNR extents as a way of preallocating contiguous space for the holes, since I thought that was what you were asking for.

> shaggy> [ ... ] Currently, space is allocated one page at a
> shaggy> time. [ ... appending ... ] we generally end up with
> shaggy> large extents.
>
> Yes, but this results in rather hairy code to handle
> after-the-fact coalescing of buddy extents (I guess the JFS
> terminology is the ''backsplits'' mentioned in the comments),
> and can result in other suboptimal behaviour, more later on
> this.
>
> [ ... ]
>
> >>   - a minimum extent size? For example to ensure that no extent
> >>     smaller than 1MiB is allocated? The purpose is to ensure
> >>     that files tend to be contiguous.
>
> shaggy> Non-trivial, but it shouldn't be a show-stopper.
> shaggy> jfs_fsck would have to be taught to not flag an error if
> shaggy> blocks are allocated beyond the file's size.
>
> I believe this would be a generally good idea, and modifying
> 'jfs_fsck' for this might be pretty easy. I might try...

Yeah, it should be easy.

> shaggy> [ ... ] Would we need a mechanism to free unused
> shaggy> preallocations?
>
> For an ''unconditional'' minimum extent size that was a tunable,
> I would not handle low space conditions at all.
>
> And actually in general, for three reasons:
>
> * In any case it is a user tunable. The user can always leave the
>   1 block default unchanged.
>
> * If you have a low space condition where one (or even more than
>   one) 1MiB makes a difference on a say 40GiB filesystem, who
>   cares...
>
> * In any case several memory allocation studies (but we can sort
>   of rely on them here too) show that in most cases it is
>   pointless to handle low free memory conditions cleverly,
>   because in such cases it just delays an almost inevitable
>   overflow, and does not avoid it. Again, especially if the margin
>   is of the order of 1MiB over a 40GiB arena.
>
> Also, preallocations can be ephemeral (disappear on close) or
> persistent. The former are useful in their own right and have
> very little downside.

If preallocations are freed at file close, I'm not sure there's an advantage over the current behavior where jfs locks out other allocations in the AG until the file is closed.

> >>   - alternatively, a default minimum extent size? So that the
> >>     extents are initially allocated of that size, but can be
> >>     reduced by 'close'(2) or 'ftruncate'(2) to the actual size
> >>     of the file. [ ... ]
>
> shaggy> A jfs volume is logically divided into a number of
> shaggy> allocation groups (AGs). While a file is opened, jfs
> shaggy> will always try to put allocations to other files in a
> shaggy> separate AG. This generally works pretty well, [ ... ]
>
> Yes, this is like in BSD FFS and 'ext[23]' cylinder groups,
> where it is also used to ensure all ''groups'' have some free
> space.
>
> As you say, it works well especially to prevent interleaving on
> parallel writes. But I surmise that a preallocate logic delivers
> the same result more reliably, with a few more advantages.

I can see preallocation adding more complexity to the code.
Suppose we preallocate 1M when we first write 4K at the beginning of a file. We then seek out to say 256K and write another 4K. What do we do with the extent? Do we zero all the blocks in between, create an ABNR extent, break it up into smaller extents and leave a hole between them? > > >> - a maximum extent size? For example to ensure that no extent > >> larger than 256KiB is ever allocated? The purpose is to > >> minimize internal fragmentation by allocating only at the > >> lower levels of the buddy system. > > shaggy> I'm not sure what that would buy us. > > Well, it would prevent really large extents from happening. In a > buddy system very large allocations (wrt to the size of the > whole) cause trouble. I don't understand this. > >> I hope that the rationales are fairly clear; > > Well, perhaps a bit more is needed, so bear with me if I repeat > a bit of the obvious here, to illustrate the rest. > > A buddy system has two big problems for _memory_ allocation, > that it does not coalesce buddies that are not part of the same > branch (''buddy system'' in the language of the comments in the > JFS source) and that it can waste a lot of space in internal > fragmentation because of the power-of-2 thing. > > These problems are big for memory allocation, especially if > allocations are not very small with respect to the size of the > arena. But they are essentially irrelevant for file allocation, > and thus the use of a buddy system for JFS is fairly brilliant, > because files do not need to be wholly contiguous, unlike memory > blocks. it is only for performance reasons that they should be > mostly contiguous. > > The size of an optimally contiguous extent should be mostly > depending on the speed characteristics of the disc, which are > fairly constant. 
If one has discs with a seek time of 5-10ms, > that do a full rotation in 5-10ms, and have a transfer rate of > 50-100MiB/s that is 50-100KiB/ms, it is not that essential to > have 100MiB contiguous regions, a MiB at a time or so (that is a > few ms at a time) is pretty reasonable. > > So for example an 11MiB file should ideally be not in one 16MiB > extent, but in three ones, 8+2+1MiB (or arguably 8+4MiB), and > now I wonder what happens with JFS -- I shall try. I can see that having 3 extents is no worse, but I can't see why you would want to avoid a larger extent. > At the same time truly large extents ''tie up'' many higher > levels of the buddy tree, and too much of an AG. So instead of > having a say 1GiB file as a single extent, having it as several > 64MiB (or whatever) extents may be perfectly reasonable and give > some advantages as to performance. Even DM/LVM2 allocates > logical volumes with that kind of granularity. I still don't get the problem of tying up the buddy tree. If an extent takes up an entire AG, that's great. We've got a better chance of finding contiguous free space in other AGs. > Conversely, when allocating smaller files, having a minimum > advisory extent size of say 256KiB can help enesuring contiguity > even in massive multithreaded writes or on fragemented free lists. I guess you're not interested in efficient use of free space. > Also, in general preallocation (that is the ability of having > file length != size by semi-arbitrary amounts) can help > considerably in several special cases. > > I have in mind in particular these scenarios: > > * In many important cases the end size of a file is well known > in advance when the file is created. So one might as well > preallocate the whole file in advance. Agreed. If we know the size of the file, and that that data won't be sparse, it would be nice to allocate it in large contiguous pieces. 
> * In many important cases files are overwritten with contents of
>   much the same size as they were (e.g. recompiling a '.c' into
>   a '.o'). So one might as well, tentatively, preserve the
>   previously allocated size on a 'ftruncate', and then do it for
>   real on a 'close'.

Interesting. We'd have to be careful about leaving stale data in pages that may not be written to. They would either have to be zero-filled, or have a hole punched into the file. (Does the compiler really truncate an existing file and re-write it, or does it completely replace the .o with a new file?)

> The two above points (which would be greatly enhanced by trivial
> changes to 'libc' and some common utilities) are relevant to me
> because I care also about minimizing ''churn'' over the lifetime
> of a filesystem, not just how good it performs freshly laid out,
> where the free list is a single block to start with and remains
> totally contiguous.
>
>   Note: indeed considering that many filesystems are created
>   from a 'tar x' (resulting in a ''perfect'' layout) and then
>   updated, overwrites would help preserve the
>   initially ''perfect'' layout.
>
> I have another scenario in mind:
>
> * DM is basically a simple minded ''manual allocation of
>   extents'' filesystem, and LVM2 is basically '-o loop' over it.
>
> * Imagine a 2300GiB JFS filesystem, with a minimum extent size
>   of 1GiB and a maximum extent size of say 16GiB (never mind the
>   AG limits :->), mounted perhaps with '-o nointegrity'.

Uh, you'd be willing to lose everything if your system crashed or lost power? If not, you don't want nointegrity.

> * Such a filesystem plus '-o loop' (built on 'md' if needed)
>   looks to me like a ''for free'' LVM, and with essentially the
>   same performance, and with no need for special utilities or
>   configuration.

You're losing me here. I don't think we need a filesystem to replace lvm.

> >> [ ... ] part of that is to short circuit when possible the
> >> somewhat hairy ''hint'' related logic in 'jfs_dmap.c' and
> >> that in 'jfs_open()' for example.
>
> shaggy> I don't understand the problem with the "hint". The
> shaggy> hints are used to attempt to allocate file data near
> shaggy> the inode, or to append onto existing extents when the
> shaggy> following blocks are available.
>
> Ah yes sure, but they have a hairy logic and they have
> conceivable performance limitations.
>
> My understanding of the current logic is to allocate small
> extents (the size of a 'write'(2) I guess), use the hint to put
> them near each other and then coalesce them if possible (if the
> hints worked that well).

Ideally, the size of the extents would be the size of the write. In most cases, we are doing allocation a page at a time. If there is space immediately after the previous extent, that extent is extended to contain the new blocks, so there is no coalescing going on.

> But this is complex and has some
> limitations (which do not happen if one is doing an initial load
> of a filesystem, but otherwise may bite). It is designed to make
> the best of a bad lot, if one does not know in advance the end
> size of a file.

It is not as complex as preallocation. Even if we know in advance the size of the file, we would have to make sure that unwritten pages are zeroed, either by physically writing zero to disk, or punching holes in the extent (either a real hole, or an ABNR extent).

> Allocating larger extents to start with and then cut that down
> on 'close' or whatever seems to me it could be rather more
> successful if the free list (the buddy tree) is already a bit
> fragmented, and in any case handles several common cases where
> the end size is known in advance directly.

Yes, it could result in less fragmentation, but as I point out, it would be more complex.
>   Another way: the difference between attempting to allocate
>   directly a 1MiB extent and something like 64x16KiB ones is
>   that the first does a top-down scan of the buddy tree, the
>   second does in effect a bottom up one, trying to find a 1MiB
>   free block ''after the fact''. Even with hints, probably the
>   latter only works well if the free list is really almost
>   unfragmented. Also, there may be several 1MiB free blocks in
>   a somewhat fragmented free list, but it is random whether the
>   16KiB extents get allocated inside one, so that after-the-fact
>   coalescing them back into the original 1MiB one might not
>   work that well.

Hmm. Maybe there's a compromise. When doing allocations for file data, jfs could search the binary buddy tree for an extent of a certain size (say 1 MB), but continue to allocate as it does. That way a sequentially-written file would grow contiguously into that space.

> [ ... ]
>
> shaggy> I think preallocation may be useful in some
> shaggy> circumstances, i.e. when a file is created
> shaggy> non-sequentially,
>
> I think that apart from non-sequential or parallel writes (and
> the AG switch helps in the latter case) or when the appends have
> already happened, the free list (buddy tree) is already
> fragmented (the AG switch does not help).
>
> The general case for preallocation is made for example in this
> 'ext[23]' paper:
>
>   http://WWW.USENIX.org/events/usenix02/tech/freenix/full_papers/tso/tso_html/

This talks about the preallocation of large multimedia files, and how to tell the file system to preallocate the file. If we allow some explicit mechanism to preallocate a large file, I think we would have some options. Maybe we could implement dense files and use ABNR extents in some explicit cases. Again, if we have some way to know to begin a file where there is a lot of free space, and can lock out other allocations, we should get the desired results.
I'm less concerned about the slow-growing files that will be appended to by different processes. I suspect that fragmentation of these files is not a real big problem.

> and later updates like this (even if some of the arguments do
> not apply that much to JFS):
>
>   http://ext2.SourceForge.net/2005-ols/paper-html/node6.html

This talks of fragmentation due to concurrent allocations, and jfs does tend to avoid that particular problem.

> But I think preallocation in the context of a buddy/extent based
> allocator and free list manager makes even more sense.

I guess I'm uncomfortable preallocating all the time, since it will lead to more fragmentation. If every small file begins at a 1 MB offset, we'll have lots of free space in between these small allocations.

> shaggy> but I am concerned that leaving preallocated, but
> shaggy> unused, blocks between actual file data
>
> But the unused bits would not be necessarily left there: because
> they would disappear on close, if any are left (that is the
> preallocation was overestimated, which for things like files
> written by 'tar' or 'gcc -c' would be impossible or rather rare)
> unless one marked the file as ''persistently preallocated'', in
> which case they would be e.g. for logs or virtual volume images
> ala DM/LVM2.

The freed space would only be available for metadata, since you propose that any new file begin with a large allocation.

> shaggy> will result in more fragmentation, or just wasted space
> shaggy> on the disk.
>
> As to fragmentation, I suspect less, because of the particular
> aspects of a buddy system, which strongly favours keeping large
> blocks together so when they are deallocated they are whole
> related (thus splittable/coalesceable) subtrees. The wasted
> space can be minuscule if it is ''shrink-wrap on close'', or
> irrelevant if one sets something like a ''keep preallocation''
> flag.

shrink-wrap on close: you propose moving the data then? We'd be better off with allocation on close, rather than preallocation in this case.

> Then there is the ''DM/LVM2'' replacement story...

I don't want to go there. :^)

--
David Kleikamp
IBM Linux Technology Center

_______________________________________________
Jfs-discussion mailing list
Jfs-discussion@...
https://lists.sourceforge.net/lists/listinfo/jfs-discussion
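The extend-in-place behaviour described in this message (allocate a block at a time, and grow the previous extent whenever the next block is still free) can be illustrated with a toy model. This is only a sketch under stated assumptions: the flat boolean free map, the `write_files` helper and the "start hint" per writer are all hypothetical, and none of it is the real `jfs_dmap.c` logic. Distinct start hints stand in for JFS keeping concurrent writers in separate AGs.

```python
def alloc_block(free, hint):
    """Take the block at `hint` if it is free, else the first free block."""
    blk = hint if hint < len(free) and free[hint] else free.index(True)
    free[blk] = False
    return blk


def add_block(extents, blk):
    """Grow the last extent if `blk` directly follows it, else start a new one."""
    if extents and extents[-1][0] + extents[-1][1] == blk:
        extents[-1][1] += 1
    else:
        extents.append([blk, 1])


def write_files(nfree, start_hints, blocks_per_file):
    """Simulate writers that take turns allocating one block each.

    Returns, per writer, a list of [start, length] extents.
    """
    free = [True] * nfree
    hints = list(start_hints)
    files = [[] for _ in hints]
    for _ in range(blocks_per_file):
        for i in range(len(files)):
            blk = alloc_block(free, hints[i])
            add_block(files[i], blk)
            hints[i] = blk + 1  # next time, try to extend the same extent
    return files


if __name__ == "__main__":
    print(write_files(64, [0], 16))      # one sequential writer
    print(write_files(64, [0, 32], 8))   # two writers kept in separate regions
    print(write_files(64, [0, 0], 8))    # two writers competing for one region
```

In this model a single sequential writer, or several writers kept apart, end up with one extent each, while writers sharing a region interleave block by block and fragment badly. That is the effect the AG lock-out is meant to avoid.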
|
|
Re: Some more questions, preallocation

>>> On Wed, 26 Oct 2005 11:05:40 -0500, Dave Kleikamp
>>> <shaggy@...> said:

BTW, in this discussion if we were in the same room and with a blackboard for doing a couple of pictures I could get across my guesses/points a lot easier and quicker and with less repetition. This is such a narrow bandwidth medium... Oh well :-/.

[ ... ]

>> Yes, this is more or less what I was expecting (that is, it
>> obviously first creates 32KiB extents and then coalesces them
>> after writing them).

shaggy> It doesn't work that way today. The blocks are actually
shaggy> allocated one page at a time and the extent is grown
shaggy> with each new allocation. [ ... ]

Uh it is _really_ block-at-a-time. My previous understanding was that it was a buddy allocator with a block-scoring bitmap, but it seems it is instead really a bitmap allocator with a tree index; or perhaps not...

[ ... ]

shaggy> Currently holes can exist either in the middle or at the
shaggy> end of a file. If there is no physical block mapped to
shaggy> a logical block of a file, it is read as zeros. I only
shaggy> suggested the ABNR extents as a way of preallocating
shaggy> contiguous space for the holes, since I thought that was
shaggy> what you were asking for.

Yes, but the unwritten bit at the end does not need a special extent type, it can be part of an existing extent, because that it is unwritten-but-zero is implied in its position, which is not really the case for a hole in the middle of a file.

The scheme some people have been thinking of is to have _three_ ''file sizes'', in order of increasing (or same) value:

* max bytes written; anything beyond this reads as zeroes.
* max bytes readable; anything beyond this is not readable.
* bytes actually allocated.

For an empty but preallocated file of size N, it could be a single extent of size N. Then the three sizes would be initially like 0:0:N. Then a seek to the end would make them 0:N:N, and a write of 4KiB would make them 4096:N:N for example.

shaggy> [ ... ] Would we need a mechanism to free unused
shaggy> preallocations?

>> [ ... ] Also, preallocations can be ephemeral (disappear on
>> close) or persistent. [ ... ]

shaggy> If preallocations are freed at file close, I'm not sure
shaggy> there's an advantage over the current behavior where jfs
shaggy> locks out other allocations in the AG until the file is
shaggy> closed.

Ahhhh, I can see some advantages, e.g. if the free list is fragmented. In part because one no longer needs to scatter parallel writes across different AGs; I'd like to have higher chances of keeping ''related'' files near each other in the same AG, not just blocks of the same file near each other.

But also because if the free list is fragmented, there will be free blocks of potentially many different sizes in it, and preallocating from the beginning the final size means that the largest, or a large, contiguous block can be reserved (but I see that your compromise below would achieve this, so fine).

>> - alternatively, a default minimum extent size? So that
>>   the extents are initially allocated of that size, but
>>   can be reduced by 'close'(2) or 'ftruncate'(2) to the
>>   actual size of the file. [ ... ]

This is an alternative to whole file preallocation; just preallocate the minimum extent size whenever a write happens, around the address of that write for example.

[ ... ]

shaggy> I can see preallocation adding more complexity to the
shaggy> code. Suppose we preallocate 1M when we first write 4K
shaggy> at the beginning of a file. We then seek out to say 256K
shaggy> and write another 4K. [ ... alternatives ... ]

Well, in this case I'd just zero all the intervening blocks. It would not be, I hope :-), really that hard to do the other two things you mention (ABNR or extent splitting), but I guess that if one sets an option to preallocate I would assume that one wants dense files.
I personally reckon that a second per-block bit (written or not, not just allocated or not) might be useful, but probably too late. However conceivably whether a block is allocated is probably recorded redundantly in the fact it is part of an extent and in the bitmap, so one could do some reinterpretation. Might be hairy though...

Unless the option is for a minimum extent size, in which case one wants just chunky files (that is files with holes, where however allocated bits and unallocated bits come in much larger chunks than a single block).

>> - a maximum extent size? For example to ensure that no extent
>>   larger than 256KiB is ever allocated? [ ... ]

shaggy> I'm not sure what that would buy us.

>> Well, it would prevent really large extents from happening. In
>> a buddy system very large allocations (wrt to the size of the
>> whole) cause trouble.

shaggy> I don't understand this.

Just speculating... In general theoretical terms, Knuth in analyzing the buddy system (for RAM allocation though) says that it has good performance, which is a surprise, as it has the two problems of potentially lots of internal fragmentation because of the power-of-two issue, and of free list fragmentation because of the coalesce-only-buddies issue. The good performance happens because in his tests most of the blocks are rather small wrt to the size of the arena.

So, well, perhaps I don't know how JFS really allocates stuff, but suppose that one creates a 5GiB file, preallocated or written, and suppose there is a free 8GiB block. 3GiB might get wasted because of the power-of-two issue. Now suppose the maximum extent size allowed is 1GiB. We get 5 1GiB extents allocated, and then 1GiB+2GiB free blocks. This seems to me a better outcome than one 8GiB extent, and this is a good thing that can't be done with a RAM buddy allocator.
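The arithmetic behind these examples can be checked with a small sketch. It models only the hypothetical "strict power-of-two" buddy allocation speculated about above, not what JFS is known to do; `buddy_block`, `split_extents` and the `max_extent` parameter are illustrative names, and `max_extent` is assumed to be a power of two so that every piece stays a valid buddy size.

```python
def buddy_block(size):
    """Smallest power of two >= size (size >= 1, in any unit)."""
    block = 1
    while block < size:
        block *= 2
    return block


def split_extents(size, max_extent=None):
    """Split `size` into power-of-two extents, largest first.

    If `max_extent` is given (assumed to be a power of two),
    no extent is larger than it.
    """
    extents = []
    remaining = size
    while remaining > 0:
        # Largest power of two that still fits in what is left.
        piece = 1 << (remaining.bit_length() - 1)
        if max_extent is not None:
            piece = min(piece, max_extent)
        extents.append(piece)
        remaining -= piece
    return extents


if __name__ == "__main__":
    # 11 MiB file: one 16 MiB buddy wastes 5 MiB; 8+2+1 wastes nothing.
    print(buddy_block(11), buddy_block(11) - 11, split_extents(11))
    # 5 GiB file with a 1 GiB maximum extent: five 1 GiB extents,
    # versus one 8 GiB buddy that would leave 3 GiB unused.
    print(buddy_block(5), buddy_block(5) - 5, split_extents(5, max_extent=1))
```

Under these assumptions the splitting variant never wastes space inside an extent; the cost is more extents per file, which is the trade-off the thread is arguing about.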
There is also a bit of the BSD FFS/'ext3' logic that when writing large files they switch cylinder group every now and then, so that both small and large files in the same directory say can be (begin) nearby in disc distance... This is not per-file locality, but per-(sub)tree locality, which I think often matters too.

>> So for example an 11MiB file should ideally be not in one
>> 16MiB extent, but in three ones, 8+2+1MiB (or arguably
>> 8+4MiB), and now I wonder what happens with JFS -- I shall
>> try.

shaggy> I can see that having 3 extents is no worse, but I can't
shaggy> see why you would want to avoid a larger extent.

Again, if it is no worse, why not give the option ''just-in-case''? :-)

But more seriously for example because in this case one saves 5MiB, if the single larger extent must be 16MiB for an 11MiB file. Those 5MiB of internal fragmentation not only waste space, they make the arm travel further over unused data.

[ ... ]

shaggy> I still don't get the problem of tying up the buddy
shaggy> tree. If an extent takes up an entire AG, that's great.
shaggy> We've got a better chance of finding contiguous free
shaggy> space in other AGs.

Yes, if all you care about is single-file performance, and performance just after a fresh load. But suppose one cares also about keeping files that are logically nearby (e.g. in the same directory) nearby on the disc too, and what happens when the free list becomes fragmented. And as long as file bodies are _mostly_ contiguous, that's fine.

>> Conversely, when allocating smaller files, having a minimum
>> advisory extent size of say 256KiB can help ensuring
>> contiguity even in massive multithreaded writes or on
>> fragmented free lists.

shaggy> I guess you're not interested in efficient use of free
shaggy> space.

Well, if the user sets a minimum (fixed or default) extent size, that's a tradeoff they make knowing what they are doing.
But as remarked below, I may not have made it too clear that I would have such options for adjusting allocation/preallocation granularity default to 0 or infinity (for min/max granule), so by default allocation would be exactly as it is now.

[ ... ]

>> * In many important cases files are overwritten with contents
>>   of much the same size as they were (e.g. recompiling a '.c'
>>   into a '.o'). So one might as well, tentatively, preserve
>>   the previously allocated size on a 'ftruncate', and then do
>>   it for real on a 'close'.

shaggy> Interesting. We'd have to be careful about leaving
shaggy> stale data in pages that may not be written to. They
shaggy> would either have to be zero-filled, or have a hole
shaggy> punched into the file.

That's why ideally one has a ''max byte written so far'' high water mark, not just the ''max readable'' and ''max allocated'' ones. My expectation is that seeking around while writing is actually rather rare...

The other classic example is repeated package ('.rpm', '.deb') upgrades. Almost always the upgraded package has the same files with the same or much the same sizes, just different contents.

shaggy> (Does the compiler really truncate an existing file and
shaggy> re-write it, or does it completely replace the .o with a
shaggy> new file?)

Most such programs are stupid unfortunately. But modifying them is very easy (compiler, 'tar', 'cp', ...), and even easier and probably almost as good is to modify what 'stdio' does for example with a 'fopen(....,"w")': instead of that becoming an 'open'(2) with 'O_CREAT' and 'O_TRUNC', which deallocates the existing blocks, do it with 'O_RDWR', and then 'ftruncate' on 'fclose'(3).

One of the scandals of our modern times is that various libcs and kernels don't take advantage of the useful implicit hints in 'fopen'(3)/'open'(2) options, both as to allocation and read/write clustering and access patterns.
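As a rough user-space illustration of the overwrite-without-truncating idea, here is a minimal sketch. The helper name `rewrite_in_place` is hypothetical; it only uses standard `os` calls. A plain `fopen(..., "w")` normally opens with `O_WRONLY|O_CREAT|O_TRUNC`, which frees the old blocks immediately; the sketch opens without `O_TRUNC` and trims the file with `ftruncate` only at the end. Whether the filesystem then reuses the same on-disk blocks is up to the filesystem, and is an assumption here.

```python
import os


def rewrite_in_place(path, data):
    """Overwrite `path` with `data` without truncating it first.

    The file is opened without O_TRUNC, so existing blocks are not
    released up front; the length is fixed up just before closing.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        offset = 0
        while offset < len(data):
            # os.write may write fewer bytes than requested.
            offset += os.write(fd, data[offset:])
        # Drop any stale tail left over from a longer previous version.
        os.ftruncate(fd, len(data))
    finally:
        os.close(fd)
```

One caveat of this approach: until the final `ftruncate`, a concurrent reader can see new data followed by the old tail, so it is not a drop-in replacement wherever readers expect an empty-then-growing file.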
One could also easily modify the 'open'(2) implementation to make 'O_TRUNC' equivalent to 'O_RDWR' plus resetting the ''max readable'' watermark to zero, and then auto-truncate on 'close'(2).

Either could be done by 'LD_PRELOAD' of a suitable set of wrappers, at least initially.

[ ... ]

>> * Imagine a 2300GiB JFS filesystem, with a minimum extent
>>   size of 1GiB and a maximum extent size of say 16GiB (never
>>   mind the AG limits :->), mounted perhaps with '-o
>>   nointegrity'.

shaggy> Uh, you'd be willing to lose everything if your system
shaggy> crashed or lost power? If not, you don't want nointegrity.

Ahhhh, but all that journaling does in JFS so far is to protect _metadata_ transactions. Once the virtual volumes are created, there are no further metadata updates, except perhaps for the inode time fields.

Admittedly by the same argument there is not much point in disabling journaling for the ''pool'' JFS filesystem.

All the journaling that matters would happen _inside_ the files (virtual volumes), and I would not disable that...

>> * Such a filesystem plus '-o loop' (built on 'md' if needed)
>>   looks to me like a ''for free'' LVM, and with essentially
>>   the same performance, and with no need for special
>>   utilities or configuration.

shaggy> You're losing me here. I don't think we need a
shaggy> filesystem to replace lvm.

Yes, but if a filesystem can perform nearly as well as DM/LVM2, having the option to use it like that seems to me to be rather valuable, if only for the sake of minimizing entities.

For example, one of the major uses of DM/LVM2 is for Oracle tablespaces, for two reasons:

* Tablespaces should be ideally contiguous and low overhead, so partitions are often used (even if some people think this is not necessary).

* Many Oracle databases have hundreds of them, and one can only create so many real partitions, and managing them is a pain regardless.

It is therefore in this case that DM/LVM2 are used as a crude replacement for JFS.
Now consider: instead of creating hundreds of logical volumes with DM/LVM2, just create hundreds of ordinary preallocated files with JFS with high minimum extent size and somewhat higher maximum extent size. Quick and easy and same performance.

And I am fairly sure of this, because this article:

  http://WWW.Oracle.com/technology/oramag/webcolumns/2002/techarticles/scalzo_linux02.html

says that raw tablespaces (under DM/LVM2) are _slower_ than using 'ext3' files (and JFS does not do too bad).

My guess is that this is because these are obviously (like in most naive benchmarks) freshly loaded filesystems, and 'ext3' achieves optimal layout on freshly loaded data (and JFS almost).

So my further guess is that if the layout is good, a file system can beat DM/LVM2 at its own game, because DM/LVM2 are in effect a crude large-extent large-file filesystem.

>> My understanding of the current logic is to allocate small
>> extents (the size of a 'write'(2) I guess), use the hint to
>> put them near each other and then coalesce them if possible
>> (if the hints worked that well).

shaggy> Ideally, the size of the extents would be the size of
shaggy> the write. In most cases, we are doing allocation a page
shaggy> at a time. If there is space immediately after the
shaggy> previous extent, that extent is extended to contain the
shaggy> new blocks, so there is no coalescing going on.

Just nitpicking, but to illustrate my mental model of JFS: that «is extended» is in effect coalescing, unless the buddy system is really a fiction.

Suppose that 2GiB have been just written in the currently open extent, and that these 2GiB are all inside the first of a pair of two 2GiB buddies. Write another byte and you need to allocate the second 2GiB buddy, thus coalescing it with the first one; effectively the 2GiB+1 extent is now contained in a 4GiB buddy.

shaggy> It is not as complex as preallocation. Even if we know
shaggy> in advance the size of the file, we would have to make
shaggy> sure that unwritten pages are zeroed, either by
shaggy> physically writing zero to disk, or punching holes in
shaggy> the extent (either a real hole, or an ABNR extent).

Or use the written-readable-allocated ''sizes''. I would expect most preallocations to be for sequentially written files, so not worry much about holes in the middle.

[ ... top-down vs. bottom up allocation ... ]

shaggy> Hmm. Maybe there's a compromise. When doing allocations
shaggy> for file data, jfs could search the binary buddy tree
shaggy> for an extent of a certain size (say 1 MB), but continue
shaggy> to allocate as it does. That way a sequentially-written
shaggy> file would grow contiguously into that space.

Yes, that seems quite a good idea as it would most likely achieve the same effect with minimal code disruption. Now, this is equivalent, as the whole AG is locked, so in effect the 1MiB buddy is preallocated.

[ ... ]

>> The general case for preallocation is made for example in
>> this 'ext[23]' paper:
>> http://WWW.USENIX.org/events/usenix02/tech/freenix/full_papers/tso/tso_html/

shaggy> [ ... ] If we allow some explicit mechanism to
shaggy> preallocate a large file, I think we would have some
shaggy> options.

Yes, and there are two ways to do so, in-band and out-of-band; in-band, which is what the 'ext3' guys are thinking mostly about, may require changing APIs or adding extended attributes to files.

My ''on-the-cheap'' preference is for out-of-band options, that is either global or per-filesystem.

shaggy> Maybe we could implement dense files and use ABNR
shaggy> extents in some explicit cases. Again, if we have some
shaggy> way to know to begin a file where there is a lot of free
shaggy> space, and can lock out other allocations, we should get
shaggy> the desired results.

Yes, that sounds reasonable.
ABNR as indicated elsewhere is probably not that needed because I expect more preallocation to be sequential (no holes in the middle) or to be by large granule extents, with holes in the middle of large chunks.

[ ... ]

shaggy> I guess I'm uncomfortable preallocating all the time,
shaggy> since it will lead to more fragmentation. If every
shaggy> small file begins at a 1 MB offset, we'll have lots of
shaggy> free space in between these small allocations. [ ... ]

This is a big misunderstanding, sorry; I did say that I would like either a global or per-filesystem set of options, like:

* 'smallest-extent-size' [default 0]: no newly created extent can be smaller than this.

* 'default-extent-size' [default 0]: extents are initially created at this size, and the last one is truncated on close.

* 'largest-extent-size' [default 0 to mean infinity]: no newly created extent can be larger than this.

These could be either in '/proc/fs/jfs/' (global), or as mount options (per-filesystem).

Then ideally 'ftruncate' would also support the ''truncate to a larger size than the current one'' semantics (obeying the options above too).

>> Then there is the ''DM/LVM2'' replacement story...

shaggy> I don't want to go there. :^)

Oh no.... :-)

But again, suppose that you can create a 2300GiB JFS filesystem over an MD RAID, and within this you can efficiently create (which means, preallocated, mostly contiguous, unwritten) 50-500GiB files in say 1-10GiB extents, say for tablespaces, or virtual machine discs, to be mounted '-o loop'...

Sure, one can do that now but there are a couple of annoyances:

* To ensure best contiguity all the volumes should be created just after 'jfs_mkfs', when the free list is mostly contiguous and the buddy system wholesome. And this may achieve too much contiguity.

* The big files need to be actually _written to_ to achieve allocation.

One would either preallocate the large files, or set a minimum mandatory extent size of say 4GiB and not preallocate, and let the filesystem be allocated in 4GiB chunks.

It would sound wonderful to me... MD/JFS/'loop' would then do at least 90% of what DM/LVM2 do, for free. It would be 100% if you implemented reverse-copy-on-write (that is, snapshot) files :-).

BTW, the Oracle tablespace and VM virtual disc stories are part of my interests, and these are about DM/LVM2 replacement. But I am also interested in the upgrade-the-installed-packages and the archive-of-DVD-images for a video-on-demand server stories, in case that was not obvious. All these would rather benefit from preallocation, either whole-file or chunky-extents...
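The written / readable / allocated ''three sizes'' scheme mentioned earlier in this message can be made concrete with a toy in-memory model. This is purely illustrative: the class `PreallocatedFile` and its methods are hypothetical names, nothing like this exists in JFS, and the model takes the simplest option discussed in the thread of zero-filling any gap on a write past the high-water mark.

```python
class PreallocatedFile:
    """Toy model of a file tracking written <= readable <= allocated."""

    def __init__(self, allocated):
        self.written = 0            # high-water mark of bytes written
        self.readable = 0           # logical file size
        self.allocated = allocated  # space reserved on "disk"
        self.data = bytearray()     # holds exactly `written` bytes

    def sizes(self):
        return (self.written, self.readable, self.allocated)

    def extend(self, size):
        """Make `size` bytes readable without writing anything."""
        if size > self.allocated:
            raise ValueError("beyond the preallocated space")
        self.readable = max(self.readable, size)

    def write(self, offset, payload):
        end = offset + len(payload)
        if end > self.allocated:
            raise ValueError("beyond the preallocated space")
        if offset > len(self.data):
            # Zero the gap so stale blocks are never exposed.
            self.data.extend(bytes(offset - len(self.data)))
        self.data[offset:end] = payload
        self.written = len(self.data)
        self.readable = max(self.readable, end)

    def read(self, offset, length):
        if offset + length > self.readable:
            raise ValueError("beyond the readable size")
        stored = bytes(self.data[offset:offset + length])
        # Anything past the high-water mark reads as zeroes.
        return stored + bytes(length - len(stored))


if __name__ == "__main__":
    n = 8 * 1024 * 1024
    f = PreallocatedFile(n)
    print(f.sizes())            # 0:0:N
    f.extend(n)
    print(f.sizes())            # 0:N:N
    f.write(0, b"\x01" * 4096)
    print(f.sizes())            # 4096:N:N
```

The point of the model is that the unwritten tail needs no special extent type: reads past `written` return zeroes purely because of their position, which is the argument made above against needing ABNR extents for sequentially written files.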
|
|
Re: Some more questions, preallocation

Hi,

I am working on a digital video recorder. The system is linux based and there are 16 video sources. The video sources write data to the data disk synchronously. Once it is filled up, there is a recycle mechanism which will remove the old video files and free up new space. As you can imagine, there will be a serious external fragmentation problem as time passes.

I was told that jfs and xfs can do much better than ext3 at tackling the fragmentation problem, so I conducted a few benchmark tests and found that jfs is doing an excellent job. The findings are not strong enough to persuade my boss to change, and hence I've been reading the rationale and source code behind jfs. Here I summarize two major questions that can help me explain jfs's magic:

1. when I write the first byte or open a file, what will jfs do? cuz my finding is that the 16 channels create files of size around 32MB. They grow in size of course, but the majority fragment count or number of extents I found is only ONE...

according to ur discussion with Peter, jfs allocates one page to a file at a time, and this allocation is locked under one allocation group. the page size according to jfs_filesys.h is 4096. You said the allocation would be allocated but not recorded (ABNR), which raised two subquestions:

1a. are those ABNR blocks stored temporarily in memory? with 16 files growing towards 32MB, it is a huge memory requirement. is it really that everything is stored in memory and flushed to the disk at file close?? what is the jfs memory requirement then?

1b. since only one file is allocated in one allocation group (AG), then how many AGs are there in ur disk when it's formatted (mkfs)? and is there an upper bound for the maximum number of files which can be opened and written at the same moment in jfs??

2. jfs is so called extent-based allocation. how does jfs know the right size of extent to allocate to a fixed file? and a growing file? the stats i found show that the majority of my files ( <= 32MB ) are single fragment files (number of extents = 1). I would really like to understand the "magic" of how it can be achieved.

The findings and rationale behind them will lead us to a filesystem change. I would be very grateful if anyone can help me. Thank you very much!
|
|
Re: Some more questions, preallocationThe notion of an extant is that it can grow upto apt where it doesn't
reach another extent on disk. It is extensible as far as space permits, unlike a disk block, which has a fixed size of 4096/8192 bytes etc. That explains why you see only 1 extent having internal fragmentation. You should see some more fragmentation once the disk has been heavily used, with files added and deleted up to the maximum capacity of the disk.

If your boss doesn't like JFS, you can use other extent-based filesystems like VxFS.

regards
-kamal

On 7/10/06, Cosmo Nova <cs_mcc98@...> wrote:
>
> Hi,
>
> I am working on a digital video recorder. The system is linux based and
> there are 16 video sources. The video sources write data the data disk
> syncronously. Once it is filled up, there is a recycle mechanism which will
> remove the old video files and free up new space. As you can imagined, there
> will be serious external fragmentation problem as time passes.
> I was told that jfs and xfs can do much better than ext3, to tackle the
> fragment problem, so I conducted a few benchmark tests and found that jfs is
> doing excellent job. The findings is not strong enough to persuade my boss
> to change, and hence I've been reading the rationale and source code behind
> jfs. Here I summarized two major questions that can help me explain jfs's
> magic:
>
> 1. when I write the first byte or open a file, what will jfs do? cuz my
> findings is that, the 16 channels create files of size around 32MB. They
> grow in size of course, but majority fragment or number of extents I found
> is only ONE...
>
> according to ur disscussion with Peter, jfs allocates one page to a file at
> a time. and this allocation is locked under one allocation group. the page
> size according to jfs_filesys.h is 4096. You said the allocation would be
> allocated but not recorded (ABNR), which raised two subquestions:
> 1a. is those ABNR blocks stored temporary in memory, 16 files on grow and
> towards 32MB, it is a huge memory requirement. is it really that everyone
> stored in memory and flushed to the disk at file close?? what is the jfs
> memory requirement then?
> 1b. since only one file is allocated in one allocation group (AG), then how
> many AG is there in ur disk when it's formated (mkfs)? and is there an upper
> bound for the maximum number of files which can be opened and written at the
> same moment in jfs??
>
> 2. jfs is so called extent-based allocation. how jfs knows the right size of
> extent to allocate to a fixed file? and growing file? the stat i found shows
> that majority of my files ( <= 32MB ) are single fragment file (number of
> extent = 1). I would really like to understand the "magic" how it can be
> achieved.
>
> The findings and rationale behind will lead us to a filesystem change. I
> would be very gladful if anyone can help me. Thankyou very much!
> --
> View this message in context:
> http://www.nabble.com/Some-more-questions%2C-preallocation-tf440979.html#a5247869
> Sent from the JFS - General forum at Nabble.com.
>
>
>
> -------------------------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job
> easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
> _______________________________________________
> Jfs-discussion mailing list
> Jfs-discussion@...
> https://lists.sourceforge.net/lists/listinfo/jfs-discussion
> |
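The reply above contrasts a fixed-size disk block with an extent that keeps extending while space permits. A minimal sketch of that idea, assuming an extent is simply a (start block, length) pair: it collapses a file's block numbers, taken in logical order, into extents. The function name `to_extents` is made up for illustration and is not part of JFS.

```python
def to_extents(block_numbers):
    """Collapse a file's block numbers (in logical order) into extents.

    Each extent is a (start_block, length) tuple. A block extends the
    current extent only if it sits immediately after it on disk.
    """
    extents = []
    for block in block_numbers:
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)   # grow the last extent
        else:
            extents.append((block, 1))          # start a new extent
    return extents


# A file written contiguously is described by one extent, however long.
print(to_extents(range(100, 108)))        # [(100, 8)]
# A file whose blocks are scattered needs one extent per contiguous run.
print(to_extents([5, 6, 7, 20, 21, 9]))   # [(5, 3), (20, 2), (9, 1)]
```

This is why a 32MB file written into free contiguous space can show up with an extent count of 1.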
|
|
Re: Some more questions, preallocation
On Mon, 2006-07-10 at 01:22 -0700, Cosmo Nova wrote:
> Hi,
>
> I am working on a digital video recorder. The system is linux based and
> there are 16 video sources. The video sources write data the data disk
> syncronously. Once it is filled up, there is a recycle mechanism which will
> remove the old video files and free up new space. As you can imagined, there
> will be serious external fragmentation problem as time passes.
> I was told that jfs and xfs can do much better than ext3, to tackle the
> fragment problem, so I conducted a few benchmark tests and found that jfs is
> doing excellent job. The findings is not strong enough to persuade my boss
> to change, and hence I've been reading the rationale and source code behind
> jfs. Here I summarized two major questions that can help me explain jfs's
> magic:
>
> 1. when I write the first byte or open a file, what will jfs do? cuz my
> findings is that, the 16 channels create files of size around 32MB. They
> grow in size of course, but majority fragment or number of extents I found
> is only ONE...
>
> according to ur disscussion with Peter, jfs allocates one page to a file at
> a time. and this allocation is locked under one allocation group. the page
> size according to jfs_filesys.h is 4096. You said the allocation would be
> allocated but not recorded (ABNR),

No, jfs doesn't currently use ABNR blocks. This is something that OS/2 supported, but the linux implementation will not create ABNR extents.

> which raised two subquestions:
> 1a. is those ABNR blocks stored temporary in memory, 16 files on grow and
> towards 32MB, it is a huge memory requirement. is it really that everyone
> stored in memory and flushed to the disk at file close?? what is the jfs
> memory requirement then?

The blocks are allocated when the data is written. The data may be stored in memory for a while, but can be written to disk at any time.

> 1b. since only one file is allocated in one allocation group (AG), then how
> many AG is there in ur disk when it's formated (mkfs)?

There are typically somewhere between 65 and 128 AGs on a disk. The minimum size of an allocation group is 8K blocks, or 32 MB, so smaller volumes may contain fewer than that (possibly 1).

> and is there an upper
> bound for the maximum number of files which can be opened and written at the
> same moment in jfs??

No, the "locks" on the AG due to an open file being written to it are not absolute. If a new allocation is needed and no free blocks are available in an unlocked AG, it will find space in a locked AG, which will probably lead to fragmentation of the files being created.

> 2. jfs is so called extent-based allocation. how jfs knows the right size of
> extent to allocate to a fixed file?

The extent is initially small, and grows as long as new allocations are contiguous with the already-allocated blocks.

> and growing file?

Whenever we're allocating space at the end of a file, the allocator tries to use the blocks immediately after the last allocated block. As long as these blocks (1 block typically) are free, the existing extent is grown to include the new blocks.

> the stat i found shows
> that majority of my files ( <= 32MB ) are single fragment file (number of
> extent = 1). I would really like to understand the "magic" how it can be
> achieved.

It seems as if this "magic" works pretty well. :-) I originally came up with this idea of "locking" the AGs to avoid fragmentation because it was easier and quicker to implement than preallocation or delayed allocation.

>
> The findings and rationale behind will lead us to a filesystem change. I
> would be very gladful if anyone can help me. Thankyou very much!

I hope my answers were helpful.

Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center |
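The two mechanisms described in the reply above (growing the last extent when the next block is free, and soft "locking" of an AG while a file is being written into it) can be illustrated with a toy model. This is only a sketch under stated assumptions, not JFS code: the classes `ToyVolume`, `ToyFile` and `Extent`, the "first free block in the AG" policy, and the 4 AGs of 64 blocks are all invented for the example, and lock release at file close is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    start: int   # first block of the extent
    length: int  # number of contiguous blocks


@dataclass(eq=False)
class ToyFile:
    name: str
    extents: list = field(default_factory=list)


class ToyVolume:
    def __init__(self, num_ags, ag_blocks, use_locks=True):
        self.num_ags = num_ags
        self.ag_blocks = ag_blocks
        self.use_locks = use_locks
        self.free = [True] * (num_ags * ag_blocks)
        self.lock_owner = {}  # AG index -> file currently writing into it

    def _first_free(self, ag):
        base = ag * self.ag_blocks
        for block in range(base, base + self.ag_blocks):
            if self.free[block]:
                return block
        return None

    def append_block(self, f):
        # Step 1: try the block immediately after the file's last block.
        if f.extents:
            last = f.extents[-1]
            nxt = last.start + last.length
            if nxt < len(self.free) and self.free[nxt]:
                self.free[nxt] = False
                last.length += 1          # grow the existing extent
                return
        # Step 2: start a new extent. With locking, AGs that are unlocked
        # or held by this file are tried first; AGs locked by other files
        # come last, because the lock is only a preference.
        ags = list(range(self.num_ags))
        if self.use_locks:
            ags.sort(key=lambda ag: self.lock_owner.get(ag, f) is not f)
        for ag in ags:
            block = self._first_free(ag)
            if block is not None:
                self.free[block] = False
                f.extents.append(Extent(block, 1))
                if self.use_locks:
                    self.lock_owner.setdefault(ag, f)
                return
        raise OSError("toy volume is full")


def simulate(use_locks):
    """Four writers append 32 blocks each, one block at a time, round-robin."""
    vol = ToyVolume(num_ags=4, ag_blocks=64, use_locks=use_locks)
    files = [ToyFile(f"ch{i}") for i in range(4)]
    for _ in range(32):
        for f in files:
            vol.append_block(f)
    return [len(f.extents) for f in files]


print(simulate(use_locks=True))    # [1, 1, 1, 1]
print(simulate(use_locks=False))   # [32, 32, 32, 32]
```

In this model the interleaved writers each end up with a single extent when every file gets its own AG, and with one extent per block when they all allocate from the same place, which is the effect the "locking" is meant to avoid.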
|
|
Re: Some more questions, preallocation
Do you mean that JFS allocates a big collection of blocks (extents) larger than required? Since I was checking for external fragmentation, did I miss serious internal fragmentation? But how would it deal with growing files?
Actually we are using XFS, but we have heard from other parties that JFS can do better for our type of application, so we want to move. But we have to find a more concrete reason to justify the change... :P |
|
|
Re: Some more questions, preallocation
I have some further questions concerning the allocation groups. You mentioned that the typical number of allocation groups is 65-128, and that their minimum size is 8K blocks or 32MB. But how are the AGs physically located? Does JFS physically divide the volume into (let's say) 128 “sub-partitions”, and have new files opened and written to different “sub-partitions”?
As I mentioned, the results of the experiment I ran on files of nearly constant size were excellent. How about variable-size files? If files of variable size grow in the AGs and there are regular deletions, how would JFS deal with the potential fragmentation? Finally, you mentioned that blocks of 4KB are allocated to growing files. Is there any internal fragmentation problem with file extents? Will it allocate more space than required to avoid external fragmentation? Thanks! |
|
|
Re: Some more questions, preallocation
On Wed, 2006-07-12 at 01:13 -0700, Cosmo Nova wrote:
> I have some further questions concerning the allocation groups. You mentioned
> that the typical number of allocation group is 65-128, and the minimum size
> of them is 8K blocks or 32MB. But how are the AGs physically located? Does
> it physically mark the JFS volume into (let say) 128 "sub-partitions"? And
> let new files opened and written to different "sub-partition"?

The volume is logically partitioned into allocation groups (AGs). The primary purpose of AGs is to aid in locality. We attempt to allocate file inodes in the same AG as the parent directory, and file blocks in the same AG as the inode or other blocks already allocated to the file. Additionally, low-level locking of the allocation structures is done within the AGs when possible to allow concurrent allocations in different AGs.

Locking the AGs while open files are being written avoids fragmentation within the files, but does have an adverse effect on locality. If an AG is "locked", data for a file may be allocated from a different AG than the one in which its inode is located.

> As I mentioned, the experiment I run on nearly constant file size is
> excellent. How about variable size files? If files of variable size grow in
> the AGs and there are regular deletions, how would JFS deal with the
> potential fragmentation?

I don't have any empirical data, but I would say it depends on the size of the files. If there were a lot of small files being allocated and freed at different intervals, it may leave the free space fragmented such that new allocations would be piecemeal and you would be more likely to get fragmented files. If you were dealing with larger variable-sized files, allocations may still be more fragmented, but probably not as bad. Instead of having one large extent, you may have 3 or 4 extents which still define a large number of blocks each.

> Finally, you mentioned blocks of 4KB is allocated to growing files. Is there
> any internal fragmentation problem to file extents? Will it allocate more
> space than required to avoid external fragmentation?

JFS doesn't allocate any more blocks than are written to, so the space lost to internal fragmentation will never exceed 4K.

> Thanks!

--
David Kleikamp
IBM Linux Technology Center |
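The figures quoted in this exchange can be checked with a little arithmetic. The sketch below assumes the 4096-byte block size mentioned earlier in the thread and reads "8K blocks" as 8192 blocks; the helper names are made up for illustration.

```python
BLOCK_SIZE = 4096          # bytes per block, as quoted in the thread
MIN_AG_BLOCKS = 8 * 1024   # "8K blocks", read here as 8192 blocks


def blocks_for(size_bytes):
    """Number of whole blocks needed to hold size_bytes (ceiling division)."""
    return -(-size_bytes // BLOCK_SIZE)


def internal_waste(size_bytes):
    """Bytes allocated but unused in the file's last block."""
    return blocks_for(size_bytes) * BLOCK_SIZE - size_bytes


# Minimum AG size: 8192 blocks of 4096 bytes is 32 MiB.
min_ag_bytes = MIN_AG_BLOCKS * BLOCK_SIZE
print(min_ag_bytes // (1024 * 1024))   # 32

# If only the blocks actually written are allocated, the unused space is
# confined to the last block and stays below one block.
print(internal_waste(1))               # 4095
print(internal_waste(4096))            # 0
print(internal_waste(32 * 1024 * 1024 + 100))   # 3996
```

So for a 32MB recording the space lost to internal fragmentation is at most a few kilobytes, regardless of how many extents the file ends up with.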