On Mon, 2005-10-24 at 15:47 +0100, Peter Grandi wrote:
> >>> On Sun, 23 Oct 2005 22:53:02 -0500, Dave Kleikamp
> >>> <shaggy@...> said:
> shaggy> On Mon, 2005-10-24 at 01:06 +0100, Peter Grandi wrote:
> [ ... ]
> >> * How to see how many extents of which size have been
> >> allocated to an inode from the command line?
> >> * How to list the free list from the command line?
> shaggy> There is no free list. The block map is a binary-buddy
> shaggy> tree, where the leaves contain bitmaps.
> Ahhh, I mentioned that the free list is a buddy tree somewhere
> else. I have done _some_ homework :-).
> shaggy> [ ... 'jfs_debugfs' and 'xtree' and 'dmap' ... ]
> shaggy> Hopefully, it's not too hard to navigate around.
> Well, I was angling for something neater than using
> 'jfs_debugfs', but perhaps I can wrap it up in some script...
Or use the jfs_debugfs source as a base for a less-interactive command.
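To make the binary-buddy idea above concrete, here is a toy Python sketch (greatly simplified, my own illustration, not the actual dmap code): leaves are per-block bitmap entries, and each interior node of the buddy tree summarizes the largest free power-of-2 run beneath it, so an allocator can walk top-down to a big-enough free extent.

```python
# Toy model of a dmap-style binary-buddy summary tree (illustrative
# only; the real jfs structures are far more involved).

NBLOCKS = 16  # toy map size; must be a power of 2

def build_summary(bitmap):
    """bitmap[i] is True when block i is allocated.  Return a dict
    mapping (level, index) -> the largest free power-of-2 run in that
    buddy subtree.  Level-0 nodes each cover one block."""
    summary = {}
    for i in range(NBLOCKS):
        summary[(0, i)] = 0 if bitmap[i] else 1
    size, level = 2, 1
    while size <= NBLOCKS:
        for i in range(NBLOCKS // size):
            left = summary[(level - 1, 2 * i)]
            right = summary[(level - 1, 2 * i + 1)]
            if left == right == size // 2:
                run = size              # both buddies wholly free: merge
            else:
                run = max(left, right)  # buddies cannot merge across branches
            summary[(level, i)] = run
        size, level = size * 2, level + 1
    return summary
```

Walking such a summary top-down is how a "find me a free run of size N" query avoids scanning the whole bitmap.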
> >> * Suppose a filesystem is empty, and 'tar' extracts to it an
> >> 8MiB file, writing 32KiB blocks. How many extents will it
> >> span? After it has been written, what will the free list
> >> look like?
> shaggy> I just tried creating a file with 'dd bs=32768
> shaggy> count=256' on a newly formatted jfs volume, and it went
> shaggy> into one extent.
> Yes, this is more or less what I was expecting (that is, it
> obviously first creates 32KiB extents and then coalesces them
> after writing them).
It doesn't work that way today. The blocks are actually allocated one
page at a time and the extent is grown with each new allocation.
Suparna Bhattacharya was working on changes to make mpage_writepages use
the get_blocks function to allocate space for several pages at one time,
but I'm not sure what the current status of that is.
> [ ... ]
> shaggy> There's no code to do that now, but we could create an
> shaggy> extent and mark it ABNR (allocated but not recorded).
> shaggy> This is a holdover from OS/2, where the default behavior
> shaggy> was to have dense files, rather than sparse ones.
> Uhm, instead of having ABNR one could simply have the
> convention that the whole area between file size and length is
> zero-on-demand, whether allocated or not.
That's what happens today.
> ABNR seems meant for a
> hole ''in the middle'' of a file, and the difference between
> that and a preallocated area beyond its end is that while in the
> former case the preallocated area is accessible, in the latter
> it is not, so if the former is unwritten one must flag it
> specially or actually zero it; the latter needs neither (the
> flag is its position). However a 'seek' beyond the current file
> end might require turning some of that into ABNR extents or
> zeroing them... Bah!
Currently holes can exist either in the middle or at the end of a file.
If there is no physical block mapped to a logical block of a file, it
is read as zeros. I only suggested the ABNR extents as a way of
preallocating contiguous space for the holes, since I thought that was
what you were asking for.
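To illustrate the hole semantics: a logical block with no physical block mapped simply reads back as zeros. A minimal model (illustrative names, not jfs internals):

```python
# Toy model of a sparse block map: unmapped logical blocks are holes
# and read back as zeros on demand.

BLOCK_SIZE = 4096

class SparseFile:
    def __init__(self):
        self.map = {}  # logical block number -> data actually written

    def write_block(self, logical, data):
        assert len(data) == BLOCK_SIZE
        self.map[logical] = data

    def read_block(self, logical):
        # a hole: no physical block is mapped, so return zeros
        return self.map.get(logical, b"\0" * BLOCK_SIZE)
```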
> shaggy> [ ... ] Currently, space is allocated one page at a
> shaggy> time. [ ... appending ... ] we generally end up with
> shaggy> large extents.
> Yes, but this results in rather hairy code to handle
> after-the-fact coalescing of buddy extents (I guess the JFS
> terminology is the ''backsplits'' mentioned in the comments),
> and can result in other suboptimal behaviour, more later on
> [ ... ]
> >> - a minimum extent size? for example to ensure that no extent
> >> smaller than 1MiB is allocated? The purpose is to ensure
> >> that files tend to be contiguous.
> shaggy> Non-trivial, but it shouldn't be a show-stopper.
> shaggy> jfs_fsck would have to be taught to not flag an error if
> shaggy> blocks are allocated beyond the file's size.
> I believe this would be a generally good idea, and modifying
> 'jfs_fsck' for this might be pretty easy. I might try...
Yeah, it should be easy.
> shaggy> [ ... ] Would we need a mechanism to free unused
> shaggy> preallocations?
> For an ''unconditional'' minimum extent size that was a tunable,
> I would not handle low space conditions at all.
> And actually in general because of two reasons:
> * In any case it is a user tunable. The user can always leave
> the 1-block default unchanged.
> * If you have a low space condition where one (or even more than
> one) 1MiB makes a difference on a say 40GiB filesystem, who
> cares?
> * In any case several memory allocation studies (and we can sort
> of rely on them here too) show that in most cases it is
> pointless to handle low free memory conditions cleverly, because
> doing so just delays an almost inevitable overflow rather than
> avoiding it. Again, especially if the margin is of the order of
> 1MiB over a 40GiB arena.
> Also, preallocations can be ephemeral (disappear on close) or
> persistent. The former are useful in their own right and have
> very little cost.
If preallocations are freed at file close, I'm not sure there's an
advantage over the current behavior where jfs locks out other
allocations in the AG until the file is closed.
> >> - alternatively, a default minimum extent size? So that the
> >> extents are initially allocated of that size, but can be
> >> reduced by 'close'(2) or 'ftruncate'(2) to the actual size
> >> of the file. [ ... ]
> shaggy> A jfs volume is logically divided into a number of
> shaggy> allocation groups (AGs). While a file is opened, jfs
> shaggy> will always try to put allocations to other files in a
> shaggy> separate AG. This generally works pretty well, [ ... ]
> Yes, this is like the BSD FFS and 'ext' cylinder groups,
> where it is also used to ensure all ''groups'' have some free
> space.
> As you say, it works well especially to prevent interleaving on
> parallel writes. But I surmise that a preallocation logic delivers
> the same result more reliably, and a few more advantages.
I can see preallocation adding more complexity to the code. Suppose we
preallocate 1M when we first write 4K at the beginning of a file. We
then seek out to say 256K and write another 4K. What do we do with the
extent? Do we zero all the blocks in between, create an ABNR extent,
break it up into smaller extents and leave a hole between them?
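To make the last of those options concrete, here is a toy sketch (my own illustration, not existing code) of breaking the preallocated extent up around the sparse write, so the unwritten gap becomes an ordinary hole and its blocks go back to the free list:

```python
# Toy sketch: split a preallocated extent around a sparse write,
# leaving a hole between the written pieces.

def split_for_sparse_write(extent, written_before, write_off, write_len):
    """extent: (logical_start, length) preallocated run, of which the
    first written_before blocks already hold data.  A new write lands
    at block write_off for write_len blocks.  Return the extents kept;
    everything else in the old extent is freed and becomes a hole."""
    start, length = extent
    assert start + written_before <= write_off <= start + length - write_len
    kept = []
    if written_before:
        kept.append((start, written_before))  # data already written
    kept.append((write_off, write_len))       # the new sparse write
    return kept
```

For the 1M/4K example above (256 blocks, one written, then one written at block 64), this yields two small extents with a hole between them.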
> >> - a maximum extent size? For example to ensure that no extent
> >> larger than 256KiB is ever allocated? The purpose is to
> >> minimize internal fragmentation by allocating only at the
> >> lower levels of the buddy system.
> shaggy> I'm not sure what that would buy us.
> Well, it would prevent really large extents from happening. In a
> buddy system very large allocations (with respect to the size
> of the whole) cause trouble.
I don't understand this.
> >> I hope that the rationales are fairly clear;
> Well, perhaps a bit more is needed, so bear with me if I repeat
> a bit of the obvious here, to illustrate the rest.
> A buddy system has two big problems for _memory_ allocation,
> that it does not coalesce buddies that are not part of the same
> branch (''buddy system'' in the language of the comments in the
> JFS source) and that it can waste a lot of space in internal
> fragmentation because of the power-of-2 thing.
> These problems are big for memory allocation, especially if
> allocations are not very small with respect to the size of the
> arena. But they are essentially irrelevant for file allocation,
> and thus the use of a buddy system for JFS is fairly brilliant,
> because files do not need to be wholly contiguous, unlike memory
> blocks. It is only for performance reasons that they should be
> mostly contiguous.
> The size of an optimally contiguous extent should depend mostly
> on the speed characteristics of the disc, which are
> fairly constant. If one has discs with a seek time of 5-10ms,
> that do a full rotation in 5-10ms, and have a transfer rate of
> 50-100MiB/s that is 50-100KiB/ms, it is not that essential to
> have 100MiB contiguous regions, a MiB at a time or so (that is a
> few ms at a time) is pretty reasonable.
> So for example an 11MiB file should ideally be not in one 16MiB
> extent, but in three of 8+2+1MiB (or arguably 8+4MiB), and
> now I wonder what happens with JFS -- I shall try.
I can see that having 3 extents is no worse, but I can't see why you
would want to avoid a larger extent.
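For reference, the 8+2+1MiB split described above is just the greedy binary decomposition of the file size into buddy-sized extents; a toy sketch with an optional size cap (my own illustration, not jfs code):

```python
# Greedy decomposition of a size into power-of-2 ("buddy") extent
# sizes, largest first, with an optional maximum extent size.

def buddy_extents(size_mib, max_extent_mib=None):
    """Split size_mib into power-of-2 extent sizes, largest first.
    max_extent_mib, if given, must itself be a power of 2."""
    extents, bit = [], 1
    while bit * 2 <= size_mib:
        bit *= 2                      # largest power of 2 <= size
    while size_mib:
        while bit > size_mib:
            bit //= 2
        if max_extent_mib and bit > max_extent_mib:
            bit = max_extent_mib      # cap very large extents
        extents.append(bit)
        size_mib -= bit
    return extents
```

So an 11MiB file decomposes as [8, 2, 1], and with a 4MiB cap as [4, 4, 2, 1].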
> At the same time truly large extents ''tie up'' many higher
> levels of the buddy tree, and too much of an AG. So instead of
> having a say 1GiB file as a single extent, having it as several
> 64MiB (or whatever) extents may be perfectly reasonable and give
> some advantages as to performance. Even DM/LVM2 allocates
> logical volumes with that kind of granularity.
I still don't get the problem of tying up the buddy tree. If an extent
takes up an entire AG, that's great. We've got a better chance of
finding contiguous free space in other AGs.
> Conversely, when allocating smaller files, having a minimum
> advisory extent size of say 256KiB can help ensure contiguity
> even in massive multithreaded writes or on fragmented free lists.
I guess you're not interested in efficient use of free space.
> Also, in general preallocation (that is, the ability to have
> file length != size by semi-arbitrary amounts) can help
> considerably in several special cases.
> I have in mind in particular these scenarios:
> * In many important cases the end size of a file is well known
> in advance when the file is created. So one might as well
> preallocate the whole file in advance.
Agreed. If we know the size of the file, and that that data won't be
sparse, it would be nice to allocate it in large contiguous pieces.
> * In many important cases files are overwritten with contents of
> much the same size as they were (e.g. recompiling a '.c' into
> a '.o'). So one might as well, tentatively, preserve the
> previously allocated size on a 'ftruncate', and then do it for
> real on a 'close'.
Interesting. We'd have to be careful about leaving stale data in pages
that may not be written to. They would either have to be zero-filled,
or have a hole punched into the file. (Does the compiler really
truncate an existing file and re-write it, or does it completely replace
the .o with a new file?)
> The two above points (which would be greatly enhanced by trivial
> changes to 'libc' and some common utilities) are relevant to me
> because I care also about minimizing ''churn'' over the lifetime
> of a filesystem, not just how good it performs freshly laid out,
> where the free list is a single block to start with and remains
> totally contiguous.
> Note: indeed considering that many filesystems are created
> from a 'tar x' (resulting in a ''perfect'' layout) and then
> updated, overwrites would help preserve the
> initially ''perfect'' layout.
> I have another scenario in mind:
> * DM is basically a simple minded ''manual allocation of
> extents'' filesystem, and LVM2 is basically '-o loop' over it.
> * Imagine a 2300GiB JFS filesystem, with a minimum extent size
> of 1GiB and a maximum extent size of say 16GiB (never mind the
> AG limits :->), mounted perhaps with '-o nointegrity'.
Uh, you'd be willing to lose everything if your system crashed or lost
power? If not, you don't want nointegrity.
> * Such a filesystem plus '-o loop' (built on 'md' if needed)
> looks to me like a ''for free'' LVM, and with essentially the
> same performance, and with no need for special utilities or
You're losing me here. I don't think we need a filesystem to
replace DM/LVM2.
> >> [ ... ] part of that is to short circuit when possible the
> >> somewhat hairy ''hint'' related logic in 'jfs_dmap.c' and
> >> that in 'jfs_open()' for example.
> shaggy> I don't understand the problem with the "hint". The
> shaggy> hints are used to attempt to allocated file data near
> shaggy> the inode, or to append onto existing extents when the
> shaggy> following blocks are available.
> Ah yes sure, but they have a hairy logic and they have
> conceivable performance limitations.
> My understanding of the current logic is to allocate small
> extents (the size of a 'write'(2) I guess), use the hint to put
> them near each other and then coalesce them if possible (if the
> hints worked that well).
Ideally, the size of the extents would be the size of the write. In
most cases, we are doing allocation a page at a time. If there is space
immediately after the previous extent, that extent is extended to
contain the new blocks, so there is no coalescing going on.
> But this is complex and has some
> limitations (which do not happen if one is doing an initial load
> of a filesystem, but otherwise may bite). It is designed to make
> the best of a bad lot, if one does not know in advance the end
> size of a file.
It is not as complex as preallocation. Even if we know in advance the
size of the file, we would have to make sure that unwritten pages are
zeroed, either by physically writing zero to disk, or punching holes in
the extent (either a real hole, or an ABNR extent).
> Allocating larger extents to start with and then cutting them
> down on 'close' or whatever seems to me like it could be rather
> more successful if the free list (the buddy tree) is already a
> bit fragmented, and in any case it directly handles several
> common cases where the end size is known in advance.
Yes, it could result in less fragmentation, but as I point out, it would
be more complex.
> Another way: the difference between attempting to allocate
> directly a 1MiB extent and something like 64x16KiB ones is
> that the first does a top-down scan of the buddy tree, the
> second does in effect a bottom up one, trying to find a 1MiB
> free block ''after the fact''. Even with hints, probably the
> latter only works well if the free list is really almost
> unfragmented. Also, there may be several 1MiB free blocks in
> a somewhat fragmented free list, but it is random whether the
> 16KiB extents get allocated inside one, so that after-the-fact
> coalescing them back into the original 1MiB one might not
> work that well.
Hmm. Maybe there's a compromise. When doing allocations for file data,
jfs could search the binary buddy tree for an extent of a certain size
(say 1 MB), but continue to allocate as it does. That way a
sequentially-written file would grow contiguously into that space.
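Roughly, that compromise would look like this toy sketch (hypothetical helpers, not actual jfs code): find a wholly-free, aligned region of the target size up front, then keep allocating within it one page at a time as the file grows.

```python
# Toy sketch of the compromise: reserve a free aligned region, then
# grow the file sequentially inside it.

REGION = 256  # target region size in blocks, e.g. 1MiB of 4KiB blocks

def reserve_region(free, region=REGION):
    """free: set of free block numbers.  Return the start of the first
    wholly-free, region-aligned run, or None if there is none."""
    top = max(free, default=-1) + 1
    for start in range(0, top, region):
        if all(b in free for b in range(start, start + region)):
            return start
    return None

def grow_in_region(free, start, nblocks):
    """Allocate nblocks sequentially from the reserved region, as the
    page-at-a-time writeback path would."""
    got = []
    for b in range(start, start + nblocks):
        free.discard(b)
        got.append(b)
    return got
```

The file still grows one extent extension at a time, but the search has already steered it into space big enough for the whole thing.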
> [ ... ]
> shaggy> I think preallocation may be useful in some
> shaggy> circumstances, i.e. when a file is created
> shaggy> non-sequentially,
> I think that apart from non-sequential or parallel writes (and
> the AG switch helps in the latter case), preallocation helps
> when the appends have already happened and the free list (buddy
> tree) is already fragmented (there the AG switch does not help).
> The general case for preallocation is made for example in this
> 'ext' paper:
This talks about the preallocation of large multimedia files, and how to
tell the file system to preallocate the file. If we allow some explicit
mechanism to preallocate a large file, I think we would have some
options. Maybe we could implement dense files and use ABNR extents in
some explicit cases. Again, if we have some way to know to begin a file
where there is a lot of free space, and can lock out other allocations,
we should get the desired results.
I'm less concerned about the slow-growing files that will be appended to
by different processes. I suspect that fragmentation of these files is
not a real big problem.
This talks of fragmentation due to concurrent allocations, and jfs does
tend to avoid that particular problem.
> But I think preallocation in the context of a buddy/extent based
> allocator and free list manager makes even more sense.
I guess I'm uncomfortable preallocating all the time, since it will lead
to more fragmentation. If every small file begins at a 1 MB offset,
we'll have lots of free space in between these small allocations.
> shaggy> but I am concerned that leaving preallocated, but
> shaggy> unused, blocks between actual file data
> But the unused bits would not necessarily be left there: they
> would disappear on close if any are left (that is, if the
> preallocation was overestimated, which for things like files
> written by 'tar' or 'gcc -c' would be impossible or rather
> rare), unless one marked the file as ''persistently
> preallocated'', in which case they would be kept, e.g. for logs
> or virtual volume images a la DM/LVM2.
The freed space would only be available for metadata, since you propose
that any new file begin with a large allocation.
> shaggy> will result in more fragmentation, or just wasted space
> shaggy> on the disk.
> As to fragmentation, I suspect less, because of the particular
> aspects of a buddy system, which strongly favours keeping large
> blocks together so that when they are deallocated they form
> whole related (thus splittable/coalesceable) subtrees. The
> wasted space can be minuscule if there is ''shrink-wrap on
> close'', or irrelevant if one sets something like a ''keep
> preallocation'' flag.
shrink-wrap on close: you propose moving the data then? We'd be better
off with allocation on close, rather than preallocation in this case.
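For concreteness: if shrink-wrap only meant trimming the unused tail back to the free list (no data movement), it might look like this toy sketch (my own illustration, not a proposed interface):

```python
# Toy sketch of "shrink-wrap on close": the written prefix stays where
# it is; only the unused tail of the preallocated extent is freed.

def shrink_wrap(extent, blocks_used):
    """extent: (start, length) preallocated run.  Return the trimmed
    extent plus the tail extent to return to the free list (None if
    the whole extent was used)."""
    start, length = extent
    assert 0 < blocks_used <= length
    if blocks_used == length:
        return extent, None
    tail = (start + blocks_used, length - blocks_used)
    return (start, blocks_used), tail
```

Whether that beats simply allocating on close, as suggested above, depends on how fragmented the free list has become in the meantime.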
> Then there is the ''DM/LVM2'' replacement story...
I don't want to go there. :^)
IBM Linux Technology Center