ext2/3: document conditions when reliable operation is possibleNot all block devices are suitable for all filesystems. In fact, some block devices are so broken that reliable operation is pretty much impossible. Document stuff ext2/ext3 needs for reliable operation. Signed-off-by: Pavel Machek <pavel@...> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..9c3d729 --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,47 @@ +Linux block-backed filesystems can only work correctly when several +conditions are met in the block layer and below (disks, flash +cards). Some of them are obvious ("data on media should not change +randomly"), some are less so. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly, because success +on fsync was already returned when data hit the journal. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails. + +Sector writes are atomic (ATOMIC-SECTORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either whole sector is correctly written or nothing is written during +powerfail. + + Unfortuantely, none of the cheap USB/SD flash cards I seen do + behave like this, and are unsuitable for all linux filesystems + I know. + + An inherent problem with using flash as a normal block + device is that the flash erase size is bigger than + most filesystem sector sizes. So when you request a + write, it may erase and rewrite the next 64k, 128k, or + even a couple megabytes on the really _big_ ones. + + If you lose power in the middle of that, filesystem + won't notice that data in the "sectors" _around_ the + one your were trying to write to got trashed. 
+ + Because RAM tends to fail faster than rest of system during + powerfail, special hw killing DMA transfers may be neccessary; + otherwise, disks may write garbage during powerfail. + Not sure how common that problem is on generic PC machines. + + Note that atomic write is very hard to guarantee for RAID-4/5/6, + because it needs to write both changed data, and parity, to + different disks. + + + diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 4333e83..b09aa4c 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they have to be 8 character filenames, even then we are fairly close to running out of unique filenames. +Requirements +============ + +Ext3 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed + +* sector writes are atomic + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* write caching is disabled. ext2 does not know how to issue barriers + as of 2.6.28. hdparm -W0 disables it on SATA disks. + Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie. It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g. 
a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). +========== Check Documentation/filesystems/ext3.txt if you want to read more about ext3 and journaling. diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 9dd2a3b..02a9bd5 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -188,6 +200,27 @@ mke2fs: create a ext3 partition with the -j flag. debugfs: ext2 and ext3 file system debugger. ext2online: online (mounted) ext2 and ext3 filesystem resizer +Requirements +============ + +Ext3 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed + +* sector writes are atomic + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* either write caching is disabled, or hw can do barriers and they are enabled. + + (Note that barriers are disabled by default, use "barrier=1" + mount option after making sure hw can support them). + + hdparm -I reports disk features. If you have "Native + Command Queueing" is the feature you are looking for. 
References ========== -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@... More majordomo info at http://vger.kernel.org/majordomo-info.html |
|
|
Re: ext2/3: document conditions when reliable operation is possible

Hi,

2009/3/12 Pavel Machek <pavel@...>:
> diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> index 4333e83..b09aa4c 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
> have to be 8 character filenames, even then we are fairly close to
> running out of unique filenames.
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely

Shouldn't this be "Ext2"?

All the best,
Jochen
--
http://seehuhn.de/
|
|
Re: ext2/3: document conditions when reliable operation is possible

On Thursday 12 March 2009 04:21:14 Pavel Machek wrote:
> Not all block devices are suitable for all filesystems. In fact, some > block devices are so broken that reliable operation is pretty much > impossible. Document stuff ext2/ext3 needs for reliable operation. > > Signed-off-by: Pavel Machek <pavel@...> > > diff --git a/Documentation/filesystems/expectations.txt > b/Documentation/filesystems/expectations.txt new file mode 100644 > index 0000000..9c3d729 > --- /dev/null > +++ b/Documentation/filesystems/expectations.txt > @@ -0,0 +1,47 @@ > +Linux block-backed filesystems can only work correctly when several > +conditions are met in the block layer and below (disks, flash > +cards). Some of them are obvious ("data on media should not change > +randomly"), some are less so. > + > +Write errors not allowed (NO-WRITE-ERRORS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Writes to media never fail. Even if disk returns error condition > +during write, filesystems can't handle that correctly, because success > +on fsync was already returned when data hit the journal. > + > + Fortunately writes failing are very uncommon on traditional > + spinning disks, as they have spare sectors they use when write > + fails. I vaguely recall that the behavior of when a write error _does_ occur is to remount the filesystem read only? (Is this VFS or per-fs?) Is there any kind of hotplug event associated with this? I'm aware write errors shouldn't happen, and by the time they do it's too late to gracefully handle them, and all we can do is fail. So how do we fail? > +Sector writes are atomic (ATOMIC-SECTORS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Either whole sector is correctly written or nothing is written during > +powerfail. > + > + Unfortuantely, none of the cheap USB/SD flash cards I seen do I've seen > + behave like this, and are unsuitable for all linux filesystems "are thus unsuitable", perhaps? (Too pretentious? :) > + I know. 
> + > + An inherent problem with using flash as a normal block > + device is that the flash erase size is bigger than > + most filesystem sector sizes. So when you request a > + write, it may erase and rewrite the next 64k, 128k, or > + even a couple megabytes on the really _big_ ones. Somebody corrected me, it's not "the next" it's "the surrounding". (Writes aren't always cleanly at the start of an erase block, so critical data _before_ what you touch is endangered too.) > + If you lose power in the middle of that, filesystem > + won't notice that data in the "sectors" _around_ the > + one your were trying to write to got trashed. > + > + Because RAM tends to fail faster than rest of system during > + powerfail, special hw killing DMA transfers may be neccessary; Necessary > + otherwise, disks may write garbage during powerfail. > + Not sure how common that problem is on generic PC machines. > + > + Note that atomic write is very hard to guarantee for RAID-4/5/6, > + because it needs to write both changed data, and parity, to > + different disks. These days instead of "atomic" it's better to think in terms of "barriers". Requesting a flush blocks until all the data written _before_ that point has made it to disk. This wait may be arbitrarily long on a busy system with lots of disk transactions happening in parallel (perhaps because Firefox decided to garbage collect and is spending the next 30 seconds swapping itself back in to do so). > + > + > diff --git a/Documentation/filesystems/ext2.txt > b/Documentation/filesystems/ext2.txt index 4333e83..b09aa4c 100644 > --- a/Documentation/filesystems/ext2.txt > +++ b/Documentation/filesystems/ext2.txt > @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory > entries, so they have to be 8 character filenames, even then we are fairly > close to running out of unique filenames. > > +Requirements > +============ > + > +Ext3 expects disk/storage subsystem to behave sanely. 
On sanely > +behaving disk subsystem, data that have been successfully synced will > +stay on the disk. Sane means: This paragraph talks about ext3... > +* write errors not allowed > + > +* sector writes are atomic > + > +(see expectations.txt; note that most/all linux block-based > +filesystems have similar expectations) > + > +* write caching is disabled. ext2 does not know how to issue barriers > + as of 2.6.28. hdparm -W0 disables it on SATA disks. And here we're talking about ext2. Does neither one know about write barriers, or does this just apply to ext2? (What about ext4?) Also I remember a historical problem that not all disks honor write barriers, because actual data integrity makes for horrible benchmark numbers. Dunno how current that is with SATA, Alan Cox would probably know. Rob -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@... More majordomo info at http://vger.kernel.org/majordomo-info.html |
|
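Editor's sketch of the "surrounding, not next" point above. It assumes aligned erase blocks, 512-byte sectors, and a device that does no remapping; as the follow-ups note, real cards with wear leveling may endanger unrelated blocks instead. The function name and constants are illustrative only.

```python
SECTOR_SIZE = 512  # bytes; assumed logical sector size


def sectors_sharing_erase_block(sector, erase_block_size):
    """Return the range of sectors living in the same erase block as `sector`.

    Assumes aligned erase blocks and no remapping by the device.
    """
    sectors_per_block = erase_block_size // SECTOR_SIZE
    first = (sector // sectors_per_block) * sectors_per_block
    return range(first, first + sectors_per_block)


# Writing sector 1000 on a device with 128k erase blocks.
at_risk = sectors_sharing_erase_block(1000, 128 * 1024)
print(at_risk.start, at_risk.stop - 1)
```

Under these assumptions the endangered range starts well before sector 1000, which is why data written earlier, and never touched by the current write, can still be lost.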
|
Re: ext2/3: document conditions when reliable operation is possible

Hi!
> > +Write errors not allowed (NO-WRITE-ERRORS) > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > + > > +Writes to media never fail. Even if disk returns error condition > > +during write, filesystems can't handle that correctly, because success > > +on fsync was already returned when data hit the journal. > > + > > + Fortunately writes failing are very uncommon on traditional > > + spinning disks, as they have spare sectors they use when write > > + fails. > > I vaguely recall that the behavior of when a write error _does_ occur is to > remount the filesystem read only? (Is this VFS or per-fs?) Per-fs. > Is there any kind of hotplug event associated with this? I don't think so. > I'm aware write errors shouldn't happen, and by the time they do it's too late > to gracefully handle them, and all we can do is fail. So how do we > fail? Well, even remount-ro may be too late, IIRC. > > + Unfortuantely, none of the cheap USB/SD flash cards I seen do > > I've seen > > > + behave like this, and are unsuitable for all linux filesystems > > "are thus unsuitable", perhaps? (Too pretentious? :) ACK, thanks. > > + I know. > > + > > + An inherent problem with using flash as a normal block > > + device is that the flash erase size is bigger than > > + most filesystem sector sizes. So when you request a > > + write, it may erase and rewrite the next 64k, 128k, or > > + even a couple megabytes on the really _big_ ones. > > Somebody corrected me, it's not "the next" it's "the surrounding". Its "some" ... due to wear leveling logic. > (Writes aren't always cleanly at the start of an erase block, so critical data > _before_ what you touch is endangered too.) Well, flashes do remap, so it is actually "random blocks". > > + otherwise, disks may write garbage during powerfail. > > + Not sure how common that problem is on generic PC machines. 
> > + > > + Note that atomic write is very hard to guarantee for RAID-4/5/6, > > + because it needs to write both changed data, and parity, to > > + different disks. > > These days instead of "atomic" it's better to think in terms of > "barriers". This is not about barriers (that should be different topic). Atomic write means that either whole sector is written, or nothing at all is written. Because raid5 needs to update both master data and parity at the same time, I don't think it can guarantee this during powerfail. > > +Requirements > > +* write errors not allowed > > + > > +* sector writes are atomic > > + > > +(see expectations.txt; note that most/all linux block-based > > +filesystems have similar expectations) > > + > > +* write caching is disabled. ext2 does not know how to issue barriers > > + as of 2.6.28. hdparm -W0 disables it on SATA disks. > > And here we're talking about ext2. Does neither one know about write > barriers, or does this just apply to ext2? (What about ext4?) This document is about ext2. Ext3 can support barriers in 2.6.28. Someone else needs to write ext4 docs :-). > Also I remember a historical problem that not all disks honor write barriers, > because actual data integrity makes for horrible benchmark numbers. Dunno how > current that is with SATA, Alan Cox would probably know. Sounds like broken disk, then. We should blacklist those. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@... More majordomo info at http://vger.kernel.org/majordomo-info.html |
|
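Editor's sketch of why RAID-5 struggles with atomic writes, as argued above: data and parity live on different disks, so a power failure can land between the two writes. This is a toy model with in-memory byte strings standing in for disks, not how the md driver is implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A three-disk stripe: two data blocks and one parity block.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks([d0, d1])

# Consistent stripe: d1 can be rebuilt from d0 and the parity.
assert xor_blocks([d0, parity]) == d1

# Update d0, then "lose power" before the matching parity write lands.
d0 = b"CCCC"
stale_parity = parity

# If the disk holding d1 fails now, reconstruction yields garbage,
# even though d1 itself was never written to.
rebuilt_d1 = xor_blocks([d0, stale_parity])
print(rebuilt_d1 == d1)
```

The damaged block here is one the filesystem never touched, which is the same "collateral damage" shape as the flash case.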
|
Re: ext2/3: document conditions when reliable operation is possible

Updated version here.
On Thu 2009-03-12 14:13:03, Rob Landley wrote: > On Thursday 12 March 2009 04:21:14 Pavel Machek wrote: > > Not all block devices are suitable for all filesystems. In fact, some > > block devices are so broken that reliable operation is pretty much > > impossible. Document stuff ext2/ext3 needs for reliable operation. > > > > Signed-off-by: Pavel Machek <pavel@...> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..710d119 --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,47 @@ +Linux block-backed filesystems can only work correctly when several +conditions are met in the block layer and below (disks, flash +cards). Some of them are obvious ("data on media should not change +randomly"), some are less so. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly, because success +on fsync was already returned when data hit the journal. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails. + +Sector writes are atomic (ATOMIC-SECTORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either whole sector is correctly written or nothing is written during +powerfail. + + Unfortunately, none of the cheap USB/SD flash cards I've seen + do behave like this, and are thus unsuitable for all Linux + filesystems I know. + + An inherent problem with using flash as a normal block + device is that the flash erase size is bigger than + most filesystem sector sizes. So when you request a + write, it may erase and rewrite some 64k, 128k, or + even a couple megabytes on the really _big_ ones. 
+ + If you lose power in the middle of that, filesystem + won't notice that data in the "sectors" _around_ the + one your were trying to write to got trashed. + + Because RAM tends to fail faster than rest of system during + powerfail, special hw killing DMA transfers may be necessary; + otherwise, disks may write garbage during powerfail. + Not sure how common that problem is on generic PC machines. + + Note that atomic write is very hard to guarantee for RAID-4/5/6, + because it needs to write both changed data, and parity, to + different disks. UPS for RAID array should help. + + + diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 4333e83..41fd2ec 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they have to be 8 character filenames, even then we are fairly close to running out of unique filenames. +Requirements +============ + +Ext2 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed + +* sector writes are atomic + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* write caching is disabled. ext2 does not know how to issue barriers + as of 2.6.28. hdparm -W0 disables it on SATA disks. + Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie. It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. 
This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g. a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). +========== Check Documentation/filesystems/ext3.txt if you want to read more about ext3 and journaling. diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 9dd2a3b..02a9bd5 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -188,6 +200,27 @@ mke2fs: create a ext3 partition with the -j flag. debugfs: ext2 and ext3 file system debugger. ext2online: online (mounted) ext2 and ext3 filesystem resizer +Requirements +============ + +Ext3 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed + +* sector writes are atomic + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* either write caching is disabled, or hw can do barriers and they are enabled. + + (Note that barriers are disabled by default, use "barrier=1" + mount option after making sure hw can support them). + + hdparm -I reports disk features. If you have "Native + Command Queueing" is the feature you are looking for. 
References ========== -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@... More majordomo info at http://vger.kernel.org/majordomo-info.html |
|
|
Re: ext2/3: document conditions when reliable operation is possible

On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> Updated version here.
>
> On Thu 2009-03-12 14:13:03, Rob Landley wrote:
> > On Thursday 12 March 2009 04:21:14 Pavel Machek wrote:
> > > Not all block devices are suitable for all filesystems. In fact, some
> > > block devices are so broken that reliable operation is pretty much
> > > impossible. Document stuff ext2/ext3 needs for reliable operation.

Some of what is here are bugs, and some are legitimate long-term
interfaces (for example, the question of losing I/O errors when two
processes are writing to the same file, or to a directory entry, and
errors aren't, or in some cases can't be, reflected back to userspace).

I'm a little concerned that some of this reads a bit too much like a
rant (and I know Pavel was very frustrated when he tried to use a
flash card with a sucky flash card socket) and it will get used the
wrong way by apologists, because it mixes areas where "we suck, we
should do better", which are bug reports, and "Posix or the
underlying block device layer makes it hard", and simply states them
as fundamental design requirements, when that's probably not true.

There's a lot of work that we could do to make I/O errors get better
reflected to userspace by fsync(). So stating things as bald
requirements goes a little too far, IMHO. We can surely do better.

> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..710d119
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly, because success
> +on fsync was already returned when data hit the journal.

The last half of this sentence, "because success on fsync was already
returned when data hit the journal", obviously doesn't apply to all
filesystems, since some filesystems, like ext2, don't journal data.
Even for ext3, it only applies in the case of data=journal mode.

There are other issues here, such as fsync() only reporting an I/O
problem to one caller, and in some cases I/O errors aren't propagated
up the storage stack. The latter is clearly just a bug that should be
fixed; the former is more of an interface limitation. But you don't
talk about that in this section, and I think it would be good to have
a more extended discussion about I/O errors when writing data blocks,
and I/O errors writing metadata blocks, etc.

> +
> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.

This requirement is not quite the same as what you discuss below.

> +
> + Unfortunately, none of the cheap USB/SD flash cards I've seen
> + do behave like this, and are thus unsuitable for all Linux
> + filesystems I know.
> +
> + An inherent problem with using flash as a normal block
> + device is that the flash erase size is bigger than
> + most filesystem sector sizes. So when you request a
> + write, it may erase and rewrite some 64k, 128k, or
> + even a couple megabytes on the really _big_ ones.
> +
> + If you lose power in the middle of that, filesystem
> + won't notice that data in the "sectors" _around_ the
> + one your were trying to write to got trashed.

The characteristic you describe here is not an issue about whether
the whole sector is either written or nothing happens to the data ---
but rather, or at least in addition to that, there is also the issue
that when there is a flash card failure --- particularly one caused
by a sucky flash card reader design causing the SD card to disconnect
from the laptop in the middle of a write --- there may be "collateral
damage"; that is, in addition to corrupting the sector being written,
adjacent sectors might also end up getting lost as an unfortunate
side effect.

So there are actually two desirable properties for a storage system to
have; one is "don't damage the old data on a failed write"; and the
other is "don't cause collateral damage to adjacent sectors on a
failed write".

> + Because RAM tends to fail faster than rest of system during
> + powerfail, special hw killing DMA transfers may be necessary;
> + otherwise, disks may write garbage during powerfail.
> + Not sure how common that problem is on generic PC machines.

This problem is still relatively common, from what I can tell. And
ext3 handles this surprisingly well, at least in the catastrophic case
of garbage getting written into the inode table, since the journal
replay often will "repair" the garbage that was written into the
filesystem metadata blocks. It won't do a bit of good for the data
blocks, of course (unless you are using data=journal mode). But this
means that, in fact, ext3 is better at surviving the first problem
(powerfail while writing can damage old data on a failed write); and
fortunately, hard drives generally don't cause collateral damage on a
failed write. Of course, there are some spectacular exceptions to this
rule --- a physical shock which causes the head to slam into a surface
moving at 7200rpm can throw a lot of debris into the hard drive
enclosure, causing loss to adjacent sectors.

- Ted
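Editor's sketch of the userspace side of the fsync() discussion above: the least an application can do is check the result of both the write and the fsync. The helper name write_durably is made up for illustration. Note the caveat from the message above: a successful return does not prove that no I/O error happened, since an error may have been reported to another caller or lost lower in the storage stack.

```python
import os
import tempfile


def write_durably(path, data):
    """Write `data` to `path`; return True only if write and fsync both succeed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the data to the device; an I/O error
        # that is reported to this caller surfaces here as OSError.
        os.fsync(fd)
        return True
    except OSError as err:
        print(f"write or fsync failed: {err}")
        return False
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.dat")
    ok = write_durably(target, b"hello")
    with open(target, "rb") as f:
        stored = f.read()
```

What the application should do after a failure (retry, rewrite elsewhere, give up) is exactly the open question in this thread; the sketch only shows where the error would become visible.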
|
|
Re: ext2/3: document conditions when reliable operation is possible

On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
> Hi!
>
> > > + Fortunately writes failing are very uncommon on traditional
> > > + spinning disks, as they have spare sectors they use when write
> > > + fails.
> >
> > I vaguely recall that the behavior of when a write error _does_ occur is
> > to remount the filesystem read only? (Is this VFS or per-fs?)
>
> Per-fs.

Might be nice to note that in the doc.

> > Is there any kind of hotplug event associated with this?
>
> I don't think so.

There probably should be, but that's a separate issue.

> > I'm aware write errors shouldn't happen, and by the time they do it's too
> > late to gracefully handle them, and all we can do is fail. So how do we
> > fail?
>
> Well, even remount-ro may be too late, IIRC.

Care to elaborate? (When a filesystem is mounted RO, I'm not sure what
happens to the pages that have already been dirtied...)

> > (Writes aren't always cleanly at the start of an erase block, so critical
> > data _before_ what you touch is endangered too.)
>
> Well, flashes do remap, so it is actually "random blocks".

Fun.

When "please do not turn off your playstation until game save completes"
honestly seems like the best solution for making the technology reliable,
something is wrong with the technology.

I think I'll stick with rotating disks for now, thanks.

> > > + otherwise, disks may write garbage during powerfail.
> > > + Not sure how common that problem is on generic PC machines.
> > > +
> > > + Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > > + because it needs to write both changed data, and parity, to
> > > + different disks.
> >
> > These days instead of "atomic" it's better to think in terms of
> > "barriers".
>
> This is not about barriers (that should be different topic). Atomic
> write means that either whole sector is written, or nothing at all is
> written. Because raid5 needs to update both master data and parity at
> the same time, I don't think it can guarantee this during powerfail.

Good point, but I thought that's what journaling was for?

I'm aware that any flash filesystem _must_ be journaled in order to work
sanely, and must be able to view the underlying erase granularity down to
the bare metal, through any remapping the hardware's doing. Possibly what's
really needed is a "flash is weird" section, since flash filesystems can't
be mounted on arbitrary block devices.

Although an "-O erase_size=128" option so they _could_ would be nice.
There's "mtdram" which seems to be the only remaining use for ram disks,
but why there isn't an "mtdwrap" that works with arbitrary underlying block
devices, I have no idea. (Layering it on top of a loopback device would be
most useful.)

> > > +Requirements
> > > +* write errors not allowed
> > > +
> > > +* sector writes are atomic
> > > +
> > > +(see expectations.txt; note that most/all linux block-based
> > > +filesystems have similar expectations)
> > > +
> > > +* write caching is disabled. ext2 does not know how to issue barriers
> > > + as of 2.6.28. hdparm -W0 disables it on SATA disks.
> >
> > And here we're talking about ext2. Does neither one know about write
> > barriers, or does this just apply to ext2? (What about ext4?)
>
> This document is about ext2. Ext3 can support barriers in
> 2.6.28. Someone else needs to write ext4 docs :-).
>
> > Also I remember a historical problem that not all disks honor write
> > barriers, because actual data integrity makes for horrible benchmark
> > numbers. Dunno how current that is with SATA, Alan Cox would probably
> > know.
>
> Sounds like broken disk, then. We should blacklist those.

It wasn't just one brand of disk cheating like that, and you'd have to ask
him (or maybe Jens Axboe or somebody) whether the problem is still current.
I've been off in embedded-land for a few years now...

Rob
|
|
Re: ext2/3: document conditions when reliable operation is possible

On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> + Unfortunately, none of the cheap USB/SD flash cards I've seen
> + do behave like this, and are thus unsuitable for all Linux
> + filesystems I know.

When you say Linux filesystems do you mean "filesystems originally
designed on Linux" or do you mean "filesystems that Linux supports"?
Additionally, whatever the answer, people are going to need help
answering the "which is the least bad?" question, and saying what's not
good without offering alternatives is only half helpful... People need
to put SOMETHING on these cheap (and not quite so cheap) devices... The
last recommendation I heard was that until btrfs/logfs/nilfs arrive
people are best off sticking with FAT -
http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that
should be mentioned?

> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> + (Note that barriers are disabled by default, use "barrier=1"
> + mount option after making sure hw can support them).
> +
> + hdparm -I reports disk features. If you have "Native
> + Command Queueing" is the feature you are looking for.

The document makes it sound like nearly everything bar battery backed
hardware RAIDed SCSI disks (with perfect firmware) is bad - is this
the intent?

--
Sitsofe | http://sucs.org/~sits/
|
|
Re: ext2/3: document conditions when reliable operation is possible

On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pavel@...> wrote:
<snip>
> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> + Unfortuantely, none of the cheap USB/SD flash cards I seen do
> + behave like this, and are unsuitable for all linux filesystems
> + I know.
> +
> + An inherent problem with using flash as a normal block
> + device is that the flash erase size is bigger than
> + most filesystem sector sizes. So when you request a
> + write, it may erase and rewrite the next 64k, 128k, or
> + even a couple megabytes on the really _big_ ones.
> +
> + If you lose power in the middle of that, filesystem
> + won't notice that data in the "sectors" _around_ the
> + one your were trying to write to got trashed.

I had *assumed* that SSDs worked like:

1) write request comes in
2) new unused erase block area marked to hold the new data
3) updated data written to the previously unused erase block
4) mapping updated to replace the old erase block with the new one

If it were done that way, a failure in the middle would just leave the SSD with the old data in it.

If it is not done that way, then I can see your issue. (I love the potential performance of SSDs, but I'm beginning to hate the implementations and spec writing.)

Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
|
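For anyone who wants to poke at the four-step scheme Greg describes, here is a minimal Python sketch of it. It is only a toy model under stated assumptions: the class name `RemappingFlash`, the `fail_before_commit` flag and the block counts are all made up for illustration, and nothing here claims to match what any real SSD firmware does. It just shows why an interrupted write leaves the old data readable when the mapping update is the last step.

```python
class RemappingFlash:
    """Toy model (hypothetical): logical block -> physical erase block map."""

    def __init__(self, n_physical):
        self.physical = [None] * n_physical   # contents of each erase block
        self.mapping = {}                     # logical block -> physical index
        self.free = list(range(n_physical))   # erase blocks not in use

    def write(self, logical, data, fail_before_commit=False):
        new = self.free.pop()                 # step 2: pick an unused erase block
        self.physical[new] = data             # step 3: write the new data there
        if fail_before_commit:
            # Simulated power loss: the mapping was never touched, so the
            # old data is still what a reader sees. (This toy simply leaks
            # the half-used block; a real device would have to reclaim it.)
            return
        old = self.mapping.get(logical)
        self.mapping[logical] = new           # step 4: switch the mapping
        if old is not None:
            self.free.append(old)             # old erase block can be reused

    def read(self, logical):
        return self.physical[self.mapping[logical]]


dev = RemappingFlash(4)
dev.write(0, b"old")
dev.write(0, b"new", fail_before_commit=True)   # "power fails" mid-write
survived = dev.read(0)                          # still the old contents
dev.write(0, b"new")                            # a completed write
updated = dev.read(0)
```

The whole guarantee hangs on step 4 being a single atomic update; if the device rewrites erase blocks in place instead, none of this applies.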
|
Re: ext2/3: document conditions when reliable operation is possible

On Monday 16 March 2009 14:40:57 Sitsofe Wheeler wrote:
> On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> > + Unfortunately, none of the cheap USB/SD flash cards I've seen
> > + do behave like this, and are thus unsuitable for all Linux
> > + filesystems I know.
>
> When you say Linux filesystems do you mean "filesystems originally
> designed on Linux" or do you mean "filesystems that Linux supports"?
> Additionally whatever the answer, people are going to need help
> answering the "which is the least bad?" question and saying what's not
> good without offering alternatives is only half helpful... People need
> to put SOMETHING on these cheap (and not quite so cheap) devices... The
> last recommendation I heard was that until btrfs/logfs/nilfs arrive
> people are best off sticking with FAT -
> http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that
> should be mentioned?

Actually, the best filesystem for USB flash devices is probably UDF. (Yes, the DVD filesystem turns out to be writeable if you put it on a writeable media. The ISO spec requires write support, so any OS that supports DVDs also supports this.)

The reasons for this are:

A) It's the only filesystem other than FAT that's supported out of the box by windows, mac, _and_ Linux for hotpluggable media.

B) It doesn't have the horrible limitations of FAT (such as a max filesize of 2 gigabytes).

C) Microsoft doesn't claim to own it, and thus hasn't sued anybody over patents on it.

However, when it comes to cutting the power on a mounted filesystem (either by yanking the device or powering off the machine) without losing your data, without warning, they all suck horribly.

If you yank a USB flash disk in the middle of a write, and the device has decided to wipe a 2 megabyte erase sector that's behind a layer of wear levelling and thus consists of a series of random sectors scattered all over the disk, you're screwed no matter what filesystem you use. You know the vinyl "record scratch" sound? Imagine that, on a digital level. Bad Things Happen to the hardware, cannot compensate in software.

> > +* either write caching is disabled, or hw can do barriers and they are enabled.
> > +
> > + (Note that barriers are disabled by default, use "barrier=1"
> > + mount option after making sure hw can support them).
> > +
> > + hdparm -I reports disk features. If you have "Native
> > + Command Queueing" is the feature you are looking for.
>
> The document makes it sound like nearly everything bar battery backed
> hardware RAIDed SCSI disks (with perfect firmware) is bad - is this
> the intent?

SCSI disks? They still make those?

Everything fails, it's just a question of how. Rotational media combined with journaling at least fails in fairly understandable ways, so ext3 on sata is reasonable.

Flash gets into trouble when it presents the _interface_ of rotational media (a USB block device with normal 512 byte read/write sectors, which never wear out) which doesn't match what the hardware's actually doing (erase block sizes of up to several megabytes at a time, hidden behind a block remapping layer for wear leveling).

For devices that have built in flash that DON'T pretend to be a conventional block device, but instead expose their flash erase granularity and let the OS do the wear levelling itself, we have special flash filesystems that can be reasonably reliable. It's just that ext3 isn't one of them, jffs2 and ubifs and logfs are. The problem with these flash filesystems is they ONLY work on flash, if you want to mount them on something other than flash you need something like a loopback interface to make a normal block device pretend to be flash. (We've got a ramdisk driver called "mtdram" that does this, but nobody's bothered to write a generic wrapper for a normal block device you can wrap over the loopback driver.)

Unfortunately, when it comes to USB flash (the most common type), the USB standard defines a way for a USB device to provide a normal block disk interface as if it was rotational media. It does NOT provide a way to expose the flash erase granularity, or a way for the operating system to disable any built-in wear levelling (which is needed because windows doesn't _do_ wear levelling, and thus burns out the administrative sectors of the disk really fast while the rest of the disk is still fine unless the hardware wear-levels for it).

So every USB flash disk pretends to be a normal disk, which it isn't, and Linux can't _disable_ this emulation. Which brings us back to UDF as the least sucky alternative. (Although the UDF tools kind of suck. If you reformat a FAT disk as UDF with mkudffs, it'll still be autodetected as FAT because it won't overwrite the FAT root directory. You have to blank the first 64k by hand with dd. Sad, isn't it?)

Rob
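The "blank the first 64k by hand" step Rob mentions is normally done with dd; as a rough illustration of what that amounts to, here is a small Python sketch. The function name `blank_leading_bytes` and the 64 KiB default are assumptions taken from the wording above, not from any tool. It is destructive by design, so try it on a scratch image file, never on a device you care about.

```python
import os

def blank_leading_bytes(path, length=64 * 1024):
    """Overwrite the first `length` bytes of `path` with zeros, in place.

    Hypothetical helper for illustration. `path` may be a disk image or a
    block device node. Note that on a regular file shorter than `length`
    this will grow the file to `length` bytes.
    """
    with open(path, "r+b") as f:      # r+b: update in place, no truncation
        f.write(b"\x00" * length)
        f.flush()
        os.fsync(f.fileno())          # ask the OS to push it to the media
```

Usage would be along the lines of `blank_leading_bytes("scratch.img")` before running the mkfs tool on the same image; whether 64k is enough to defeat every autodetector is not something this sketch can promise.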
|
|
Re: ext2/3: document conditions when reliable operation is possible

On Mon 2009-03-16 15:45:36, Greg Freemyer wrote:
> On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pavel@...> wrote:
> <snip>
> > +Sector writes are atomic (ATOMIC-SECTORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > + Unfortuantely, none of the cheap USB/SD flash cards I seen do
> > + behave like this, and are unsuitable for all linux filesystems
> > + I know.
> > +
> > + An inherent problem with using flash as a normal block
> > + device is that the flash erase size is bigger than
> > + most filesystem sector sizes. So when you request a
> > + write, it may erase and rewrite the next 64k, 128k, or
> > + even a couple megabytes on the really _big_ ones.
> > +
> > + If you lose power in the middle of that, filesystem
> > + won't notice that data in the "sectors" _around_ the
> > + one your were trying to write to got trashed.
>
> I had *assumed* that SSDs worked like:
>
> 1) write request comes in
> 2) new unused erase block area marked to hold the new data
> 3) updated data written to the previously unused erase block
> 4) mapping updated to replace the old erase block with the new one
>
> If it were done that way, a failure in the middle would just leave the
> SSD with the old data in it.

The really expensive ones (Intel SSD) apparently work like that, but I have never seen one of those. The USB sticks and SD cards I tried behave like I described above.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
|
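To make the collateral-damage case from the patch text concrete, here is a toy Python model of a device that rewrites a whole erase block in place. Everything in it is an assumption for illustration: the name `InPlaceFlash`, the 4-sector erase block (real ones are far larger), the contiguous layout (Pavel notes elsewhere in the thread that real cards remap, so the victims are effectively random blocks), and the `power_fails_after` knob that cuts the rewrite short.

```python
SECTOR = 512
SECTORS_PER_ERASE_BLOCK = 4   # tiny on purpose; real erase blocks are 64k and up

class InPlaceFlash:
    """Toy model (hypothetical) of erase-and-rewrite-in-place flash."""

    def __init__(self, n_sectors):
        self.sectors = [b"\x00" * SECTOR for _ in range(n_sectors)]

    def write(self, index, data, power_fails_after=None):
        start = index - index % SECTORS_PER_ERASE_BLOCK
        end = start + SECTORS_PER_ERASE_BLOCK
        block = self.sectors[start:end]      # read the erase block into device RAM
        block[index - start] = data          # patch in the one sector being written
        for i in range(start, end):          # erase: every sector reads as 0xff
            self.sectors[i] = b"\xff" * SECTOR
        for n, sector in enumerate(block):   # program the sectors back one by one
            if power_fails_after is not None and n >= power_fails_after:
                return                       # power lost: the rest stays erased
            self.sectors[start + n] = sector


dev = InPlaceFlash(8)
for i in range(8):
    dev.write(i, bytes([i + 1]) * SECTOR)    # give every sector known contents

# Write sector 1, but lose power after only one sector was programmed back.
dev.write(1, b"N" * SECTOR, power_fails_after=1)
neighbour = dev.sectors[2]                   # never written by the caller
other_block = dev.sectors[4]                 # lives in a different erase block
```

After the interrupted write, `neighbour` holds erased bytes even though the filesystem only asked to write sector 1, which is exactly the damage a journal has no record of.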
|
Re: ext2/3: document conditions when reliable operation is possible

On Mon, Mar 16, 2009 at 5:43 PM, Rob Landley <rob@...> wrote:
> Flash gets into trouble when it presents the _interface_ of rotational media
> (a USB block device with normal 512 byte read/write sectors, which never wear
> out) which doesn't match what the hardware's actually doing (erase block sizes
> of up to several megabytes at a time, hidden behind a block remapping layer
> for wear leveling).
>
> For devices that have built in flash that DON'T pretend to be a conventional
> block device, but instead expose their flash erase granularity and let the OS
> do the wear levelling itself, we have special flash filesystems that can be
> reasonably reliable. It's just that ext3 isn't one of them, jffs2 and ubifs
> and logfs are. The problem with these flash filesystems is they ONLY work on
> flash, if you want to mount them on something other than flash you need
> something like a loopback interface to make a normal block device pretend to
> be flash. (We've got a ramdisk driver called "mtdram" that does this, but
> nobody's bothered to write a generic wrapper for a normal block device you can
> wrap over the loopback driver.)

The really nice SSDs actually reserve ~15-30% of their internal block-level storage and actually run their own log-structured virtual disk in hardware. From what I understand the Intel SSDs are that way. Real-time garbage collection is tricky, but if you require (for example) a max of ~80% utilization then you can provide good latency and bandwidth guarantees. There's usually something like a log-structured virtual-to-physical sector map as well. If designed properly with automatic hardware checksumming, such a system can actually provide atomic writes and barriers with virtually no impact on performance.

With firmware-level hardware knowledge and the ability to perform extremely efficient parallel reads of flash blocks, such a log-structured virtual block device can be many times more efficient than a general purpose OS running a log-structured filesystem.

The result is that for an ordinary ext3-esque filesystem with 4k blocks you can treat the SSD as though it is an atomic-write seek-less block device.

Now if only I had the spare cash to go out and buy one of the shiny Intel ones for my laptop... :-)

Cheers,
Kyle Moffett
|
|
Re: ext2/3: document conditions when reliable operation is possible

On Thu 2009-03-12 11:40:52, Jochen Voß wrote:
> Hi,
>
> 2009/3/12 Pavel Machek <pavel@...>:
> > diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> > index 4333e83..b09aa4c 100644
> > --- a/Documentation/filesystems/ext2.txt
> > +++ b/Documentation/filesystems/ext2.txt
> > @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
> > have to be 8 character filenames, even then we are fairly close to
> > running out of unique filenames.
> >
> > +Requirements
> > +============
> > +
> > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
>     ^^^^
> Shouldn't this be "Ext2"?

Thanks, fixed.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
|
|
Re: ext2/3: document conditions when reliable operation is possible

On Mon 2009-03-16 14:26:23, Rob Landley wrote:
> On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
> > Hi!
> > > > + Fortunately writes failing are very uncommon on traditional
> > > > + spinning disks, as they have spare sectors they use when write
> > > > + fails.
> > >
> > > I vaguely recall that the behavior of when a write error _does_ occur is
> > > to remount the filesystem read only? (Is this VFS or per-fs?)
> >
> > Per-fs.
>
> Might be nice to note that in the doc.

Ok, can you suggest a patch? I believe remount-ro is already documented ... somewhere :-).

> > > I'm aware write errors shouldn't happen, and by the time they do it's too
> > > late to gracefully handle them, and all we can do is fail. So how do we
> > > fail?
> >
> > Well, even remount-ro may be too late, IIRC.
>
> Care to elaborate? (When a filesystem is mounted RO, I'm not sure what
> happens to the pages that have already been dirtied...)

Well, fsync() error reporting does not really work properly, but I guess it will save you for the remount-ro case. So the data will be in the journal, but it will be impossible to replay it...

> > > (Writes aren't always cleanly at the start of an erase block, so critical
> > > data _before_ what you touch is endangered too.)
> >
> > Well, flashes do remap, so it is actually "random blocks".
>
> Fun.

Yes.

> > > > + otherwise, disks may write garbage during powerfail.
> > > > + Not sure how common that problem is on generic PC machines.
> > > > +
> > > > + Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > > > + because it needs to write both changed data, and parity, to
> > > > + different disks.
> > >
> > > These days instead of "atomic" it's better to think in terms of
> > > "barriers".
> >
> > This is not about barriers (that should be different topic). Atomic
> > write means that either whole sector is written, or nothing at all is
> > written. Because raid5 needs to update both master data and parity at
> > the same time, I don't think it can guarantee this during powerfail.
>
> Good point, but I thought that's what journaling was for?

I believe journaling operates on the assumption that "either whole sector is written, or nothing at all is written".

> I'm aware that any flash filesystem _must_ be journaled in order to work
> sanely, and must be able to view the underlying erase granularity down to the
> bare metal, through any remapping the hardware's doing. Possibly what's
> really needed is a "flash is weird" section, since flash filesystems can't be
> mounted on arbitrary block devices.
> Although an "-O erase_size=128" option so they _could_ would be nice. There's
> "mtdram" which seems to be the only remaining use for ram disks, but why there
> isn't an "mtdwrap" that works with arbitrary underlying block devices, I have
> no idea. (Layering it on top of a loopback device would be most
> useful.)

I don't think that works. Compactflash (etc) cards basically randomly remap the data, so you can't really run a flash filesystem over a compactflash/usb/SD card -- you don't know the details of the remapping.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
|
|
Re: ext2/3: document conditions when reliable operation is possible

On Mon 2009-03-16 19:40:57, Sitsofe Wheeler wrote:
> On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> > + Unfortunately, none of the cheap USB/SD flash cards I've seen
> > + do behave like this, and are thus unsuitable for all Linux
> > + filesystems I know.
>
> When you say Linux filesystems do you mean "filesystems originally
> designed on Linux" or do you mean "filesystems that Linux supports"?

"Linux filesystems I know" :-). No filesystem that Linux supports, AFAICT.

> Additionally whatever the answer, people are going to need help
> answering the "which is the least bad?" question and saying what's not
> good without offering alternatives is only half helpful... People need
> to put SOMETHING on these cheap (and not quite so cheap)
> devices... The

According to me, people should just AVOID those devices. I don't plan to point out the "least bad"; it's still bad.

> > + hdparm -I reports disk features. If you have "Native
> > + Command Queueing" is the feature you are looking for.
>
> The document makes it sound like nearly everything bar battery backed
> hardware RAIDed SCSI disks (with perfect firmware) is bad - is this
> the intent?

Battery backed RAID should be ok, as should a plain single SATA drive.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
|
|
Re: ext2/3: document conditions when reliable operation is possible

Hi!

> > > > Not all block devices are suitable for all filesystems. In fact, some
> > > > block devices are so broken that reliable operation is pretty much
> > > > impossible. Document stuff ext2/ext3 needs for reliable operation.
>
> Some of what is here are bugs, and some are legitimate long-term
> interfaces (for example, the question of losing I/O errors when two
> processes are writing to the same file, or to a directory entry, and
> errors aren't or in some cases, can't, be reflected back to
> userspace).

Well, I guess there's a thin line between error and "legitimate long-term interfaces". I still believe that fsync() is broken by design.

> I'm a little concerned that some of this reads a bit too much like a
> rant (and I know Pavel was very frustrated when he tried to use a
> flash card with a sucky flash card socket) and it will get used the

It started as a rant, obviously I'd like to get away from that and get it into a suitable format for inclusion. (Not being a native speaker does not help here.)

But I do believe that we should get this documented; many common storage subsystems are broken, and can cause data loss. We should at least tell the users.

> wrong way by apologists, because it mixes areas where "we suck, we
> should do better", which are bug reports, and "Posix or the
> underlying block device layer makes it hard", and simply states them
> as fundamental design requirements, when that's probably not true.

Well, I guess that can be refined later. Heck, I'm not able to tell which are simple bugs likely to be fixed soon, and which are fundamental issues that are unlikely to be fixed sooner than 2030. I guess it is fair to document them ASAP, and then fix those that can be fixed...

> There's a lot of work that we could do to make I/O errors get better
> reflected to userspace by fsync(). So state things as bald
> requirements I think goes a little too far IMHO. We can surely do
> better.

If the fsync() can be fixed... that would be great. But I'm not sure how easy that will be.

> > +Write errors not allowed (NO-WRITE-ERRORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Writes to media never fail. Even if disk returns error condition
> > +during write, filesystems can't handle that correctly, because success
> > +on fsync was already returned when data hit the journal.
>
> The last half of this sentence "because success on fsync was already
> returned when data hit the journal", obviously doesn't apply to all
> filesystems, since some filesystems, like ext2, don't journal data.
> Even for ext3, it only applies in the case of data=journal mode.

Ok, I removed the explanation.

> There are other issues here, such as fsync() only reports an I/O
> problem to one caller, and in some cases I/O errors aren't propagated
> up the storage stack. The latter is clearly just a bug that should be
> fixed; the former is more of an interface limitation. But you don't
> talk about in this section, and I think it would be good to have a
> more extended discussion about I/O errors when writing data blocks,
> and I/O errors writing metadata blocks, etc.

Could you write a paragraph or two?

> > +
> > +Sector writes are atomic (ATOMIC-SECTORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
>
> This requirement is not quite the same as what you discuss below.

Ok, you are right. Fixed.

> So there are actually two desirable properties for a storage system to
> have; one is "don't damage the old data on a failed write"; and the
> other is "don't cause collateral damage to adjacent sectors on a
> failed write".

Thanks, it's indeed clearer that way. I split those in two.

> > + Because RAM tends to fail faster than rest of system during
> > + powerfail, special hw killing DMA transfers may be necessary;
> > + otherwise, disks may write garbage during powerfail.
> > + Not sure how common that problem is on generic PC machines.
>
> This problem is still relatively common, from what I can tell. And
> ext3 handles this surprisingly well at least in the catastrophic case
> of garbage getting written into the inode table, since the journal
> replay often will "repair" the garbage that was written into the
...

Ok, added to ext3 specific section.

New version is attached. Feel free to help here; my goal is to get this documented, I'm not particularly attached to wording etc...

Signed-off-by: Pavel Machek <pavel@...>

Pavel

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..0de456d
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,49 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly.
+
+ Fortunately writes failing are very uncommon on traditional
+ spinning disks, as they have spare sectors they use when write
+ fails.
+
+Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
+and are thus unsuitable for all filesystems I know.
+
+ An inherent problem with using flash as a normal block device
+ is that the flash erase size is bigger than most filesystem
+ sector sizes. So when you request a write, it may erase and
+ rewrite some 64k, 128k, or even a couple megabytes on the
+ really _big_ ones.
+
+ If you lose power in the middle of that, filesystem won't
+ notice that data in the "sectors" _around_ the one your were
+ trying to write to got trashed.
+
+
+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+ Because RAM tends to fail faster than rest of system during
+ powerfail, special hw killing DMA transfers may be necessary;
+ otherwise, disks may write garbage during powerfail.
+ This may be quite common on generic PC machines.
+
+ Note that atomic write is very hard to guarantee for RAID-4/5/6,
+ because it needs to write both changed data, and parity, to
+ different disks. UPS for RAID array should help.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 2344855..ee88467 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext2 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+ (NO-COLLATERALS)
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+ as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie. It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout. In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem. This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash. If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem. If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
 
 Check Documentation/filesystems/ext3.txt if you want to read more about ext3
 and journaling.
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index e5f3833..6de8af4 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +200,45 @@ mke2fs: create a ext3 partition with the -j flag.
 debugfs: ext2 and ext3 file system debugger.
 ext2online: online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+ (Thrash may get written into sectors during powerfail. And
+ ext3 handles this surprisingly well at least in the
+ catastrophic case of garbage getting written into the inode
+ table, since the journal replay often will "repair" the
+ garbage that was written into the filesystem metadata blocks.
+ It won't do a bit of good for the data blocks, of course
+ (unless you are using data=journal mode). But this means that
+ in fact, ext3 is more resistant to suriving failures to the
+ first problem (powerfail while writing can damage old data on
+ a failed write) but fortunately, hard drives generally don't
+ cause collateral damage on a failed write.
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+ (NO-COLLATERALS)
+
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+ (Note that barriers are disabled by default, use "barrier=1"
+ mount option after making sure hw can support them).
+
+ hdparm -I reports disk features. If you have "Native
+ Command Queueing" is the feature you are looking for.
 
 References
 ==========

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
|
|
Re: ext2/3: document conditions when reliable operation is possible

Pavel Machek <pavel@...> writes:

> On Mon 2009-03-16 14:26:23, Rob Landley wrote:
>> On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
>> > > > + otherwise, disks may write garbage during powerfail.
>> > > > + Not sure how common that problem is on generic PC machines.
>> > > > +
>> > > > + Note that atomic write is very hard to guarantee for RAID-4/5/6,
>> > > > + because it needs to write both changed data, and parity, to
>> > > > + different disks.
>> > >
>> > > These days instead of "atomic" it's better to think in terms of
>> > > "barriers".

Would be nice to have barriers in md and dm.

>> > This is not about barriers (that should be different topic). Atomic
>> > write means that either whole sector is written, or nothing at all is
>> > written. Because raid5 needs to update both master data and parity at
>> > the same time, I don't think it can guarantee this during powerfail.

Actually raid5 should have no problem with a power failure during normal operations of the raid. The parity block should get marked out of sync, then the new data block should be written, then the new parity block and then the parity block should be flagged in sync.

>> Good point, but I thought that's what journaling was for?
>
> I believe journaling operates on assumption that "either whole sector
> is written, or nothing at all is written".

The real problem comes in degraded mode. In that case the data block (if present) and parity block must be written at the same time atomically. If the system crashes after writing one but before writing the other then the data block on the missing drive changes its contents. And for example with a chunk size of 1MB and 16 disks that could be 15MB away from the block you actually do change. And you can not recover that after a crash as you need both the original and changed contents of the block.

So writing one sector has the risk of corrupting another (for the FS) totally unconnected sector. No amount of journaling will help there. The raid5 would need to do journaling or use battery backed cache.

MfG
Goswin
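The degraded-mode problem Goswin describes comes down to XOR arithmetic, and can be shown in a few lines of Python. This is only a sketch under simple assumptions: a single stripe with three data blocks and one parity block, 4-byte "blocks", and made-up names (`xor`, `reconstruct_d2`); it is not how md is implemented. The disk holding `d2` is taken to be the failed one, so its contents exist only as the XOR of everything else.

```python
from functools import reduce

def xor(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One stripe: three data blocks plus parity. The disk holding d2 has failed.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor(d0, d1, d2)

def reconstruct_d2(d0_on_disk, d1_on_disk, parity_on_disk):
    # In degraded mode d2 is never read; it is recomputed from the survivors.
    return xor(d0_on_disk, d1_on_disk, parity_on_disk)

before = reconstruct_d2(d0, d1, parity)          # equals d2, as expected

# The filesystem rewrites d0 only. The array has to update d0 AND parity.
new_d0 = b"ZZZZ"
new_parity = xor(parity, d0, new_d0)             # read-modify-write of parity

# Crash after the data write but before the parity write:
torn = reconstruct_d2(new_d0, d1, parity)        # stale parity -> wrong d2

# Both writes completed:
clean = reconstruct_d2(new_d0, d1, new_parity)   # d2 is intact again
```

In the torn case the block that changes is `d2`, which the filesystem never touched, so replaying a filesystem journal that only mentions `d0` cannot put it back.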
|
|
Re: ext2/3: document conditions when reliable operation is possible

Hi!

> >> > This is not about barriers (that should be different topic). Atomic
> >> > write means that either whole sector is written, or nothing at all is
> >> > written. Because raid5 needs to update both master data and parity at
> >> > the same time, I don't think it can guarantee this during powerfail.
>
> Actually raid5 should have no problem with a power failure during
> normal operations of the raid. The parity block should get marked out
> of sync, then the new data block should be written, then the new
> parity block and then the parity block should be flagged in sync.
>
> >> Good point, but I thought that's what journaling was for?
> >
> > I believe journaling operates on assumption that "either whole sector
> > is written, or nothing at all is written".
>
> The real problem comes in degraded mode. In that case the data block
> (if present) and parity block must be written at the same time
> atomically. If the system crashes after writing one but before writing
> the other then the data block on the missing drive changes its
> contents. And for example with a chunk size of 1MB and 16 disks that
> could be 15MB away from the block you actually do change. And you can
> not recover that after a crash as you need both the original and
> changed contents of the block.
>
> So writing one sector has the risk of corrupting another (for the FS)
> totally unconnected sector. No amount of journaling will help
> there. The raid5 would need to do journaling or use battery backed
> cache.

Thanks, I updated my notes.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[patch] ext2/3: document conditions when reliable operation is possible

Running journaling filesystem such as ext3 over flashdisk or degraded
RAID array is a bad idea: journaling guarantees no longer apply and
you will get data corruption on powerfail.

We can't solve it easily, but we should certainly warn the users. I
actually lost data because I did not understand these limitations...

Signed-off-by: Pavel Machek <pavel@...>

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..80fa886
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,52 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly.
+
+        Fortunately writes failing are very uncommon on traditional
+        spinning disks, as they have spare sectors they use when write
+        fails.
+
+Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
+and are thus unsuitable for all filesystems I know.
+
+        An inherent problem with using flash as a normal block device
+        is that the flash erase size is bigger than most filesystem
+        sector sizes. So when you request a write, it may erase and
+        rewrite some 64k, 128k, or even a couple megabytes on the
+        really _big_ ones.
+
+        If you lose power in the middle of that, filesystem won't
+        notice that data in the "sectors" _around_ the one your were
+        trying to write to got trashed.
+
+        RAID-4/5/6 in degraded mode has same problem.
+
+
+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+        Because RAM tends to fail faster than rest of system during
+        powerfail, special hw killing DMA transfers may be necessary;
+        otherwise, disks may write garbage during powerfail.
+        This may be quite common on generic PC machines.
+
+        Note that atomic write is very hard to guarantee for RAID-4/5/6,
+        because it needs to write both changed data, and parity, to
+        different disks. (But it will only really show up in degraded mode).
+        UPS for RAID array should help.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f9..0a9b87f 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext2 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+  (NO-COLLATERALS)
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+  as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie. It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout. In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem. This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash. If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem. If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
 Check Documentation/filesystems/ext3.txt if you want to read more about
 ext3 and journaling.
 
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 570f9bd..2ce82a3 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -199,6 +202,47 @@
 debugfs: ext2 and ext3 file system debugger.
 ext2online: online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+        (Thrash may get written into sectors during powerfail. And
+        ext3 handles this surprisingly well at least in the
+        catastrophic case of garbage getting written into the inode
+        table, since the journal replay often will "repair" the
+        garbage that was written into the filesystem metadata blocks.
+        It won't do a bit of good for the data blocks, of course
+        (unless you are using data=journal mode). But this means that
+        in fact, ext3 is more resistant to suriving failures to the
+        first problem (powerfail while writing can damage old data on
+        a failed write) but fortunately, hard drives generally don't
+        cause collateral damage on a failed write.
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+  (NO-COLLATERALS)
+
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+        (Note that barriers are disabled by default, use "barrier=1"
+        mount option after making sure hw can support them).
+
+        hdparm -I reports disk features. If you have "Native
+        Command Queueing" is the feature you are looking for.
+
+
 References
 ==========

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
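The NO-COLLATERALS problem in the patch (a one-sector write that erases and rewrites a whole erase block) can be sketched as a toy simulation. Everything here is an assumption made for illustration: the 512-byte sector, the 128k erase block (one of the sizes the patch mentions), the function name `write_sector`, and the naive read, erase, reprogram sequence. Real flash controllers use translation layers that behave differently in detail.

```python
from typing import Optional

SECTOR = 512                      # assumed filesystem sector size
ERASE_BLOCK = 128 * 1024          # assumed flash erase block size
SECTORS_PER_BLOCK = ERASE_BLOCK // SECTOR


def write_sector(flash: bytearray, sector: int, data: bytes,
                 power_fails_after: Optional[int] = None) -> None:
    """Rewrite the whole erase block containing `sector`.

    power_fails_after: number of sectors reprogrammed before power is
    lost (None means the write completes).
    """
    start = (sector // SECTORS_PER_BLOCK) * ERASE_BLOCK
    block = bytearray(flash[start:start + ERASE_BLOCK])    # read block into RAM
    offset = (sector % SECTORS_PER_BLOCK) * SECTOR
    block[offset:offset + SECTOR] = data                   # modify one sector
    flash[start:start + ERASE_BLOCK] = b"\xff" * ERASE_BLOCK  # erase the block
    limit = SECTORS_PER_BLOCK if power_fails_after is None else power_fails_after
    for i in range(limit):                                 # reprogram sector by sector
        lo, hi = i * SECTOR, (i + 1) * SECTOR
        flash[start + lo:start + hi] = block[lo:hi]


# A completed write changes only the requested sector.
ok = bytearray(b"\x11" * ERASE_BLOCK)
write_sector(ok, 3, b"\x22" * SECTOR)

# Power lost after two sectors were reprogrammed: sector 200, which the
# filesystem never asked to write, now holds erased bytes.
torn = bytearray(b"\x11" * ERASE_BLOCK)
write_sector(torn, 3, b"\x22" * SECTOR, power_fails_after=2)
print(torn[200 * SECTOR:200 * SECTOR + 4])
```

The journal only covers the sector the filesystem asked to write, so in this model nothing tells it that sector 200 needs to be checked after the crash.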
Re: [patch] ext2/3: document conditions when reliable operation is possible

* Pavel Machek:

> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so.

You should make clear that the file lists per-file-system rules and
that some file systems can recover from some of the error conditions.

> +* don't damage the old data on a failed write (ATOMIC-WRITES)
> +
> + (Thrash may get written into sectors during powerfail. And
> + ext3 handles this surprisingly well at least in the
> + catastrophic case of garbage getting written into the inode
> + table, since the journal replay often will "repair" the
> + garbage that was written into the filesystem metadata blocks.

Isn't this by design? In other words, if the metadata doesn't survive
non-atomic writes, wouldn't it be an ext3 bug?

--
Florian Weimer <fweimer@...>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
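The "repair" effect in the quoted passage can be sketched with a toy model of journal replay. This is an illustration only, not ext3's on-disk format or the jbd code: the dictionary layout, the `committed` flag and the function name `replay` are assumptions made for the example. Complete transactions are copied into the filesystem and incomplete ones are discarded, as the ext2.txt text removed by the patch describes.

```python
from typing import Dict, List


def replay(disk: Dict[int, bytes], journal: List[dict]) -> None:
    """Copy blocks of committed transactions onto the disk; skip the rest."""
    for txn in journal:
        if not txn["committed"]:
            continue                      # incomplete transaction: discard
        for block_no, contents in txn["blocks"].items():
            disk[block_no] = contents     # overwrites whatever is on disk


# Hypothetical state after a powerfail.
disk = {
    10: b"garbage written during powerfail",   # metadata block (inode table)
    11: b"old directory block",
    500: b"garbage in a file's data block",    # data block, not journaled
}
journal = [
    {"committed": True, "blocks": {10: b"good inode table block"}},
    {"committed": False, "blocks": {11: b"half-logged directory update"}},
]

replay(disk, journal)
print(disk[10], disk[500])
```

Block 10 is restored because a committed copy exists in the journal. Block 500 stays damaged because, outside data=journal mode, data blocks are never logged.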