Thanks Tony, your points are well taken, and I appreciate the reading suggestions. I think I will start with S/370, especially since, fortunately, everything I learn there carries forward to modern systems. Even the me of 3 days ago sounds relatively ignorant to the me of today.
Through this dialogue with you and the other people who kindly responded to my questions, I have gotten a much better sense of what Hercules does, and of how zLinux is used.
I had been reading about checkpointing and rollback, various types of correction for soft and hard memory errors, failover CPU/memory, redundancy, latches, etc., and was hoping to find an example of software-implemented hardware fault tolerance in Hercules.
I still wonder if one could patch such a software processor into Hercules. I realize now that it would happen transparently to any OS, except perhaps for checkstopping.
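To make concrete what I mean, here is a toy sketch in Python of the checkpoint, compare and rollback idea. Nothing in it is Hercules code or any real IBM interface: ToyCpu, step and checked_step are names I made up, the single "instruction" is invented, and the fault_mask argument only simulates a transient bit flip. Treat it as an illustration of the technique under those assumptions, not as a design.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyCpu:
    """Hypothetical emulated CPU state (not a Hercules structure)."""
    regs: list = field(default_factory=lambda: [0] * 4)
    pc: int = 0


def step(cpu, fault_mask=0):
    """Execute one made-up instruction: R0 = R0 + 1, then advance the PC.

    fault_mask is XORed into the result to simulate a transient bit flip.
    """
    cpu.regs[0] = (cpu.regs[0] + 1) ^ fault_mask
    cpu.pc += 1


def checked_step(cpu, faults=(), max_retries=3):
    """Run one step redundantly and commit only if both runs agree.

    The untouched `cpu` object acts as the checkpoint. Each attempt runs
    the step on two private copies; on a mismatch both copies are thrown
    away (the rollback) and the step is retried. `faults` lists the masks
    injected into the first copy on successive attempts (assumption: the
    second copy always runs fault-free). Returns the number of attempts.
    """
    faults = list(faults)
    attempts = 0
    while True:
        mask = faults[attempts] if attempts < len(faults) else 0
        attempts += 1
        a = copy.deepcopy(cpu)
        b = copy.deepcopy(cpu)
        step(a, mask)
        step(b)
        if a == b:
            # Both executions agree: commit the new state.
            cpu.regs, cpu.pc = a.regs, a.pc
            return attempts
        if attempts >= max_retries:
            # Persistent disagreement: give up, loosely like a checkstop.
            raise RuntimeError("checkstop: persistent mismatch")
```

Calling checked_step(ToyCpu(), faults=[0x4]) detects the corrupted first attempt, discards it, and commits on the retry, so the "guest" state never sees the error. Only when the mismatch persists past max_retries does anything become visible, which is the checkstop case I mentioned.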
> On 9 February 2011 14:10, hec.tor1 <hec.tor1@...> wrote:
> > I guess what I'd like to know is whether these kernel modules and tools take advantage of the processor features that have been implemented in Hercules. Will it run the OS and applications fault-tolerantly? Will there be memory checks for data integrity? Other checks for data integrity?
> Perhaps I'm not understanding your questions, but I think it should be
> clear that an emulator like Hercules does not attempt to derive fault
> tolerance in its emulated machine from underlying non-fault-tolerant
> hardware and software, such as the Intel machines it typically runs on.
> The real IBM hardware has a great deal of fault tolerance, and to
> exploit some of that requires help and support from the guest
> operating system. Hercules emulates some very small parts of those
> interfaces, but most of the fault-tolerant stuff is not architected (or
> to be more accurate, is not part of the IBM published architecture).
> So while there are all sorts of interesting public documents out there
> on how IBM's machines work internally to avoid and/or recover from
> errors, there are no interface specs for much of it.
> Again, I'm not sure what you are interested in learning about, or what
> your technical background is. Certainly it won't hurt to read the
> Principles of Operation book(s), and if you are fairly new to this,
> then I'd start with the earlier ones, i.e. S/370, and then work up.
> It's hard to plunge into a 1000 page book, but the earlier 100ish page
> ones are doable.
> You might also want to look at some of the SHARE presentations by
> people like IBM's Bob Rogers. He concentrates mostly on performance,
> but occasionally touches on reliability and recovery. A recent one
> that is still not locked up is at
> http://share.confex.com/share/115/webprogram/Handout/Session7534/How%20do%20you%20do%20what%20you%20do%20when%20youre%20a%20z10%20CPU.pdf
> and the even newer
> http://share.confex.com/share/116/webprogram/Handout/Session9063/How%20do%20you%20do%20when%20youre%20a%20z196%20CPU.pdf
> There is also a related one by David Bond, of TachyonSoft fame:
> http://www.tachyonsoft.com/s8192db.pdf
> And there have been some very good articles on much of this published
> over the years in the IBM Systems Journal, and the IBM Journal of
> Research & Development. Unfortunately IBM, in its wisdom, stopped
> allowing free public access to these journals a year or more ago, but
> you can find them at university libraries, and in some cases online.
> Tony H.