Sparcstation 20 - Dude, we're getting the band back together!

by Sanford Barton

Guys, I've been having a blast (sorta), refurbishing an old Sparcstation 20.

So, finally, I can afford to put together that ultimate SS20 that I
lusted after back in 1995!! But I've run into some issues that I'm not
sure if they are normal or if I need to keep digging, so here's the
story:

I built this machine in an Aurora 2 chassis (the one with the better
cooling and full-height CD-ROM).  I'm using a later rev motherboard
that has the newer SBus controller.  In the drive bay I have 1x 72GB
Seagate Cheetah (actually runs cooler than the older, smaller drives
of the era).  I have an 8MB VSIMM for the video and run Xsun in 24bpp,
and 448MB RAM. The PROM is 2.25R.

For processors I have a pair of SM71s and a pair of SM81s at my
disposal.  I also have a pair of ROSS 200MHz and a pair of 150MHz, but
I really, REALLY, would rather use the SuperSparcs in this machine
because they wipe the floor with the ROSS in general
desktop/multitasking usage.

I've loaded Solaris 9 and the latest recommended patch bundle as well
as several other patches for gtk apps, etc.

So here's the problem.  With the ROSS processors the above system is
rock solid, everything works as expected.

With either the 2 x SM71s or 2 x SM81s, the system runs fine until I
start to stress it with a few modern apps like Seamonkey,
Thunderbird, Pidgin, etc.  These newer apps put a lot of load on the
CPUs in the form of context switches and cache traffic (which the
SuperSparcs excel at).  But after about 10 minutes of usage like this,
the machine locks up solid and must be power cycled.  An easy way to
reproduce this is to scroll a large webpage up and down continuously
for about 20 seconds.  That really spikes the CPU usage and always
leads to a lockup.  Like I said, with the ROSS processors, I could do
that all day long with no issues.

For testing purposes, the top of the case is removed and supplemental
cooling is applied to the CPU modules to produce a best-case baseline
for heat management.  I'm very familiar with the heat challenges in the
Aurora cases, especially with the faster processors, but going by a
rough finger test, the SuperSparcs are running a tad cooler than the
HyperSparcs, even after I reconditioned the heat sinks with new thermal
compound.  I have also applied new thermal compound to the memory and
SBus controller heat sinks.

Things I think I can rule out:

-  Motherboard (tested with spare, no improvement)
-  Power supply (tested with spare,  no improvement)
-  Memory (tested with spares, no improvement)

So where I'm at is basically 4 possibilities:

1.)  I'm taxing the SuperSparcs in such a way with the newer software
that they will never be stable in this system.
2.)  I have a batch of bad processor modules.
3.)  The CG14 bits are being pushed beyond their limits with the
SuperSparcs, especially now that it's not using much acceleration for
2D (speculation).  If so, why not the same behavior with the ROSS
processors?  Perhaps they are not able to feed/stress the
memory/graphics controller the way the SuperSparcs are?
4.)  Not a heat or component stress issue at all, but some sort of
multitasking, OS, or cache/memory controller bug.


Any ideas, experiences, hope, discouragement, anything??

Thanks!
_______________________________________________
rescue list - http://www.sunhelp.org/mailman/listinfo/rescue

Re: Sparcstation 20 - Dude, we're getting the band back together!

by velociraptor

On Aug 26, 2009, at 9:33 AM, Sanford Barton wrote:

> Guys, I've been having a blast (sorta), refurbishing an old  
> Sparcstation 20.

Nice, old school system.

> 1.)  I'm taxing the SuperSparcs in such a way with the newer software
> that they will never be stable in this system.
> 2.)  I have a batch of bad processor modules.
> 3.)  The CG14 bits are being pushed beyond their limits with the
> SuperSparcs, especially now that it's not using much acceleration for
> 2D (speculation).  If so, why not the same behavior with the ROSS
> processors?  Perhaps they are not able to feed/stress the
> memory/graphics controller the way the SuperSparcs are?
> 4.)  Not a heat or component stress issue at all, but some sort of
> multitasking, OS, or cache/memory controller bug.
>
>
> Any ideas, experiences, hope, discouragement, anything??

I'd suggest testing headless.  E.g., use a remote X session running
your browser and try your lock-up trick.  This will at least narrow it  
down to OS bugs vs graphics sub-system.

If it is stable w/out the graphics, I'd grab Sun VTS (if possible--it  
requires a contract to download) and see what is in there for stress-
testing the graphic subsystem.  Maybe you have something marginal in  
there?  Maybe try to track down a later HW rev of your graphics card?

Good luck--
=Nadine=

Re: Sparcstation 20 - Dude, we're getting the band back together!

by Lionel Peterson-2

I believe VTS was free and included on install media, at least on
more mature releases of Solaris...

Has that changed?

Lionel

On Sep 5, 2009, at 12:09 PM, Nadine Miller <velociraptor@...>  
wrote:

> I'd grab Sun VTS (if possible--it requires a contract to download)

Re: Sparcstation 20 - Dude, we're getting the band back together!

by der Mouse-3

>> Sparcstation 20.

> Maybe try to track down a later HW rev of your graphics card?

The message said cg14, so this isn't really very possible.  There is no
graphics card as such; much of the cg14 is on the motherboard - and
what isn't is physically conflated with the VSIMM and I would guess
doesn't have all that many revs.  (Of course, I could well be wrong,
especially about the "I would guess...".)

My guess would be 2 (bad processors) or 4 (software issues, eg with
cache).  I've had cache issues with the SS20 myself; I have pseudo-disk
drivers for NetBSD that simply don't work on the 20 unless I add code
to "manually" force everything out of the cache at critical points.
Yes, this indicates bugs somewhere - but the same code "works fine" on
other sparc32s, so it indicates that the SS20 is relatively demanding
in regard of caches and such.  And as for the Ross difference, I've
seen it said that the Ross processors have tiny caches but more muscle;
if true, perhaps the same underlying problem exists but everything is
being pushed out of the cache before the trouble has a chance to
manifest?
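der Mouse's eviction hypothesis can be illustrated with a toy model. The sketch below is hypothetical and is not how SuperSPARC or hyperSPARC coherence actually works: it models a fully associative LRU cache that (buggily) never sees writes made by another agent. A stale value is only observed if the line is still resident when it is re-read, so a smaller cache, or an explicit flush like the one der Mouse adds to his NetBSD drivers, hides the bug. `ToyCache` and `sees_stale_data` are illustrative names.

```python
from collections import OrderedDict


class ToyCache:
    """Hypothetical fully associative LRU cache holding `capacity` lines.

    It deliberately never snoops writes made by other agents, so a
    resident line can go stale: that is the bug being modelled.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> cached value

    def read(self, addr, memory):
        if addr in self.lines:
            self.lines.move_to_end(addr)    # hit: refresh LRU position
            return self.lines[addr]         # may be stale!
        value = memory[addr]                # miss: fetch the current value
        self.lines[addr] = value
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used line
        return value

    def flush(self):
        self.lines.clear()                  # "manually" force everything out


def sees_stale_data(capacity, unrelated_lines, flush=False):
    """True if a re-read of address 0 returns an older value than memory holds."""
    memory = {addr: 0 for addr in range(unrelated_lines + 1)}
    cache = ToyCache(capacity)
    cache.read(0, memory)           # CPU caches line 0
    memory[0] = 1                   # another agent writes it; no invalidate
    for addr in range(1, unrelated_lines + 1):
        cache.read(addr, memory)    # unrelated traffic in between
    if flush:
        cache.flush()
    return cache.read(0, memory) != memory[0]


print(sees_stale_data(capacity=4, unrelated_lines=8))               # False
print(sees_stale_data(capacity=64, unrelated_lines=8))              # True
print(sees_stale_data(capacity=64, unrelated_lines=8, flush=True))  # False
```

The same missing invalidation is present in all three runs; only how long the line survives decides whether anyone notices.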

/~\ The ASCII  Mouse
\ / Ribbon Campaign
 X  Against HTML mouse@...
/ \ Email!     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

Re: Sparcstation 20 - Dude, we're getting the band back together!

by Sanford Barton

Guys, thank you for the help.  I posted this to sunhelp a couple of
weeks ago and have been diligently working ever since.  This is the
current status:

I wanted to see if the issue persisted with Solaris 8, so I loaded a
fresh Solaris 8 2/02 + latest recommended. I had a hell of a time
getting the X11 install stable. There is a MAJOR Xsun memory leak with
Solaris 8 on a SS20 using the CG14/SX graphics. Basically, even with
all the available/latest Xsun/kernel patches, the Xsun process will
nibble your RAM by about 2-3MB every few seconds until it starts
swapping, and then it's only a matter of time before you're hosed. The
solution turned out to be using the
/usr/openwin/server/module/ddxSUNWcg14.so.1 from my Solaris 9 install
CD. That fixed the memory leak.
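A leak like the one described (2-3MB every few seconds) is easy to confirm numerically. The sketch below is an illustration, not something from the thread: it samples a process's resident size with `ps -o rss= -p PID` (we assume the local `ps` accepts this and reports kilobytes, as Solaris and Linux `ps` do), fits a least-squares slope, and estimates how long the remaining free memory lasts. All function names are hypothetical.

```python
import subprocess
import time


def sample_rss_kb(pid):
    """Resident set size of `pid` in KB, read via `ps -o rss= -p PID` (assumed)."""
    out = subprocess.run(["ps", "-o", "rss=", "-p", str(pid)],
                         capture_output=True, text=True, check=True).stdout
    return int(out.strip())


def growth_kb_per_s(samples):
    """Least-squares slope of a list of (seconds, rss_kb) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def seconds_until_exhausted(free_kb, rate_kb_per_s):
    """How long `free_kb` lasts at a constant leak rate (inf if not growing)."""
    return float("inf") if rate_kb_per_s <= 0 else free_kb / rate_kb_per_s


def watch(pid, interval_s=5.0, count=12):
    """Sample `pid` `count` times, `interval_s` apart; return the growth in KB/s."""
    samples = []
    start = time.monotonic()
    for i in range(count):
        if i:
            time.sleep(interval_s)
        samples.append((time.monotonic() - start, sample_rss_kb(pid)))
    return growth_kb_per_s(samples)
```

For example, at an assumed 2.5MB per 5 seconds (512KB/s), `seconds_until_exhausted(200 * 1024, 512)` returns 400.0: 200MB of free RAM is gone in under seven minutes.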

So, everything is stable now using both SM81s on Solaris 8. Running all
the programs I want - Thunderbird, SeaMonkey (make sure you turn off
java/javascript), XChat, XMMS streaming a 56k mp3 stream, a few xterms
for a few hours now and no problems at all.

Amazed I can run all that and still be 85% idle on 2 85MHz processors
and STILL have 245M out of 448M free physical RAM.

So unless something else pops up, looks like mystery solved. Maybe one
day I can try Solaris 9 again, but I don't think I will unless I can
find a later HW release than the one I have.


On Sat, Sep 5, 2009 at 8:04 PM, der Mouse <mouse@...> wrote:

>>> Sparcstation 20.
>
>> Maybe try to track down a later HW rev of your graphics card?
>
> The message said cg14, so this isn't really very possible.  There is no
> graphics card as such; much of the cg14 is on the motherboard - and
> what isn't is physically conflated with the VSIMM and I would guess
> doesn't have all that many revs.  (Of course, I could well be wrong,
> especially about the "I would guess...".)
>
> My guess would be 2 (bad processors) or 4 (software issues, eg with
> cache).  I've had cache issues with the SS20 myself; I have pseudo-disk
> drivers for NetBSD that simply don't work on the 20 unless I add code
> to "manually" force everything out of the cache at critical points.
> Yes, this indicates bugs somewhere - but the same code "works fine" on
> other sparc32s, so it indicates that the SS20 is relatively demanding
> in regard of caches and such.  And as for the Ross difference, I've
> seen it said that the Ross processors have tiny caches but more muscle;
> if true, perhaps the same underlying problem exists but everything is
> being pushed out of the cache before the trouble has a chance to
> manifest?

Re: Sparcstation 20 - Dude, we're getting the band back together!

by Jonathan Katz-2

Wow.

I wonder if this has to do with the various kernel changes between S8 and
S9. Not the MPO stuff (I don't think that applies to sun4m systems) but the
kernel's threading model changed from many-to-many to one-to-one(?) to eke
out more performance.

This shows how to enable one-to-one on Solaris 8. It may be worth a try to
do this to see if you can "break" your Solaris 8 setup.
http://www-01.ibm.com/support/docview.wss?rs=180&uid=swg21107291
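The one-to-one experiment can be run per program rather than system-wide. As we understand Sun's Solaris 8 documentation, the alternate one-to-one thread library lives in `/usr/lib/lwp` and is picked up by putting that directory first on `LD_LIBRARY_PATH`; check the IBM note above before relying on this. The hypothetical wrapper below does that for a single command. The path is the only Solaris-specific assumption, and setuid programs ignore the variable.

```python
import os
import subprocess

# Assumed location of the Solaris 8 alternate (one-to-one) libthread.
LWP_DIR = "/usr/lib/lwp"


def lwp_env(base=None):
    """Copy of `base` (default: os.environ) with LWP_DIR first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base is None else base)
    old = env.get("LD_LIBRARY_PATH")
    env["LD_LIBRARY_PATH"] = LWP_DIR + ":" + old if old else LWP_DIR
    return env


def run_with_alternate_libthread(argv):
    """Run `argv` with the alternate thread library; return its exit status."""
    return subprocess.call(argv, env=lwp_env())
```

Usage would be something like `run_with_alternate_libthread(["seamonkey"])`, followed by the same scroll-a-large-page test that locked up Solaris 9.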

More background info:
http://lkml.indiana.edu/hypermail/linux/kernel/0001.3/0238.html
http://www.j2ee.me/docs/hotspot/threads/threads.html
http://www.northco.net/chenke/project/solaris.html

On Sat, Sep 5, 2009 at 10:36 PM, Sanford Barton <xc68000@...> wrote:

> Guys thank you for the help.  I posted this to sunhelp a couple of
> weeks ago and have been diligently working ever since.  This is the
> current status:

Re: Sparcstation 20 - Dude, we're getting the band back together!

by Mike Shields-2

> So where I'm at is basically 4 possibilities:
>
> 1.)  I'm taxing the SuperSparcs in such a way with the newer software
> that they will never be stable in this system.
> 2.)  I have a batch of bad processor modules.
> 3.)  The CG14 bits are being pushed beyond their limits with the
> SuperSparcs, especially now that it's not using much acceleration for
> 2D (speculation).  If so, why not the same behavior with the ROSS
> processors?  Perhaps they are not able to feed/stress the
> memory/graphics controller the way the SuperSparcs are?
> 4.)  Not a heat or component stress issue at all, but some sort of
> multitasking, OS, or cache/memory controller bug.
>
>
>
Is it possible that it's the Ross PROM? I've got several dual SM71 SS20s
that I've never had issues with. They all use PROM rev 2.22. I have two,
specifically, that run Solaris 9. One runs 24/7, although that one is maxed
out to 512MB and is headless (well, TGX on SBus). I've got another one that
I built for desktop purposes, with 448MB RAM, 8MB VSIMM, 36GB 10krpm drive,
etc. Again, I've never had any issues, running Gnome, web browsers, etc. on
the CG14.

Since you've found a working system in Solaris 8, just consider this another
data point. If you're interested in further investigation, I could collect
revision numbers from various parts for comparison.

Re: Sparcstation 20 - Dude, we're getting the band back together!

by Sanford Barton

@Jonathan - That's a good idea and I'll give it a shot.

@Mike - it feels like a software issue at this point, but I'd be
interested in what release of Solaris 9 you started with... I only
have the FCS Media Kit....I'm not convinced that the latest
recommended patch bundle brought everything forward that needed
updating. I doubt Sun was doing much patch regression testing on the
Sparc20 when 9 was getting updated ;)

As an aside - I got some great steals on some ROSS modules. So pretty
soon I'll be able to test and benchmark the following modules against
each other:

- 2 x ROSS 626D 200MHz 512KB half-speed cache
- 2 x ROSS 626C 150MHz 512KB full-speed cache
- 2 x ROSS 626x 142MHz 1024KB full-speed cache (really interested in
  how these perform)
- 2 x SM71s
- 2 x SM81s

The SPECint/SPECfp values you see for all these processors are really
whacked when you bounce them against real-world tasks.  I'd like to
find a benchmarking package that would give a better representation of
various performance aspects of the above.  It's all for fun and games
at this stage of course :)
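Since the SuperSparc-vs-ROSS difference showed up under context-switch-heavy desktop load, one crude, hypothetical micro-benchmark is a pipe ping-pong between two processes, timed per round trip. It is only a sketch (it assumes a Python 3 with `os.fork`, which a period Solaris install would need added) and no substitute for a proper benchmark suite. On a single CPU each round trip needs at least two context switches; on a dual-CPU box the two processes may simply run on separate processors.

```python
import os
import time


def pipe_pingpong(rounds=10000):
    """Average seconds per one-byte round trip between two processes.

    Hypothetical micro-benchmark: on one CPU every round trip needs at
    least two context switches (parent -> child -> parent).
    """
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: echo each byte straight back, then always exit here.
        try:
            os.close(to_child_w)
            os.close(to_parent_r)
            for _ in range(rounds):
                os.write(to_parent_w, os.read(to_child_r, 1))
        finally:
            os._exit(0)
    # Parent: close the child's ends, then time the bounces.
    os.close(to_child_r)
    os.close(to_parent_w)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(to_child_w, b"x")
        os.read(to_parent_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    os.close(to_child_w)
    os.close(to_parent_r)
    return elapsed / rounds
```

Running `pipe_pingpong()` once per module pair gives one comparable number per set; it says nothing about floating point or cache size on its own.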

Thanks for all the input from everyone btw.