Re: [rvm-research] Looking for Sources of Performance Variation

by Steve Blackburn

Hi Rhodes,

Thanks for providing your data.  I think this is an important subject  
and I'm glad the mailing list is discussing it! :-)

I took a look through your slides and read your post and have a few  
comments.  I also have some data that I gathered to measure the  
performance of our pending 3.1 release against 3.0.1.


o  I guess you know this, but to be clear: what you've seen has  
nothing to do with Jikes RVM per se, but rather, it is a property of  
modern high performance VMs.   The data below shows that production  
JVMs from Sun and IBM exhibit very similar characteristics.  You see  
similar data in publications, and you'll see it in another form here  
(http://dacapo.anu.edu.au/regression/perf/2006-10-MR2.html),  
where we continuously track performance for DaCapo, including warm-up  
curves.   It is also important to acknowledge that this is not a  
deficiency in the workloads or in the VMs, but rather it reflects the  
way modern applications behave on modern VMs.  Above all, as I show  
below, varying the architecture is crucial.


o  Some of the problems you allude to have been very well thrashed out  
in the literature.  In particular, there's been a lot of work on  
methodology for measuring garbage collection meaningfully, perhaps  
less on methodology for JIT evaluation (and these are quite different  
problems).


o  We should be clear on terminology.  I believe standard terminology  
is that "invocation" is a JVM invocation, and "iteration" is an  
iteration of a benchmark (within a single JVM invocation).
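
To make the terminology concrete, here is a minimal sketch of the two nested loops.  It is only an illustration: a fresh Python child process stands in for a JVM invocation, the toy workload stands in for a benchmark, and the names `run_invocation` and `run_experiment` are hypothetical, not part of any real harness.

```python
import subprocess
import sys

# Stand-in for a benchmark harness running inside the VM: it times the same
# toy workload `iterations` times within ONE process and prints each time.
CHILD = """
import sys, time
for _ in range(int(sys.argv[1])):
    start = time.perf_counter()
    sum(i * i for i in range(20000))
    print(time.perf_counter() - start)
"""


def run_invocation(iterations):
    """One invocation: start a fresh process and collect its iteration times."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(iterations)],
        capture_output=True, text=True, check=True,
    )
    return [float(line) for line in result.stdout.split()]


def run_experiment(invocations, iterations):
    """times[i][j] is the time of iteration j+1 within invocation i+1."""
    return [run_invocation(iterations) for _ in range(invocations)]


if __name__ == "__main__":
    for row in run_experiment(invocations=3, iterations=4):
        print(["%.4f" % t for t in row])
```

Each row of the result is one invocation; each column is one iteration position across invocations.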


o  I found it a bit hard to see exactly what points you are trying to  
make.  I'm guessing that you're thinking about some of the following:
        a. As a VM iterates over a benchmark, performance will typically  
improve to some asymptotic limit.
        b. Warmup curves for a single invocation do not exhibit monotonic  
improvement; performance goes up and down within a given invocation.
        c. For a given iteration of a given benchmark, performance for a  
given JVM will vary from invocation to invocation.
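
Points a) through c) are three different questions asked of the same table of times.  The sketch below keeps them apart, assuming the data is stored as `times[invocation][iteration]` in seconds (lower is better).  The numbers in `TOY_TIMES` are made up purely for illustration and are not measurements.

```python
from statistics import mean, stdev

# Hypothetical data: 3 invocations x 4 iterations, times in seconds.
TOY_TIMES = [
    [10.0, 6.0, 5.0, 5.2],
    [11.0, 6.5, 4.8, 4.9],
    [10.5, 5.5, 5.6, 5.0],
]


def per_iteration_means(times):
    """(a) Warm-up curve: mean time of each iteration across invocations."""
    return [mean(column) for column in zip(*times)]


def improves_monotonically(curve):
    """(b) True only if no iteration is slower than the one before it."""
    return all(later <= earlier for earlier, later in zip(curve, curve[1:]))


def per_iteration_spread(times):
    """(c) Invocation-to-invocation variation of each iteration, as the
    sample standard deviation divided by the mean."""
    return [stdev(column) / mean(column) for column in zip(*times)]


if __name__ == "__main__":
    print("mean curve:", per_iteration_means(TOY_TIMES))
    print("monotonic per invocation:",
          [improves_monotonically(row) for row in TOY_TIMES])
    print("spread per iteration:", per_iteration_spread(TOY_TIMES))
```

In the toy data every single invocation has a slower iteration somewhere after a faster one (point b), even though each invocation is much faster by the end than at the start (point a).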


o   You say:

> Indeed, as the results from my slides above indicate, many of the
> benchmarks exhibit significant variability from run to run. It is
> possible to identify a statistically valid mean "best" score, but this
> often requires running well beyond the prescribed number of
> iterations

It's a bit unclear what you mean here.  It sounds a bit like  
conflation of points a) through c) above.  Further, it is important to  
understand that there is no "prescribed" number of iterations (or  
invocations).  As a researcher you need to design your experiment  
appropriately, and this means choosing such parameters sensibly, given  
your objectives, your context, and your constraints.

So I'm not quite sure what you are getting at.  In my previous post  
I'd mentioned off-hand using the 4th iteration and 20 invocations when  
measuring Jikes RVM's overall performance with the AOS turned on (the  
basis for the 4th iteration was that this is roughly the knee in the  
curve for Jikes RVM's warmup on DaCapo).   So I'm guessing part of  
what you say is in response to that.  To be sure, if I reduced the  
iteration count, then I would be further away from the asymptotically  
best performance.  On the other hand, increasing the iteration count  
would have the opposite effect.   The question is whether this matters  
or not.  The answer is entirely dependent on what it is that you want  
to show.  More importantly, you need to weigh the cost of further  
iterations against what else you could have done with your  
experimental budget (ie the "opportunity cost" of running those extra  
iterations).  You also need to consider what is  
"meaningful" (compiling the exact same program N times in a row is  
perhaps not particularly "meaningful", if it is a goal that the  
evaluation be somehow representative of real-world workloads---FWIW you  
may want to think about SPECjvm2008 in this light).
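
One way to put a number on "the knee in the curve" is sketched below.  The definition used here is an assumption made for the example, not something stated above: the knee is taken to be the first iteration whose mean time is within a chosen tolerance of the best time seen anywhere on the curve.  The curve values are hypothetical.

```python
def knee(mean_curve, tolerance=0.05):
    """Return the 1-based iteration at which the warm-up curve first comes
    within `tolerance` (as a fraction) of its best (lowest) time."""
    best = min(mean_curve)
    threshold = best * (1.0 + tolerance)
    for index, value in enumerate(mean_curve):
        if value <= threshold:
            return index + 1
    raise ValueError("empty curve")


if __name__ == "__main__":
    # Hypothetical mean times (seconds) for iterations 1..6.
    curve = [10.0, 7.0, 5.5, 5.1, 5.05, 5.0]
    print("knee at iteration", knee(curve))
```

Tightening the tolerance pushes the knee to later iterations, which is exactly the trade-off against experimental budget: every extra iteration is multiplied by the number of invocations, benchmarks and configurations.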


o  Your choice of hardware platform can have a dramatic effect, and  
will often dominate over other issues (such as asymptotic performance,  
as an example).   Some machines (such as the Pentium-D and its  
cousins ;-) are notoriously brittle.  In the case of the Pentium-D,  
famously the trace cache can sometimes lead to very counter-intuitive  
results, and more generally, the very deep pipe probably accentuates  
underlying noise.  Running your experiments on multiple machines is  
essential.  Just as an example, in the results below, for pre3.1, on  
antlr, the P4 results have a 95% CI of 15.3% while the C2Q has 3.5% (ie  
the P4 was more than 4 X noisier than the C2Q on that particular benchmark).
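
For readers who want to reproduce this kind of figure, here is a minimal sketch of a 95% confidence interval expressed as a percentage of the mean.  It assumes a Student's t interval over 20 invocations (19 degrees of freedom, critical value 2.093); the post does not say which interval was actually used, so treat that choice as an assumption.  The ratio of the two quoted figures, 15.3 / 3.5, is about 4.4.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t for 19 degrees of freedom,
# i.e. for n = 20 samples.  Other sample sizes need a different value.
T_975_DF19 = 2.093


def ci95_percent(samples, t_crit=T_975_DF19):
    """Half-width of the 95% CI of the mean, as a percent of the mean."""
    half_width = t_crit * stdev(samples) / sqrt(len(samples))
    return 100.0 * half_width / mean(samples)


if __name__ == "__main__":
    # Hypothetical times (ms) for one iteration over 20 invocations.
    samples = [100.0] * 10 + [110.0] * 10
    print("95%% CI: +/- %.1f%% of the mean" % ci95_percent(samples))
    print("noise ratio of the quoted figures: %.1f" % (15.3 / 3.5))
```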


o  A small nit with your graphs...  You need to include the origin if  
you don't have normalized data.  Otherwise your data looks very  
exaggerated.  This is a standard gotcha :-)  Either include the zero  
point, or change the y axis so it is normalized.
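
Both remedies are small.  A sketch, using hypothetical helper names and made-up times: either divide every value by a chosen baseline so the axis reads as a ratio, or keep the raw values and force the y axis to start at zero.

```python
def normalize(values, baseline):
    """Express each value as a fraction of the baseline."""
    return [value / baseline for value in values]


def y_limits_with_origin(values, headroom=0.05):
    """Axis limits for raw data: start at zero, leave a little headroom."""
    return (0.0, max(values) * (1.0 + headroom))


if __name__ == "__main__":
    # Hypothetical times (ms) for two configurations.
    times = [98.0, 100.0]
    print(normalize(times, baseline=100.0))
    print(y_limits_with_origin(times))
```

On a y axis running from 97 to 101 the two bars above would look wildly different; normalized, or drawn from zero, the 2% gap is shown at its true scale.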


o  You may find interesting the data I've been gathering to compare  
our pending 3.1 release with 3.0.1.  Since I'd just read your post, I  
went and modified our scripts so that I can produce some warm-up data  
(and ran things through 32 iterations just for this experiment!).  The  
data below is all gathered over 32 iterations of each benchmark and 20  
invocations.  I show means and 95% confidence intervals (expressed as  
a percent of the result).  I've done the measurements on an i7, a Core  
2 Quad, a Pentium 4 and an Atom.  I'd normally include a PPC machine  
too, for an entirely different ISA, but that turned out not to be  
convenient when I set off the runs yesterday.  I've included numbers  
for Sun's HotSpot and IBM's J9, each with a stack of performance flags  
turned on (server mode, etc etc).   The data takes time to generate;  
the graphs are incrementally updated with new benchmarks as new data  
becomes available.

- The ostensible reason for these measurements was to compare 3.0.1  
against the pending release.  Compare p4, c2q and i7 results:
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4/bmtime.jikes.html
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.jikes.html
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/i7/bmtime.jikes.html

- Warm-up numbers for 4 different JVMs on the c2q (first bar is the  
final iteration, subsequent bars are warm-up iterations):
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.jikessvn-warmup.html
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.jikes301-warmup.html
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.sun-warmup.html
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.ibm-warmup.html
        - Note that Jikes RVM's warm-up profile is fairly similar to HotSpot's
        - See the 95% CI numbers and see that when a given iteration is  
measured 20 times the result is fairly stable (even more so for later  
iterations).
        - If you look at the same graphs on the P4 you'll see far more  
variation

- Take a look at the 95% CI's across JVMs:
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4/bmtime.all.html
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.all.html
        - Jikes' variability is similar to each of the other JVMs.  The  
choice of platform is the biggest factor.

- One way to reduce noise is to turn off the adaptive optimization  
system.  I've done that here and forced everything to be O1 compiled:
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4/bmtime.aos.html
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.aos.html
        - Notice how much lower the 95% CI is, particularly on the noisy P4.

- You can browse the data further if you're interested:
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/i7
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/atom
        (data not online yet at time of writing)


o  There have been quite a few interesting studies of these issues.  
Some particularly interesting work has come from Amer Diwan's group,  
his colleagues and (former) students.  Mike already pointed out one of  
their ASPLOS papers from this year.


o  My take home from all this:
        - There is no simple prescription.  You need to understand your  
system and your hypothesis, and carefully design the experiments to  
suit.
        - Consider the opportunity cost when making a decision: you don't  
have infinite resources, so if something offers diminishing returns  
you need to think very carefully whether your resources would be  
better spent running some other experiment, evaluating a new  
benchmark, etc.   I would never normally run 32 iterations.  I just  
did that this time out of interest.  :-)
        - Architecture (in fact the entire environment) really matters.    
Results from just one machine can be very misleading.
       

Thanks again for raising those interesting issues and sharing your  
slides with us.   The data is really interesting, and this is an  
important discussion to have.

Cheers,

--Steve

