Hi Rhodes,
Thanks for providing your data. I think this is an important subject
and I'm glad the mailing list is discussing it! :-)
I took a look through your slides and read your post and have a few
comments. I also have some data that I gathered to measure the
performance of our pending 3.1 release against 3.0.1.
o I guess you know this, but to be clear: what you've seen has
nothing to do with Jikes RVM per se, but rather, it is a property of
modern high performance VMs. The data below shows that production
JVMs from Sun and IBM exhibit very similar characteristics. You see
similar data in publications, and you'll see it in another form here
(http://dacapo.anu.edu.au/regression/perf/2006-10-MR2.html),
where we continuously track performance for DaCapo, including warm-up
curves. It is also important to acknowledge that this is not a
deficiency in the workloads or in the VMs, but rather it reflects the
way modern applications behave on modern VMs. Above all, as I show
below, varying the architecture is crucial.
o Some of the problems you allude to have been very well thrashed out
in the literature. In particular, there's been a lot of work on
methodology for measuring garbage collection meaningfully, perhaps
less on methodology for JIT evaluation (and these are quite different
problems).
o We should be clear on terminology. I believe standard terminology
is that "invocation" is a JVM invocation, and "iteration" is an
iteration of a benchmark (within a single JVM invocation).
o I found it a bit hard to see exactly what points you are trying to
make. I'm guessing that you're thinking about some of the following:
a. As a VM iterates over a benchmark, performance will typically
improve to some asymptotic limit.
b. Warmup curves for a single invocation do not exhibit monotonic
improvement; performance goes up and down within a given invocation.
c. For a given iteration of a given benchmark, performance for a
given JVM will vary from invocation to invocation.
o You say:
> Indeed, as the results from my slides above indicate, many of the
> benchmarks exhibit significant variability from run to run. It is
> possible to identify a statistically valid mean "best" score, but this
> often requires running well beyond the prescribed number of
> iterations
It's a bit unclear what you mean here. It sounds a bit like
conflation of points a) through c) above. Further, it is important to
understand that there is no "prescribed" number of iterations (or
invocations). As a researcher you need to design your experiment
appropriately, and this means choosing such parameters sensibly, given
your objectives, your context, and your constraints.
So I'm not quite sure what you are getting at. In my previous post
I'd mentioned off-hand using the 4th iteration and 20 invocations when
measuring Jikes RVM's overall performance with the AOS turned on (the
basis for the 4th iteration was that this is roughly the knee in the
curve for Jikes RVM's warmup on DaCapo). So I'm guessing part of
what you say is in response to that. To be sure, if I reduced the
iteration count, then I would be further away from the asymptotically
best performance. On the other hand, increasing the iteration count
would have the opposite effect. The question is whether this matters
or not. The answer is entirely dependent on what it is that you want
to show. More importantly, you need to weigh the cost of further
iterations against what else you could have done with your
experimental budget (ie the "opportunity cost" of running those extra
iterations). You also need to consider what is
"meaningful" (compiling the exact same program N times in a row is
perhaps not particularly "meaningful", if it is a goal that the
evaluation be somehow representative of real-world workloads---FWIW you
may want to think about SPECjvm2008 in this light).
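
To make the "knee in the curve" idea concrete, here is a minimal sketch
of one way to pick an iteration count from a warm-up curve. It is only
an illustration: the function name, the 5% tolerance, and the sample
numbers are all assumptions, and it is not the procedure that was used
to arrive at the 4th iteration. It uses the best observed mean as a
stand-in for the asymptote, which is itself a simplification given that
warm-up curves are not monotonic.

```python
def knee_iteration(mean_times, tolerance=0.05):
    """Return the 1-based iteration whose mean time first comes within
    `tolerance` (a fraction) of the best mean time in the curve.

    `mean_times[i]` is the mean time of iteration i+1, averaged over
    JVM invocations. The best observed time is used as a proxy for the
    asymptote.
    """
    if not mean_times:
        raise ValueError("need at least one iteration")
    best = min(mean_times)
    for i, t in enumerate(mean_times, start=1):
        if t <= best * (1.0 + tolerance):
            return i


# Hypothetical mean times (ms) for iterations 1..8 of one benchmark.
warmup = [4200.0, 2600.0, 2150.0, 2040.0, 2010.0, 2030.0, 1990.0, 2000.0]
print(knee_iteration(warmup))  # prints 4
```

Tightening the tolerance pushes the chosen iteration later, which is
exactly the trade-off against experimental budget discussed above.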
o Your choice of hardware platform can have a dramatic effect, and
will often dominate over other issues (such as asymptotic performance,
as an example). Some machines (such as the Pentium-D and its
cousins ;-) are notoriously brittle. In the case of the Pentium-D,
famously the trace cache can sometimes lead to very counter-intuitive
results, and more generally, the very deep pipe probably accentuates
underlying noise. Running your experiments on multiple machines is
essential. Just as an example, in the results below, for pre3.1, on
antlr, the P4 results have a 95% CI of 15.3% while the C2Q has 3.5% (i.e.
the P4 was about 4.4 X noisier than the C2Q on that particular benchmark).
o A small nit with your graphs... You need to include the origin if
you don't have normalized data. Otherwise your data looks very
exaggerated. This is a standard gotcha :-) Either include the zero
point, or change the y axis so it is normalized.
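
As a rough illustration of the normalization option, the sketch below
divides each configuration's time by a baseline configuration's time
for the same benchmark, so that the baseline sits at 1.0 on the y axis.
The function name, the data layout, and the numbers are hypothetical.

```python
def normalize(times, baseline):
    """Divide each configuration's time by the baseline configuration's
    time for the same benchmark, so the baseline becomes 1.0."""
    normalized = {}
    for bench, by_config in times.items():
        base = by_config[baseline]
        normalized[bench] = {cfg: t / base for cfg, t in by_config.items()}
    return normalized


# Hypothetical mean execution times in ms (not measured data).
times = {
    "antlr": {"3.0.1": 2000.0, "pre3.1": 1800.0},
    "bloat": {"3.0.1": 5000.0, "pre3.1": 5250.0},
}
norm = normalize(times, baseline="3.0.1")
for bench, by_config in norm.items():
    print(bench, by_config)
```

With times normalized this way, values below 1.0 are faster than the
baseline and values above 1.0 are slower, and the size of a difference
can be read directly as a fraction of the baseline.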
o You may find interesting the data I've been gathering to compare
our pending 3.1 release with 3.0.1. Since I'd just read your post, I
went and modified our scripts so that I can produce some warm-up data
(and ran things through 32 iterations just for this experiment!). The
data below is all gathered over 32 iterations of each benchmark and 20
invocations. I show means and 95% confidence intervals (expressed as
a percent of the result). I've done the measurements on an i7, a core
2 quad, a pentium 4 and an atom. I'd normally include a PPC machine
too, for an entirely different ISA, but that turned out not to be
convenient when I set off the runs yesterday. I've included numbers
for Sun's HotSpot and IBM's J9, each with a stack of performance flags
turned on (server mode, etc etc). The data takes time to generate;
the graphs are incrementally updated with new benchmarks as new data
becomes available.
- The ostensible reason for these measurements was to compare 3.0.1
against the pending release. Compare p4, c2q and i7 results:
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4/bmtime.jikes.html
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.jikes.html
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/i7/bmtime.jikes.html
- Warm-up numbers for 4 different JVMs on the c2q (first bar is the
final iteration, subsequent bars are warm-up iterations):
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.jikessvn-warmup.html
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.jikes301-warmup.html
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.sun-warmup.html
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.ibm-warmup.html
- Note that Jikes RVM's warm-up profile is fairly similar to hotspot
- See the 95% CI numbers and see that when a given iteration is
measured 20 times the result is fairly stable (even more so for later
iterations).
- If you look at the same graphs on the P4 you'll see far more
variation
- Take a look at the 95% CI's across JVMs:
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4/bmtime.all.html
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.all.html
- Jikes' variability is similar to each of the other JVMs. The
choice of platform is the biggest factor.
- One way to reduce noise is to turn off the adaptive optimization
system. I've done that here and forced everything to be O1 compiled:
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4/bmtime.aos.html
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.aos.html
- Notice how much lower the 95% CI is, particularly on the noisy P4.
- You can browse the data further if you're interested:
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/i7
http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/atom
(data not online yet at time of writing)
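
For readers who want to reproduce this kind of summary, here is a
minimal sketch of computing, for each iteration, the mean over
invocations and a 95% confidence interval expressed as a percent of
that mean. It assumes Student's t intervals and a
times[invocation][iteration] layout; the scripts that produced the
graphs above may well do this differently, and all names and sample
values here are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of
# freedom (number of invocations minus one). Only a few common sample
# sizes are listed; extend the table as needed.
T_CRIT_95 = {4: 2.776, 9: 2.262, 19: 2.093, 29: 2.045}


def mean_and_ci95_percent(samples):
    """Return (mean, 95% CI half-width as a percent of the mean)."""
    n = len(samples)
    t_crit = T_CRIT_95[n - 1]  # raises KeyError for unlisted sample sizes
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(n)
    return mean, 100.0 * half_width / mean


def summarize(times):
    """times[invocation][iteration] -> one (mean, ci95 %) per iteration."""
    return [mean_and_ci95_percent(list(column)) for column in zip(*times)]


# Hypothetical times (ms): 5 JVM invocations x 3 iterations each.
times = [
    [4100.0, 2500.0, 2100.0],
    [4300.0, 2700.0, 2080.0],
    [4000.0, 2550.0, 2120.0],
    [4500.0, 2650.0, 2090.0],
    [4200.0, 2600.0, 2110.0],
]
for iteration, (mean, ci) in enumerate(summarize(times), start=1):
    print(f"iteration {iteration}: mean {mean:.0f} ms, 95% CI +/- {ci:.1f}%")
```

With 20 invocations the table entry for 19 degrees of freedom applies.
Expressing the interval as a percent of the mean is what makes noise
comparable across benchmarks and machines with very different running
times.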
o There have been quite a few interesting studies of these issues.
Some particularly interesting work has come from Amer Diwan's group,
his colleagues and (former) students. Mike already pointed out one of
their ASPLOS papers from this year.
o My take home from all this:
- There is no simple prescription. You need to understand your
system and your hypothesis, and carefully design the experiments to
suit.
- Consider the opportunity cost when making a decision: you don't
have infinite resources, so if something offers diminishing returns
you need to think very carefully whether your resources would be
better spent running some other experiment, evaluating a new
benchmark, etc. I would never normally run 32 iterations. I just
did that this time out of interest. :-)
- Architecture (in fact the entire environment) really matters.
Results from just one machine can be very misleading.
Thanks again for raising those interesting issues and sharing your
slides with us. The data is really interesting, and this is an
important discussion to have.
Cheers,
--Steve
_______________________________________________
Jikesrvm-researchers mailing list
Jikesrvm-researchers@...
https://lists.sourceforge.net/lists/listinfo/jikesrvm-researchers