Re: [rvm-research] Looking for Sources of Performance Variation

by Steve Blackburn

On 16/05/2009, at 10:25 AM, Rhodes Brown wrote:

> My primary intention was to debunk the myth of convergence.


I guess I'm not quite sure what the "myth of convergence" is.   I  
think many, if not most, people are aware that performance of a JVM  
does not always converge to some tightly bounded point within any  
single invocation.   More broadly, the idea of chaotic behavior is  
fairly well established.  Eliot Moss has been describing JVM  
performance in exactly those terms ("chaotic behavior") since I was  
doing a postdoc about 10 years ago.  We took this pretty seriously in  
the context of GC research, because we observed that small  
perturbations in mutator behavior often manifest as huge swings in GC  
performance.   Amer Diwan's group has also looked at this a lot and  
has gone further to note chaotic behavior at the hardware level [1].

I made this point in my previous email:

> b. Warmup curves for a single invocation do not exhibit monotonic
> improvement; performance goes up and down within a given invocation.

I think most people reading this list would agree that observation is  
pretty unremarkable.   This is one of the reasons why we take means  
across a significant number of invocations.  I have not used the per-
invocation convergence tools provided by harnesses such as SPEC and  
DaCapo since that approach is not meaningful in the context of what I  
normally measure (though I assume they are useful to some people).  So  
perhaps there's some debunking to be done surrounding the use of such  
tools.  I don't know.   If that's what you're thinking, then I  
recommend you come at it with a working alternative in hand.

Let's look at one of the alternative approaches: taking the time for  
a given iteration and then averaging that over multiple invocations.    
This is the approach I used in the data I pointed to in the last  
post.  The opaque name "bmtime" just referred to the time the  
benchmark reported on its final (32nd) iteration.  The other pages  
showed warmup data; times for each iteration.
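
As a rough illustration of that approach (not the scripts that produced
the linked pages), here is a minimal Python sketch.  It assumes a
hypothetical layout where times[i][j] is the time reported for
iteration j of invocation i, and it uses a normal approximation for the
95% confidence interval; both the layout and the approximation are
assumptions of this sketch.

```python
import math
import statistics


def final_iteration_summary(times):
    """Mean and 95% CI half-width of the final-iteration time.

    times[i][j] is the time of iteration j in invocation i (a
    hypothetical layout). Needs at least two invocations.
    """
    # One sample per invocation: the last iteration it reported.
    finals = [invocation[-1] for invocation in times]
    mean = statistics.mean(finals)
    # Normal approximation (z is about 1.96); with few invocations a
    # Student's t critical value would give a wider interval.
    z = statistics.NormalDist().inv_cdf(0.975)
    half_width = z * statistics.stdev(finals) / math.sqrt(len(finals))
    return mean, half_width


if __name__ == "__main__":
    # Three made-up invocations of three iterations each.
    example = [[10.0, 6.0, 5.0], [11.0, 6.5, 5.2], [9.5, 5.8, 4.8]]
    mean, half_width = final_iteration_summary(example)
    print(f"final iteration: {mean:.2f} +/- {half_width:.2f}")
```

Each invocation contributes exactly one sample, so the interval reflects
invocation-to-invocation variation rather than variation within a run.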

In the case where replay compilation is used (as in our GC work), this  
is fairly straightforward.   For an AOS, there are at least two  
questions: a) which iteration/s to time, and b) whether or not one can  
assume that the average (cross invocation) performance curve is  
monotonic.   If the answer to b) is yes, then it is fairly easy to  
decide what to do (depending on what you're measuring).   If the  
answer to b) is no, then there are at least two conclusions: one  
should try to understand why the systemic perturbations arise, and if  
one cannot remove them, one should mitigate against them  
analytically.  Garbage collection is one such source of systemic  
perturbation, which is one of the reasons why we advocate measuring  
multiple heap sizes.
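
One way to look at question b) is to build the cross-invocation mean
curve and flag iterations that get slower than the best earlier
iteration by more than their own 95% CI half-width.  The sketch below
does that in Python.  The data layout (times[i][j] for invocation i,
iteration j), the normal-approximation interval, and the particular
flagging rule are all assumptions of this sketch, not an established
test.

```python
import math
import statistics


def mean_warmup_curve(times):
    """Per-iteration (mean, 95% CI half-width) across invocations.

    times[i][j] is the time of iteration j in invocation i (a
    hypothetical layout). Needs at least two invocations.
    """
    z = statistics.NormalDist().inv_cdf(0.975)  # normal approximation
    curve = []
    # zip(*times) regroups the data by iteration instead of invocation.
    for samples in zip(*times):
        mean = statistics.mean(samples)
        half = z * statistics.stdev(samples) / math.sqrt(len(samples))
        curve.append((mean, half))
    return curve


def non_monotonic_iterations(curve):
    """Zero-based iterations slower than the best earlier mean by more
    than their own CI half-width."""
    flagged = []
    best = math.inf
    for j, (mean, half) in enumerate(curve):
        if mean - best > half:
            flagged.append(j)
        best = min(best, mean)
    return flagged


if __name__ == "__main__":
    # Made-up data: the third iteration regresses in every invocation.
    example = [
        [10.0, 6.0, 7.5, 5.0],
        [10.2, 6.1, 7.6, 5.1],
        [9.8, 5.9, 7.4, 4.9],
    ]
    print(non_monotonic_iterations(mean_warmup_curve(example)))
```

An empty result is consistent with a monotonic average curve at this
confidence level; a flagged iteration is a candidate systemic
perturbation worth investigating.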

Clearly, if one presents results that average particular iterations  
across invocations, one must choose the iterations carefully.   You  
may have noticed that in the jikes rvm 6-hourly performance  
regressions [2], we report 1st, 3rd and 10th iterations, as well as  
numbers for both generous and "tight" heaps.   Now this is not deeply  
principled, but it does give you _some_ insight into the startup  
overhead, the steady state performance, the rate of convergence, and  
the effect of heap pressure.  On the dacapo web site [3] we show the  
same three results and additionally the warmup curves.   I decided to  
include the warmup curves because I believed in point b) above and  
think it is important to see how each of the systems is warming up.

So, if steady state and warmup are important to what it is that you're  
measuring, it is pretty clear that you should explicitly measure (and  
plot) that.  Which is why we do.  And it is clear from those curves  
that some benchmarks and some VMs have particularly chaotic behavior.

Finally, just as I've struggled a bit to nail down what problem it is  
you're pointing to, I'm not quite sure what it is you're proposing by  
way of a solution.


A few other minor notes:

o You might find this data interesting.  Here I measure 49 iterations  
(I only plot every 5th for the sake of space, but can add them all in  
if anyone really wants it), on both hotspot and jikes rvm.   One thing  
to note is that this data is pretty smooth and I don't think there are  
any examples of iteration-to-iteration variance that go outside the  
expected monotonic convergence curve by an amount that is greater than  
the 95% CI of the measured point.   If there are, then that's quite  
interesting (I didn't spot any).

        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/i7wu/bmtime.jikes-warmup.html
        http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/i7wu/bmtime.sun-warmup.html

o  In deference to the differing opinions on how to aggregate data, I  
generally include both arithmetic and geometric mean when I publish.  
I also include min and max results---these often get lost and are  
sometimes the most important information.  I've recently started  
taking this further and including (on my web page) all of the raw and  
tabulated data so that other researchers can scrutinize it at will.

o  You make the comment that it is debatable whether benchmarks  
(including dacapo) are representative of real world workloads.  Well  
yes.  No suite can be perfect.  However, the dacapo suite explicitly  
_tries_ to do this by using unmodified source of widely used  
java programs.  Since dacapo is open source, the onus is on you and  
other researchers to provide concrete feedback, better yet, to propose  
better workloads and contribute source.   It is only through  
contributions such as this that the workload stays live and lives up  
to its objective of reflecting real world workloads.   Right now we're  
preparing for a new release.   Aside from contributing new workloads,  
you can help the Jikes RVM research community enormously by  
downloading the source from svn and getting batik, fop, sunflow and  
tomcat working (these are all broken on Jikes RVM [3] but are  
expected to be in the next release).   We plan to drop antlr, bloat,  
chart and hsqldb.

o  I think all of this requires some perspective.  My belief (of  
course I have no data to support it :-) is that if I were to sample  
publications from top-tier venues and critique their findings, many  
may have sub-standard analysis, but I suspect only a few fall to the  
point that their findings are actually false (due to failure of their  
analysis).  However, time and time again, I find results that I  
suspect (and sometimes have gone and verified) are indeed false due to  
more basic methodological failings, such as use of just a single  
hardware platform that happens to significantly bias the result, or  
running at just a single heap size, etc.

o  When researchers publish their source, they allow other researchers  
to confirm their findings.   I would like it very much if members of  
the Jikes RVM community would routinely publish their source along  
with their publication.   Better yet, if the findings are interesting,  
please contribute your outcome to the project.

o  I want to re-iterate the point about opportunity cost, which I  
think is the key to this discussion.  As a researcher, you have a  
finite experimental budget.   You need to choose whether to spend that  
budget on evaluating different heap sizes, more iterations, more  
configurations, etc.  Whatever anyone may wish to say about  
experimental design and methodology, if the approach is not explicitly  
acknowledging opportunity cost, then it is not grounded in reality so  
I'm inclined to read it with a dose of skepticism.

o  Finally, these discussions are healthy.   However, such discussions  
are always better if they're backed by concrete, constructive  
outcomes.  Specifically, I always like to hear concrete suggestions on  
how to resolve concrete problems: "A lot of researchers need to  
measure X.  It seems that the approach is biased or flawed because of  
A, B & C.  Here's a concrete approach for measuring X which addresses  
those shortcomings."  Or, contribute new, more realistic workloads to  
the benchmark suite.   Or, contribute harnesses which help suites such  
as dacapo generate more meaningful results.  In a nutshell: Existing  
methodology is imperfect; we all find it easy to identify flaws.  
Having identified some flaws, we each need to ask what constructive  
thing we can do about it.
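
To make the earlier note on aggregation concrete, here is a minimal
Python sketch that reports arithmetic mean, geometric mean, min and max
side by side.  The input is assumed to be a list of positive values,
for example per-benchmark times normalized to a baseline; that choice of
input is an assumption of this sketch.

```python
import statistics


def summarize(values):
    """Report several aggregates side by side so none of them is lost.

    values must be positive (the geometric mean is undefined otherwise).
    """
    return {
        "arithmetic_mean": statistics.mean(values),
        "geometric_mean": statistics.geometric_mean(values),
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    # Made-up normalized times for three benchmarks.
    for name, value in summarize([1.0, 4.0, 16.0]).items():
        print(f"{name}: {value:.2f}")
```

A large gap between the two means, or between min and max, is a hint
that a single outlier is dominating the summary.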

Thanks again for contributing your thoughts to this mailing list.   I  
think the discussion will help us all.  It has helped me.

--Steve

PS, in the course of looking at what you had to say, I noticed that  
Jikes RVM was warming up slowly (something Dave had noticed two years  
ago).   So I adjusted the sample rate which gave the whole VM a nice  
little performance kick, just in time for the upcoming 3.1 release :-)  
[2]

[1] http://www.cs.colorado.edu/department/publications/reports/docs/CU-CS-1031-07.pdf
[2] http://jikesrvm.anu.edu.au/cattrack/results/habanero.anu.edu.au/perf/9230/performance_report
[3] http://dacapo.anu.edu.au/regression/perf/head.html

