[rvm-research] Looking for Sources of Performance Variation

This is a troubleshooting question. I am trying to run the dacapo benchmarks with an older revision (14775) of Jikes, using the ''perf'' test-run (I adapted it to run only the dacapo benchmarks), but the measurements turn out to be very unstable. For example, for dacapo-fop, running 9 warm-up + 1 timed iterations for 6 executions mostly gives me results within a limited range (about 3% of variation), but quite a number of measurements (about one fifth) are very far off (more than 10% higher). I have attempted a small baseline compiler modification to save some control flow profiling data (edge counters); when I run the patched VM with this code, all measurements are catapulted into the higher ballpark.

I have switched off the AOS recompilation, which apparently also ensures that there is no invocation threshold-based recompilation. I am running Ubuntu 7.04 in single user mode on a 2-core Intel machine. The configuration is standard (profiled production build with classpath, dacapo 2006/10). I have not manually started additional system services (not even an Xvfb server to run dacapo-chart) and, as stated above, I encounter the phenomenon on the out-of-the-box RVM.

Right now, I am at a loss as to what might cause these fluctuations, so I am interested in any advice/ideas.

Christian Sinschek,
Technische Universität Darmstadt

_______________________________________________
Jikesrvm-researchers mailing list
Jikesrvm-researchers@...
https://lists.sourceforge.net/lists/listinfo/jikesrvm-researchers
Re: [rvm-research] Looking for Sources of Performance Variation

Hi Christian,

There are some very well known sources of performance variation for managed runtimes, such as Jikes RVM. However, it sounds like you have accounted for these. It is a little hard to tell though. It might help if you can provide the following information:

- The *exact* command line used for a specific benchmark
- The *exact* results produced by one of your runs (ideally a log of the 60 results you report below)
- The *exact* hardware you are running on (you say it is dual core, do you mean Core 2 Duo?).

You should not see any significant variation if you turn the AOS off. I do this fairly routinely. Nonetheless, I'd normally take 10 measurements even with the AOS off. With the AOS on, I'd be inclined to take 20 measurements.

I don't think you need to time the 10th iteration. Take a look at the warm-up curves in the right column of this page (http://dacapo.anu.edu.au/regression/perf/2006-10-MR2.html) and you'll see that steady state is reached earlier than that. I typically use the 4th iteration.

Cheers,

--steve

On 07/05/2009, at 10:13 PM, sunai@... wrote:

> This is a troubleshooting question. I am trying to run the dacapo
> benchmarks with an older revision (14775) of Jikes, using the
> ''perf'' test-run (I adapted it to run only the dacapo benchmarks),
> but the measurements turn out to be very unstable. For example, for
> dacapo-fop, running 9 warm-up + 1 timed iterations for 6 executions
> mostly gives me results within a limited range (about 3% of
> variation), but quite a number of measurements (about one fifth) are
> very far off (more than 10% higher). I have attempted a small
> baseline compiler modification to save some control flow profiling
> data (edge counters); when I run the patched VM with this code, all
> measurements are catapulted into the higher ballpark.
>
> I have switched off the AOS recompilation, which apparently also
> ensures that there is no invocation threshold-based recompilation. I
> am running Ubuntu 7.04 in single user mode on a 2-core Intel
> machine. The configuration is standard (profiled production build
> with classpath, dacapo 2006/10). I have not manually started
> additional system services (not even an Xvfb server to run
> dacapo-chart) and, as stated above, I encounter the phenomenon on
> the out-of-the-box RVM.
>
> Right now, I am at a loss as to what might cause these fluctuations,
> so I am interested in any advice/ideas.
>
> Christian Sinschek,
> Technische Universität Darmstadt
Re: [rvm-research] Looking for Sources of Performance Variation

For other sources of variation you might want to read Mytkowicz et al.'s ASPLOS'09 paper (http://www-plan.cs.colorado.edu/klipto/).

Mike

_____________________________________________________________
Michael Hind, Senior Manager, Programming Technologies Department
IBM T.J. Watson Research Center
http://www.research.ibm.com/people/h/hind
914 784-7589
My internal blog: http://blogs.tap.ibm.com/weblogs/hindsight
Re: [rvm-research] Looking for Sources of Performance Variation

Christian (and others),

I thought it appropriate that I chime in to this discussion. As it turns out, I'm planning to give a talk next week on variability in JikesRVM results. While my slides are not yet complete, I've extracted some of the primary graphs generated from DaCapo and SPEC JVM98 and posted them at

http://webhome.cs.uvic.ca/~rhodesb/research/JikesRVM_Performance.pdf

These results were gathered from JikesRVM 3.0.1 (production configuration), running on a dual-core Pentium D with Ubuntu Linux (2.6.24-23-server SMP). The VM heap size was adjusted to 5x the minimum heap for each benchmark, as identified by Georges et al. (see below).

I should note at the outset that my primary interest is not discovering, quantifying and/or controlling performance. Like many others, I am working on an addition to Jikes and simply want to be able to consistently measure the effect of my modifications. Specifically, I would like to isolate the performance yield of different compilation strategies from other factors such as GC, scheduling, etc.

I was relieved when I finally stumbled across the work of Andy Georges and company (http://doi.acm.org/10.1145/1297027.1297033), which confirmed my suspicions that in fact the results of running a standard "production" configuration on DaCapo & JVM98 are often quite unstable.

Indeed, as the results from my slides above indicate, many of the benchmarks exhibit significant variability from run to run. It is possible to identify a statistically valid mean "best" score, but this often requires running well beyond the prescribed number of iterations. Moreover, there is often a distinct difference between taking an average of scores within an execution (capturing overall VM performance) and taking the best score from all repetitions.

Probably the most important finding is that many of these benchmarks do not "converge" or "stabilize" after some minimum number of repetitions. In fact, it is quite clear that for some, the chance of instability actually increases as more adaptive re-compilation is applied.

My personal recommendation is to ignore the DaCapo & SPEC JVM98 notions of "coefficient of variation" (CoV). It is clear that the idea works for some of the programs, and simply does not for others. If you are experimenting with compilation strategies, then it would seem that identifying the "best" performance requires taking an average over many executions of many iterations (I have been using at least 10 runs of 31 repetitions). As pointed out by Georges & co., anything less has a strong likelihood of yielding potentially misleading results.

Regards,

Rhodes Brown
Instructor & Ph.D. Candidate in Computer Science
University of Victoria - Victoria, BC, Canada
http://www.cs.uvic.ca
Re: [rvm-research] Looking for Sources of Performance Variation

I have been informed that the slides I posted earlier do not display properly in some versions of Adobe's PDF reader. I seem to have found a fix and have reposted a new version at the same address:

http://webhome.cs.uvic.ca/~rhodesb/research/JikesRVM_Performance.pdf

For those interested in the statistical data, I collected the following while regenerating the graphs. The results are over repetitions 2-11, 12-21, and 22-31 of each execution (the first repetition is ignored). The number of values sampled is in brackets [n]. The 'a' value is the arithmetic mean, the 'm' value is the median, and the 's' value is the standard deviation. I found lusearch prone to crashing, so more results were gathered for it.

Note, as Christian observed, the variance for fop (and others) is >10% depending on how you measure.

antlr:
  2-11:  [100] a=3251.6, m=3184.5, s=336.136927
  12-21: [100] a=2989.7, m=2919.5, s=306.256151
  22-31: [100] a=2920.7, m=2904.5, s=274.450190
  best:  [10]  a=2606.6, m=2605.5, s=30.434629
bloat:
  2-11:  [100] a=8776.9, m=8705.5, s=482.546524
  12-21: [100] a=8396.5, m=8427.0, s=257.124797
  22-31: [100] a=8358.3, m=8382.5, s=257.532937
  best:  [10]  a=8147.8, m=8143.0, s=213.404571
chart:
  2-11:  [100] a=8935.3, m=8905.0, s=132.823365
  12-21: [100] a=8851.0, m=8847.0, s=62.234483
  22-31: [100] a=8859.2, m=8831.5, s=157.648035
  best:  [10]  a=8773.3, m=8764.5, s=37.016663
eclipse:
  2-11:  [100] a=44795.4, m=44741.0, s=1590.573083
  12-21: [100] a=43536.6, m=43648.0, s=1156.725204
  22-31: [100] a=43182.5, m=43319.5, s=1363.051853
  best:  [10]  a=40892.5, m=40705.0, s=539.081575
fop:
  2-11:  [100] a=1988.8, m=1926.0, s=222.580491
  12-21: [100] a=1829.0, m=1770.5, s=174.642636
  22-31: [100] a=1867.5, m=1761.5, s=419.558799
  best:  [10]  a=1713.6, m=1716.0, s=16.714598
hsqldb:
  2-11:  [100] a=2827.3, m=2664.0, s=566.212160
  12-21: [100] a=2457.0, m=2427.5, s=200.933126
  22-31: [100] a=2395.0, m=2338.5, s=330.049919
  best:  [10]  a=2228.8, m=2224.5, s=37.389244
jython:
  2-11:  [100] a=7245.7, m=7108.0, s=596.781056
  12-21: [100] a=6482.6, m=6459.5, s=189.876568
  22-31: [100] a=6328.3, m=6296.5, s=217.743325
  best:  [10]  a=6218.3, m=6235.5, s=90.553913
luindex:
  2-11:  [100] a=11123.1, m=11052.0, s=468.323222
  12-21: [100] a=10768.1, m=10705.5, s=344.060171
  22-31: [100] a=10793.2, m=10783.5, s=327.755937
  best:  [10]  a=10402.5, m=10399.5, s=88.545343
lusearch:
  2-11:  [120] a=4796.4, m=4626.5, s=433.733814
  12-21: [120] a=4563.7, m=4505.0, s=172.024492
  22-31: [120] a=4497.0, m=4476.0, s=167.335591
  best:  [12]  a=4385.9, m=4409.5, s=92.450929
pmd:
  2-11:  [100] a=5268.2, m=5267.0, s=216.365668
  12-21: [100] a=4979.9, m=4972.5, s=109.359262
  22-31: [100] a=4942.5, m=4931.5, s=121.922934
  best:  [10]  a=4808.6, m=4801.0, s=68.839911
xalan:
  2-11:  [100] a=6555.6, m=6417.5, s=382.274297
  12-21: [100] a=6105.0, m=6099.0, s=208.984870
  22-31: [100] a=5879.5, m=5867.5, s=139.621307
  best:  [10]  a=5731.3, m=5735.5, s=72.657568
compress:
  2-11:  [100] a=4492.4, m=4453.5, s=116.976746
  12-21: [100] a=4506.5, m=4475.5, s=123.351381
  22-31: [100] a=4521.9, m=4501.0, s=123.614577
  best:  [10]  a=4408.5, m=4390.5, s=59.716460
jess:
  2-11:  [100] a=1360.3, m=1349.0, s=48.228056
  12-21: [100] a=1314.4, m=1307.0, s=49.882216
  22-31: [100] a=1306.6, m=1309.0, s=26.932634
  best:  [10]  a=1284.8, m=1285.0, s=16.949598
db:
  2-11:  [100] a=7855.5, m=7850.0, s=31.798753
  12-21: [100] a=7871.3, m=7865.0, s=43.456202
  22-31: [100] a=7868.2, m=7861.5, s=44.100235
  best:  [10]  a=7805.8, m=7801.5, s=17.510632
javac:
  2-11:  [100] a=3755.7, m=3701.0, s=252.211086
  12-21: [100] a=3422.6, m=3427.5, s=65.167588
  22-31: [100] a=3334.0, m=3324.0, s=83.624452
  best:  [10]  a=3268.0, m=3266.0, s=33.986926
mpegaudio:
  2-11:  [100] a=2927.5, m=2933.0, s=45.364823
  12-21: [100] a=2918.1, m=2926.0, s=27.107856
  22-31: [100] a=2940.8, m=2950.0, s=32.512405
  best:  [10]  a=2857.7, m=2865.0, s=37.425630
mtrt:
  2-11:  [100] a=1122.4, m=1110.0, s=70.875478
  12-21: [100] a=1052.6, m=1050.0, s=38.076503
  22-31: [100] a=1054.0, m=1052.0, s=41.170495
  best:  [10]  a=989.0, m=996.5, s=21.192242
jack:
  2-11:  [100] a=2923.7, m=2912.0, s=62.005226
  12-21: [100] a=2847.2, m=2824.0, s=77.455152
  22-31: [100] a=2812.3, m=2808.5, s=42.403597
  best:  [10]  a=2780.6, m=2776.5, s=25.915246

Rhodes Brown
Instructor & Ph.D. Candidate in Computer Science
University of Victoria - Victoria, BC, Canada
http://www.cs.uvic.ca
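For readers who want to produce the same kind of summary from their own timing logs, here is a minimal Python sketch. It is not the script used for the numbers above. It assumes each execution is a list of per-iteration times where lower is better, that "best" means the minimum time of each execution with the first iteration ignored, and that 's' is the sample standard deviation; the names `summarize`, `window_report`, and `format_row` are hypothetical.

```python
from statistics import mean, median, stdev


def summarize(times):
    """Return (n, arithmetic mean, median, sample standard deviation)."""
    return len(times), mean(times), median(times), stdev(times)


def window_report(executions, windows=((2, 11), (12, 21), (22, 31))):
    """Summarize iteration windows pooled over all executions.

    `executions` is a list of runs; each run is a list of iteration
    times (index 0 is iteration 1). Windows are 1-based and inclusive.
    Assumption: "best" is the minimum time of each run, skipping the
    first iteration.
    """
    report = {}
    for lo, hi in windows:
        pooled = [t for run in executions for t in run[lo - 1:hi]]
        report[f"{lo}-{hi}"] = summarize(pooled)
    report["best"] = summarize([min(run[1:]) for run in executions])
    return report


def format_row(label, row):
    """Format one summary row in the bracketed style used in this thread."""
    n, a, m, s = row
    return f"{label}: [{n}] a={a:.1f}, m={m:.1f}, s={s:.6f}"
```

With 10 executions of 31 iterations each, every window pools 100 values and the "best" row has 10 values, which matches the sample counts reported above.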
Re: [rvm-research] Looking for Sources of Performance Variation

Just to note that some variation may also be possible across builds, or across builds using different VMs, especially if doing parallel boot image creation. Note that the object layout in the boot image is now configurable [1].

It's probably not worth looking at SpecJVM'98 too much due to its now very small execution times. Patches for the SpecJVM 2008 harness and to fix the RVM to run it are applied in the MRP source tree [2] (which, as an added bonus, now also runs on Windows using BaseBase configurations).

Ian

[1] http://icooolps.loria.fr/icooolps2008/Papers/ICOOOLPS2008_paper04_Rogers_Zhao_Watson_final.pdf
[2] http://mrp.codehaus.org/
Re: [rvm-research] Looking for Sources of Performance Variation

Hi Rhodes,

Thanks for providing your data. I think this is an important subject and I'm glad the mailing list is discussing it! :-)

I took a look through your slides and read your post and have a few comments. I also have some data that I gathered to measure the performance of our pending 3.1 release against 3.0.1.

o I guess you know this, but to be clear: what you've seen has nothing to do with Jikes RVM per se, but rather, it is a property of modern high performance VMs. The data below shows that production JVMs from Sun and IBM exhibit very similar characteristics. You see similar data in publications, and you'll see it in another form here (http://dacapo.anu.edu.au/regression/perf/2006-10-MR2.html) where we continuously track performance for DaCapo, including warm-up curves. It is also important to acknowledge that this is not a deficiency in the workloads or in the VMs, but rather it reflects the way modern applications behave on modern VMs. Above all, as I show below, varying the architecture is crucial.

o Some of the problems you allude to have been very well thrashed out in the literature. In particular, there's been a lot of work on methodology for measuring garbage collection meaningfully, perhaps less on methodology for JIT evaluation (and these are quite different problems).

o We should be clear on terminology. I believe standard terminology is that "invocation" is a JVM invocation, and "iteration" is an iteration of a benchmark (within a single JVM invocation).

o I found it a bit hard to see exactly what points you are trying to make. I'm guessing that you're thinking about some of the following:

  a. As a VM iterates over a benchmark, performance will typically improve to some asymptotic limit.
  b. Warmup curves for a single invocation do not exhibit monotonic improvement; performance goes up and down within a given invocation.
  c. For a given iteration of a given benchmark, performance for a given JVM will vary from invocation to invocation.

o You say:

  > Indeed, as the results from my slides above indicate, many of the
  > benchmarks exhibit significant variability from run to run. It is
  > possible to identify a statistically valid mean "best" score, but this
  > often requires running well beyond the prescribed number of
  > iterations

  It's a bit unclear what you mean here. It sounds a bit like conflation of points a) through c) above. Further, it is important to understand that there is no "prescribed" number of iterations (or invocations). As a researcher you need to design your experiment appropriately, and this means choosing such parameters sensibly, given your objectives, your context, and your constraints. So I'm not quite sure what you are getting at.

  In my previous post I'd mentioned off-hand using the 4th iteration and 20 invocations when measuring Jikes RVM's overall performance with the AOS turned on (the basis for the 4th iteration was that this is roughly the knee in the curve for Jikes RVM's warmup on DaCapo). So I'm guessing part of what you say is in response to that. To be sure, if I reduced the iteration count, then I would be further away from the asymptotically best performance. On the other hand, increasing the iteration count would have the opposite effect. The question is whether this matters or not. The answer is entirely dependent on what it is that you want to show. More importantly, you need to weigh the cost of further iterations against what else you could have done with your experimental budget (i.e. the "opportunity cost" of running those extra iterations). You also need to consider what is "meaningful" (compiling the exact same program N times in a row is perhaps not particularly "meaningful", if it is a goal that the evaluation be somehow representative of real-world workloads; FWIW you may want to think about SPECjvm2008 in this light).

o Your choice of hardware platform can have a dramatic effect, and will often dominate over other issues (such as asymptotic performance, as an example). Some machines (such as the Pentium-D and its cousins ;-) are notoriously brittle. In the case of the Pentium-D, famously the trace cache can sometimes lead to very counter-intuitive results, and more generally, the very deep pipe probably accentuates underlying noise. Running your experiments on multiple machines is essential. Just as an example, in the results below, for pre3.1, on antlr, the P4 results have a 95% CI of 15.3% while the C2Q has 3.5% (i.e. the P4 was more than 4 X noisier than the C2Q on that particular benchmark).

o A small nit with your graphs... You need to include the origin if you don't have normalized data. Otherwise your data looks very exaggerated. This is a standard gotcha :-) Either include the zero point, or change the y axis so it is normalized.

o You may find interesting the data I've been gathering to compare our pending 3.1 release with 3.0.1. Since I'd just read your post, I went and modified our scripts so that I can produce some warm-up data (and ran things through 32 iterations just for this experiment!). The data below is all gathered over 32 iterations of each benchmark and 20 invocations. I show means and 95% confidence intervals (expressed as a percent of the result). I've done the measurements on an i7, a Core 2 Quad, a Pentium 4 and an Atom. I'd normally include a PPC machine too, for an entirely different ISA, but that turned out not to be convenient when I set off the runs yesterday. I've included numbers for Sun's HotSpot and IBM's J9, each with a stack of performance flags turned on (server mode, etc.). The data takes time to generate; the graphs are incrementally updated with new benchmarks as new data becomes available.

  - The ostensible reason for these measurements was to compare 3.0.1 against the pending release. Compare p4, c2q and i7 results:
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4/bmtime.jikes.html
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.jikes.html
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/i7/bmtime.jikes.html

  - Warm-up numbers for 4 different JVMs on the c2q (first bar is the final iteration, subsequent bars are warm-up iterations):
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.jikessvn-warmup.html
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.jikes301-warmup.html
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.sun-warmup.html
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.ibm-warmup.html

  - Note that Jikes RVM's warm-up profile is fairly similar to HotSpot's.

  - See the 95% CI numbers and see that when a given iteration is measured 20 times the result is fairly stable (even more so for later iterations).

  - If you look at the same graphs on the P4 you'll see far more variation.

  - Take a look at the 95% CIs across JVMs:
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4/bmtime.all.html
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.all.html

  - Jikes' variability is similar to each of the other JVMs. The choice of platform is the biggest factor.

  - One way to reduce noise is to turn off the adaptive optimization system. I've done that here and forced everything to be O1 compiled:
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4/bmtime.aos.html
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q/bmtime.aos.html

  - Notice how much lower the 95% CI is, particularly on the noisy P4.

  - You can browse the data further if you're interested:
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/p4
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/c2q
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/i7
    http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/atom (data not online yet at time of writing)

o There have been quite a few interesting studies of these issues. Some particularly interesting work has come from Amer Diwan's group, his colleagues and (former) students. Mike already pointed out one of their ASPLOS papers from this year.

o My take home from all this:

  - There is no simple prescription. You need to understand your system and your hypothesis, and carefully design the experiments to suit.

  - Consider the opportunity cost when making a decision: you don't have infinite resources, so if something offers diminishing returns you need to think very carefully whether your resources would be better spent running some other experiment, evaluating a new benchmark, etc. I would never normally run 32 iterations. I just did that this time out of interest. :-)

  - Architecture (in fact the entire environment) really matters. Results from just one machine can be very misleading.

Thanks again for raising those interesting issues and sharing your slides with us. The data is really interesting, and this is an important discussion to have.

Cheers,

--Steve
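To make "95% confidence interval expressed as a percent of the result" concrete, here is a minimal Python sketch of one common way to compute it for a fixed iteration measured across several invocations. This is an assumption about the method, not a description of the scripts behind the graphs linked above: it uses a Student-t interval on the mean, and the names `T_975` and `ci95_percent` are hypothetical. Only the two sample sizes recommended earlier in this thread (10 and 20 invocations) are tabulated.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical values, keyed by sample size n
# (degrees of freedom = n - 1). Only the sizes used here are listed.
T_975 = {10: 2.262, 20: 2.093}


def ci95_percent(times):
    """Half-width of the 95% confidence interval on the mean,
    expressed as a percentage of the mean.

    `times` holds one measurement per invocation for the same
    benchmark iteration (for example the 4th iteration of each of
    20 invocations).
    """
    n = len(times)
    half_width = T_975[n] * stdev(times) / sqrt(n)
    return 100.0 * half_width / mean(times)
```

A result of 3.5 from this function would mean the true mean is estimated to lie within plus or minus 3.5% of the measured mean at 95% confidence, which is how the per-benchmark figures quoted for the P4 and C2Q can be compared.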
Re: [rvm-research] Looking for Sources of Performance Variation

Hello again all,
I wanted to complete my talk (to the IFIP Software Implementation Technology Working Group) and get some feedback before following up on Steve's reply to my original post. For those who are interested, I have posted my slides with notes at:
http://webhome.cs.uvic.ca/~rhodesb/research/JikesRVM_Performance-Notes.pdf

As Steve noted, the issues I'm raising aren't necessarily specific to Jikes. Cliff Click confirmed that he'd observed similar behaviors working with the HotSpot VM. However, the issue of measuring performance under adaptive optimization is clearly of particular importance to Jikes researchers--especially those of us whose work doesn't afford the luxury of being able to turn off the AOS.

Steve's point about normalizing data is well-taken. In my posted slides, I have included a second axis that shows iteration times normalized against the best overall time observed for each benchmark. I've also tried to switch from "repetition" to "iteration" of a benchmark, to be more in line with the common terminology. However, I've stuck with "execution" over "invocation" of a VM, since my own work deals with method invocations and I don't want to confuse the two.

That said, if we are on the topic of presentation clarity, I'd like to raise a couple of questions of my own. First is the use of the geometric mean ("geomean") as an aggregate measure of performance. I see this all over the place in papers reporting on Jikes performance, but I have not been able to find a single one that justifies the use of this mean. John makes a fairly cogent argument that performance results, in particular speedup results, should not be summarized with the geomean [1]. Depending on one's emphasis, a weighted arithmetic or harmonic mean is more appropriate.
It would seem that the geomean is only (arguably) appropriate in cases where the results exhibit a log-normal distribution *and* are representative of real workloads--both debatable points when it comes to the commonly used Java benchmarks.

Second, and this is at the core of the point I was trying to make, what is "bmtime"? Is this total running time for some number of iterations? The time from a particular iteration, say the last? An average (mean or median) of iterations within an execution? Does it include JIT compilation, or is such a question even meaningful?

To be clear, let me try to re-state some of the points I was trying to make earlier. My primary intention was to debunk the myth of convergence. Some benchmarks do, after executing a reasonable number of iterations, approach a "typical" performance pattern with a CoV less than 0.02. But some simply do not, regardless of how many executions or iterations are run (antlr and hsqldb are examples). Some do converge for some executions, but not others. Some stabilize, but not to the same performance level. Moreover, many benchmarks actually begin to de-stabilize when run longer with more time for adaptive optimization. Thus, while the method suggested by Georges et al. does provide an appropriate level of rigor, it will not always work and should not be entirely relied upon. And certainly the rudimentary notions of convergence built into DaCapo and SPECjvm98 should not be relied upon.

My second point was to emphasize that there is an important distinction between measuring "typical" performance over a range of iterations from an execution (as done by Georges et al.), and measuring the best performance potential of a particular VM configuration. The former is appropriate when comparing modifications that may affect several sequential iterations, as is the case for most GC strategies. However, identifying the effectiveness of a compilation strategy clearly calls for the latter.
In this case, we are interested in identifying the maximum potential of the generated code while discounting other factors. Of course, to be statistically valid, one must find a mean-best result, not simply take the best overall value.

I would concur with a sentiment seen in several papers on performance analysis, and echoed by Steve above: "There is no simple prescription. You need to understand your system and your hypothesis, and carefully design the experiments to suit." Indeed, but I would go further: When publishing performance results, one must choose an approach that is properly aligned with the subject of one's study (e.g., start-up, long-run GC, long-run adaptive optimization, etc.) *and* present an argument for why the approach is appropriate. This latter part seems absent from most papers on Jikes performance that I have read.

As a final point, I think it is worth noting that we (as a community) have made an inappropriate simplification in treating performance as a "random" variable. While there are many complicated factors that can influence performance from platform to platform, and run to run, the results are still effectively determined by the VM and system configuration. The true source of randomness is timing. Small and unpredictable external pressures ultimately lead to executions that unfold in a bounded, but chaotic fashion. In devising measurement schemes, we ought to be conscious of this effect and aim to extract results in a way that is, as much as possible, oblivious to timing variations. Thus, I would reject methods that report results from a specific iteration, or aggregates over a fixed interval of time or iterations.

--
Rhodes H. F. Brown
Instructor & Ph.D. Candidate in Computer Science
University of Victoria - Victoria, BC, Canada
http://www.cs.uvic.ca

References:
[1] L. K. John. Performance Evaluation and Benchmarking, chapter 4: Aggregating Performance Metrics Over a Benchmark Suite. CRC Press. 2005.
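The two measurements Rhodes distinguishes, a CoV-based convergence check and a "mean-best" result, can be sketched roughly as follows. This is only an illustration: the 0.02 threshold comes from the message, but the window of 5 iterations, the function names and the timing data are all assumptions made up for the example, not anything taken from Jikes RVM, DaCapo or the Georges et al. paper.

```python
from statistics import mean, stdev

def cov(times):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return stdev(times) / mean(times)

def has_converged(times, window=5, threshold=0.02):
    """True if the last `window` iteration times have a CoV below `threshold`.

    The window size is an assumption chosen for this sketch.
    """
    if len(times) < window:
        return False
    return cov(times[-window:]) < threshold

def mean_best(executions):
    """Mean, across executions, of the best (minimum) iteration time in each."""
    return mean(min(times) for times in executions)

# Hypothetical iteration times (ms) for three executions of one benchmark.
executions = [
    [950.0, 720.0, 655.0, 640.0, 642.0, 639.0, 641.0, 640.0],
    [960.0, 730.0, 700.0, 610.0, 690.0, 615.0, 705.0, 620.0],
    [955.0, 725.0, 660.0, 645.0, 644.0, 646.0, 645.0, 643.0],
]

converged = [has_converged(t) for t in executions]
best = mean_best(executions)
print(converged, round(best, 2))
```

In this made-up data the second execution never settles, even though it contains the single fastest iteration. That is the situation described above: a convergence test alone would discard it, while a mean-best summary still uses it.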
Re: [rvm-research] Looking for Sources of Performance Variation

In defense of certain uses of the geo mean ...
A key property of the geo mean is that the geo mean of a collection of ratios is equal to the geo mean of their numerators divided by the geo mean of their denominators. That is, the geo mean of the ratios of pairs of numbers equals the ratio of the geo means. This suggests that it is good for dealing with collections of ratios.

Thus: If I run a benchmark suite of (say) 20 benchmarks, summarizing the overall performance of the suite using the geo mean of the individual performance times is sensible ... when desiring to compare against running the same suite under some different condition (say with a new compiler optimization, or on a different hardware platform). If one takes the geo mean of the times under the "new" treatment and divides that by the geo mean of the times under the "old" treatment, you get (I claim) a sensible summary of the *ratios* of the performance of the benchmarks for each treatment.

A key thing here is that one benchmark may run a lot longer than another one -- but if I take the ratio of the geo means, that does not matter, since what I am really computing is the performance difference as a ratio (i.e., "new" / "old").

The geo mean also tends to prevent an outlier from dominating the measurement of central tendency. I suppose you can like that or dislike it, but it is true.

Beyond this, I have found the use of geo mean (on the one hand) and arithmetic or harmonic mean (on the other hand) to be an issue argued with religious fervor -- much passion, not much conversion. I still stand by it as a sensible way to summarize a benchmark suite's performance, especially for comparing against other runs of the same suite using ratios.
Best wishes -- Eliot Moss
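The property Eliot describes (the geo mean of the ratios equals the ratio of the geo means) is easy to check numerically. The sketch below uses invented timings for four hypothetical benchmarks under an "old" and a "new" treatment; the numbers carry no meaning beyond illustrating the identity, and the arithmetic mean is included only to show that it does not share the property.

```python
import math

def geomean(values):
    """Geometric mean, computed via logs to avoid overflow on long lists."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def arithmean(values):
    return sum(values) / len(values)

# Hypothetical running times (seconds) for the same four benchmarks.
old = [12.0, 340.0, 5.5, 88.0]
new = [10.0, 355.0, 4.4, 80.0]
ratios = [n / o for n, o in zip(new, old)]

geomean_of_ratios = geomean(ratios)
ratio_of_geomeans = geomean(new) / geomean(old)

arith_of_ratios = arithmean(ratios)
ratio_of_ariths = arithmean(new) / arithmean(old)

print(geomean_of_ratios, ratio_of_geomeans)  # identical up to rounding
print(arith_of_ratios, ratio_of_ariths)      # differ: the 340 s benchmark dominates
```

With the arithmetic mean, the long-running benchmark dominates the ratio of means, which is the sensitivity to absolute running time that the message argues the geo mean avoids.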
Re: [rvm-research] Looking for Sources of Performance Variation

On 16/05/2009, at 10:25 AM, Rhodes Brown wrote:
> My primary intention was to debunk the myth of convergence.

I guess I'm not quite sure what the "myth of convergence" is. I think many, if not most, people are aware that performance of a JVM does not always converge to some tightly bounded point within any single invocation.

More broadly, the idea of chaotic behavior is fairly well established. Eliot Moss has been describing JVM performance in exactly those terms ("chaotic behavior") since I was doing a postdoc about 10 years ago. We took this pretty seriously in the context of GC research, because we observed that small perturbations in mutator behavior often manifest as huge swings in GC performance. Amer Diwan's group has also looked at this a lot and has gone further to note chaotic behavior at the hardware level [1].

I made this point in my previous email:

> b. Warmup curves for a single invocation do not exhibit monotonic
> improvement; performance goes up and down within a given invocation.

I think most people reading this list would agree that observation is pretty unremarkable. This is one of the reasons why we take means across a significant number of invocations.

I have not used the per-invocation convergence tools provided by harnesses such as SPEC and DaCapo since that approach is not meaningful in the context of what I normally measure (though I assume they are useful to some people). So perhaps there's some debunking to be done surrounding the use of such tools. I don't know. If that's what you're thinking, then I recommend you come at it with a working alternative in hand.

Let's look at one of the alternative approaches: taking the time for a given iteration and then averaging that over multiple invocations. This is the approach I used in the data I pointed to in the last post. The opaque name "bmtime" just referred to the time the benchmark reported on its final (32nd) iteration. The other pages showed warmup data; times for each iteration.
In the case where replay compilation is used (as in our GC work), this is fairly straightforward. For an AOS, there are at least two questions: a) which iteration/s to time, and b) whether or not one can assume that the average (cross invocation) performance curve is monotonic. If the answer to b) is yes, then it is fairly easy to decide what to do (depending on what you're measuring). If the answer to b) is no, then there are at least two conclusions: one should try to understand why the systemic perturbations arise, and if one cannot remove them, one should mitigate them analytically. Garbage collection is one such source of systemic perturbation, which is one of the reasons why we advocate measuring multiple heap sizes.

Clearly if you present results that average particular iterations across invocations, you must choose the iterations carefully. You may have noticed that in the Jikes RVM 6-hourly performance regressions [2], we report 1st, 3rd and 10th iterations, as well as numbers for both generous and "tight" heaps. Now this is not deeply principled, but it does give you _some_ insight into the startup overhead, the steady state performance, the rate of convergence, and the effect of heap pressure. On the dacapo web site [3] we show the same three results and additionally the warmup curves. I decided to include the warmup curves because I believed in point b) above and think it is important to see how each of the systems is warming up.

So, if steady state and warmup are important to what it is that you're measuring, it is pretty clear that you should explicitly measure (and plot) that. Which is why we do. And it is clear from those curves that some benchmarks and some VMs have particularly chaotic behavior.

Finally, just as I've struggled a bit to nail down what problem it is you're pointing to, I'm not quite sure what it is you're proposing by way of a solution.

A few other minor notes:

o You might find this data interesting.
Here I measure 49 iterations (I only plot every 5th for the sake of space, but can add them all in if anyone really wants it), on both hotspot and jikes rvm. One thing to note is that this data is pretty smooth and I don't think there are any examples of iteration-to-iteration variance that goes outside the expected monotonic convergence curve by an amount that is greater than the 95% CI of the measured point. If there are, then that's quite interesting (I didn't spot any).
  http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/i7wu/bmtime.jikes-warmup.html
  http://cs.anu.edu.au/~Steve.Blackburn/private/results/jikesrvm-performance-2009/i7wu/bmtime.sun-warmup.html

o In deference to the differing opinions on how to aggregate data, I generally include both arithmetic and geometric mean when I publish. I also include min and max results---these often get lost and are sometimes the most important information. I've recently started taking this further and including (on my web page) all of the raw and tabulated data so that other researchers can scrutinize it at will.

o You make the comment that it is debatable whether benchmarks (including dacapo) are representative of real world workloads. Well yes. No suite can be perfect. However, the dacapo suite explicitly _tries_ to do this by trying to use unmodified source of widely used java programs. Since dacapo is open source, the onus is on you and other researchers to provide concrete feedback, better yet, to propose better workloads and contribute source. It is only through contributions such as this that the workload stays live and lives up to its objective of reflecting real world workloads. Right now we're preparing for a new release. Aside from contributing new workloads, you can help the Jikes RVM research community enormously by downloading the source from svn and getting batik, fop, sunflow and tomcat working (these are all broken on Jikes RVM [3] but are expected to be in the next release).
We plan to drop antlr, bloat, chart and hsqldb.

o I think all of this requires some perspective. My belief (of course I have no data to support it :-) is that if I were to sample publications from top-tier venues and critique their findings, many may have sub-standard analysis, but I suspect only a few fall to the point that their findings are actually false (due to failure of their analysis). However, time and time again, I find results that I suspect (and sometimes have gone and verified) are indeed false due to more basic methodological failings, such as use of just a single hardware platform that happens to significantly bias the result, or running at just a single heap size, etc.

o When researchers publish their source, they allow other researchers to confirm their findings. I would like it very much if members of the Jikes RVM community would routinely publish their source along with their publication. Better yet, if the findings are interesting, please contribute your outcome to the project.

o I want to re-iterate the point about opportunity cost, which I think is the key to this discussion. As a researcher, you have a finite experimental budget. You need to choose whether to spend that budget on evaluating different heap sizes, more iterations, more configurations, etc. Whatever anyone may wish to say about experimental design and methodology, if the approach is not explicitly acknowledging opportunity cost, then it is not grounded in reality so I'm inclined to read it with a dose of skepticism.

o Finally, these discussions are healthy. However, such discussions are always better if they're backed by concrete, constructive outcomes. Specifically, I always like to hear concrete suggestions on how to resolve concrete problems: "A lot of researchers need to measure X. It seems that the approach is biased or flawed because of A, B & C. Here's a concrete approach for measuring X which addresses those shortcomings."
Or, contribute new, more realistic workloads to the benchmark suite. Or, contribute harnesses which help suites such as dacapo generate more meaningful results. In a nutshell: Existing methodology is imperfect; we all find it easy to identify flaws. Having identified some flaws, we each need to ask what constructive thing we can do about it.

Thanks again for contributing your thoughts to this mailing list. I think the discussion will help us all. It has helped me.

--Steve

PS, in the course of looking at what you had to say, I noticed that Jikes RVM was warming up slowly (something Dave had noticed two years ago). So I adjusted the sample rate which gave the whole VM a nice little performance kick, just in time for the upcoming 3.1 release :-) [2]

[1] http://www.cs.colorado.edu/department/publications/reports/docs/CU-CS-1031-07.pdf
[2] http://jikesrvm.anu.edu.au/cattrack/results/habanero.anu.edu.au/perf/9230/performance_report
[3] http://dacapo.anu.edu.au/regression/perf/head.html
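The approach Steve describes, timing a given iteration and averaging it over multiple invocations, together with his question of whether the averaged warmup curve is monotonic, might be sketched like this. Everything specific here is an assumption for illustration: the data is invented, the 95% interval uses a normal approximation (with only a handful of invocations a Student's t interval would be wider), and "slower than the previous iteration by more than both half-widths combined" is just one possible reading of a step that falls outside the expected curve.

```python
import math
from statistics import mean, stdev

def mean_ci95(samples):
    """Mean and half-width of an approximate 95% CI (normal approximation)."""
    half_width = 1.96 * stdev(samples) / math.sqrt(len(samples))
    return mean(samples), half_width

def warmup_curve(invocations):
    """Per-iteration (mean, half-width) across invocations.

    invocations[i][j] is the time of iteration j in invocation i.
    """
    return [mean_ci95(column) for column in zip(*invocations)]

def regressions(curve):
    """Iterations whose mean is slower than the previous iteration's mean
    by more than the two intervals' combined half-widths."""
    return [
        j for j in range(1, len(curve))
        if curve[j][0] - curve[j - 1][0] > curve[j][1] + curve[j - 1][1]
    ]

# Hypothetical times (ms): 4 invocations x 5 iterations of one benchmark.
invocations = [
    [900.0, 700.0, 650.0, 640.0, 700.0],
    [910.0, 705.0, 655.0, 642.0, 704.0],
    [905.0, 698.0, 648.0, 639.0, 698.0],
    [895.0, 702.0, 652.0, 641.0, 702.0],
]

curve = warmup_curve(invocations)
flagged = regressions(curve)
for j, (m, h) in enumerate(curve):
    print(f"iteration {j + 1}: {m:.1f} +/- {h:.1f} ms")
print("non-monotonic steps at iteration index:", flagged)
```

In this invented data the fifth iteration is consistently slower in every invocation, so it is flagged as a systemic perturbation rather than noise. That is the case where the message suggests trying to understand the cause (GC, for instance) before deciding which iterations to report.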
Re: [rvm-research] Looking for Sources of Performance Variation

2009/5/16 Steve Blackburn <Steve.Blackburn@...>:
>...
> o Finally, these discussions are healthy. However, such discussions
> are always better if they're backed by concrete, constructive
> outcomes. Specifically, I always like to hear concrete suggestions on
> how to resolve concrete problems: "A lot of researchers need to
> measure X. It seems that the approach is biased or flawed because of
> A, B & C. Here's a concrete approach for measuring X which addresses
> those shortcomings." Or, contribute new, more realistic workloads to
> the benchmark suite. Or, contribute harnesses which help suites such
> as dacapo generate more meaningful results. In a nutshell: Existing
> methodology is imperfect; we all find it easy to identify flaws.
> Having identified some flaws, we each need to ask what constructive
> thing can we do about it.

One thing that is currently "known" in industry but not tested in Jikes RVM is that "real VMs" must deal with large cold methods - the main example being JSP code. Given this is a performance pathology for Jikes RVM (it must always baseline compile, carry round GC maps, etc.) it's quite telling that it's not had to care about it because no major benchmark suite has these methods in. Another example is DaCapo not touching any Java 5 features and thereby not justifying any optimizations on generics. SPECjvm2008 does a better job of using Java 5 features, but the JSP problem remains open for benchmark contributions.

Regards,
Ian

--
MRP == More Research Please, run SPECjvm2008 on a Jikes RVM based source base