Note that you need to edit kernel_example to make it use more nodes. Also, for a fixed number of data points, adding nodes beyond a certain point gives no further speedup. With a larger number of data points, a larger cluster can still give a good speedup. See my paper in Computational Economics for more information on that: http://ideas.repec.org/a/kap/compec/v26y2005i2p107-128.html
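
As a rough illustration of why speedup levels off for a fixed data size, here is a toy cost model in Python. It is not taken from kernel_example or from the paper. The function names, the assumption that overhead grows linearly with the number of nodes, and the constants are all hypothetical, chosen only to show the shape of the trade-off.

```python
# Toy model (an assumption, not measured data): compute time is divided
# across nodes, while coordination overhead grows with the node count.

def run_time(n_points, n_nodes, work_per_point=1e-4, overhead_per_node=0.04):
    """Hypothetical wall-clock time in seconds for one run."""
    compute = work_per_point * n_points / n_nodes   # shrinks as nodes are added
    overhead = overhead_per_node * (n_nodes - 1)    # grows as nodes are added
    return compute + overhead


def speedup(n_points, n_nodes):
    """Speedup relative to running the same problem on a single node."""
    return run_time(n_points, 1) / run_time(n_points, n_nodes)


def best_node_count(n_points, max_nodes=64):
    """Node count (1..max_nodes) with the highest speedup under the toy model."""
    return max(range(1, max_nodes + 1), key=lambda p: speedup(n_points, p))


if __name__ == "__main__":
    for n_points in (10_000, 1_000_000):
        p = best_node_count(n_points)
        print(f"{n_points} points: best node count {p}, "
              f"speedup {speedup(n_points, p):.2f}")
```

Under these assumed costs, the small problem stops benefiting after a handful of nodes, while the larger problem keeps gaining from a much bigger cluster. Real numbers depend on the hardware and the interconnect, so treat this only as a picture of the general pattern.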