WARNING: This server is unstable and will be retired in the next days. If you want to keep this forum available, please request immediately a migration on the Nabble Support forum. Forums that don't receive any migration request will be deleted forever.

Mutex performance

View: New views
3 Messages — Rating Filter:   Alert me  

Mutex performance

by David Chisnall :: Rate this Message:

| View Threaded | Show Only this Message

Hi everyone,

[Long, rambling, and potentially boring email follows.  If you are not  
interested by different threading implementations then please stop  
reading now.]

I thought I'd write a simple spin lock implementation, using inline  
functions and an atomic test-and-set operation to compare efficiency  
when implementing something like Apple's Grand Central.  I wrote a  
simple program which spawned n threads, each incremented a 64-bit  
volatile global by m, acquiring and releasing a lock on either side of  
the increment, waited for all of the threads to exit, then tested that  
the counter was equal to n * m.  I used a 64-bit global on i386 so  
that it would require two loads and two stores and give extra time for  
things to go wrong from lack of synchronisation.

My spin lock was very simple; it just set an int value to 1 using an  
atomic test-and-set instruction and called sched_yield() (which gives  
scheduling time to another thread) if another thread held the lock.

By setting n to 1, I could simulate the condition where there was no  
lock contention.  On all the platforms I tried (OS X, GNU/Linux, and  
FreeBSD), performance was similar (within an order of magnitude) using  
a native pthread mutex and a spinlock.

The interesting things happened when I set n to 4, giving more  
contention.  On GNU/Linux and FreeBSD, performance using a pthread  
mutex was very close to using my spinlock - presumably they use very  
similar implementations internally, but have a cross-library function  
call on top.  On OS X, the story was somewhat different.  For small  
values of m, the performance was a couple of orders of magnitude  
worse.  For larger values of m, it was so bad I gave up waiting.  It  
seems that Darwin's libc does a couple of system calls for mutex  
operations, which gives horrendous performance.  You can see this from  
the results of timing the program (n = 4, m=100,000) on OS X:

real 0m15.937s
user 0m4.881s
sys 0m14.570s

Note the sys value; almost all of the time is spent in the kernel.  
Interestingly, with n set to 1, almost no time was spent in the  
kernel; apparently Apple optimise heavily for the uncondented case  
although since running the program on a 2.16GH Core 2 Duo on OS X with  
n=1 was only 10% faster than running it with n=1 on a 1.15GHz Athlon  
running FreeBSD, I suspect that this is a serious case of misplaced  
optimisation.  Note that this is not optimisation for the purely  
single-threaded case; spawning another thread that does nothing does  
not change the results by a measurable amount.

Out of interest, I tried the same test program using the libobjc mutex  
functions and with no locking (n=4, m=100,00,000) on FreeBSD.  The  
results were about what I'd expect (I've only quoted the user times,  
because the system times are approximately 0 in all cases and the real  
time is close to the user time in all cases):

objc:
user 0m8.607s

pthread:
user 0m3.493s

spinlock (lock functions not inlined; inlining shaves around 30-40%  
off  times):
user 0m1.372s

no locking (generates the wrong result about 50% of the time):
user 0m0.219s

The no locking version, obviously, is much faster, but gives the wrong  
results.  I also tried a version of the spinlock that didn't yield in  
case of contention and this performed about 50% worse than the one  
that did.  Adding any kind of locking makes it a lot slower, with the  
performance differences between my spinlock, pthread (which does a  
little more error checking and has the call to position-independent  
code) and the libobjc (which adds some more layers of indirection)  
ones being about what I anticipated.

I was surprised that the libobjc mutex is (on FreeBSD, at least; the  
GNU/Linux machine I had access to didn't have libobjc installed) so  
much slower than using pthreads directly.  I was expecting it to be  
closer to 5-6s.  For comparison, I wrote a trivial class which  
inherited from Object and provided -lock and -unlock methods that  
called the corresponding pthread mutex functions and was surprised to  
discover that this was faster than using libobjc functions; somehow  
they add more overhead than a message lookup.

pthread via message send:
user 0m6.308s

(This is good news; I recently rewrote all of the GNUstep  
synchronisation classes to use pthreads directly instead of the  
libobjc abstraction layer and Gregory is going to commit my diff after  
the next release of -base, so expect NSLock to speed up a lot at the  
next major [ABI-breaking] GNUstep-base release.  Extrapolating from  
these numbers I'd expect using the current NSLock to take almost twice  
as long).

One of the main reasons I did this was that I was interested by  
Apple's claim that adding a new object to a Grand Central Dispatch  
queue in 15 instructions.  This is somewhat disingenuous.  Locking and  
unlocking the spinlock is 5 instructions, but these instructions  
acquire the lock line and so are not particularly fast instructions.  
In terms of real-world performance, acquiring and releasing a pthread  
mutex is fairly close (factor of two) on any platform I tried except  
OS X; maybe they'd shout about it a bit less if their pthread  
implementation[1] were not an embarrassment, as the performance gain  
relative to Apple's mutex implementation is a factor of 100 or so,  
while relative to anyone else's (including one I wrote in five  
minutes) is closer to a factor of 2.  In short, not an optimisation  
worth bothering with unless we discover that it's really a bottleneck.

David

[1] I also tried on OS X using their implementation of NSLock and got  
almost identical numbers to using pthreads directly; the overhead of  
the message send is very small.

_______________________________________________
Etoile-dev mailing list
Etoile-dev@...
https://mail.gna.org/listinfo/etoile-dev

Re: Mutex performance

by Quentin Mathé-2 :: Rate this Message:

| View Threaded | Show Only this Message

Hi David,

Le 10 août 09 à 01:36, David Chisnall a écrit :

> One of the main reasons I did this was that I was interested by
> Apple's claim that adding a new object to a Grand Central Dispatch
> queue in 15 instructions.  This is somewhat disingenuous.  Locking and
> unlocking the spinlock is 5 instructions, but these instructions
> acquire the lock line and so are not particularly fast instructions.
> In terms of real-world performance, acquiring and releasing a pthread
> mutex is fairly close (factor of two) on any platform I tried

Interesting to know.

> except  OS X; maybe they'd shout about it a bit less if their pthread
> implementation[1] were not an embarrassment, as the performance gain
> relative to Apple's mutex implementation is a factor of 100 or so,
> while relative to anyone else's (including one I wrote in five
> minutes) is closer to a factor of 2.

The result could be completly different in Mac OS X 10.6 though.
In fact, I'd be suprised if they didn't rewrite their mutex  
implementation given the result you report.

Cheers,
Quentin.
_______________________________________________
Etoile-dev mailing list
Etoile-dev@...
https://mail.gna.org/listinfo/etoile-dev

Re: Mutex performance

by David Chisnall :: Rate this Message:

| View Threaded | Show Only this Message

On 10 Aug 2009, at 23:42, Quentin Mathé wrote:

> Hi David,
>
> Le 10 août 09 à 01:36, David Chisnall a écrit :
>
>> One of the main reasons I did this was that I was interested by
>> Apple's claim that adding a new object to a Grand Central Dispatch
>> queue in 15 instructions.  This is somewhat disingenuous.  Locking  
>> and
>> unlocking the spinlock is 5 instructions, but these instructions
>> acquire the lock line and so are not particularly fast instructions.
>> In terms of real-world performance, acquiring and releasing a pthread
>> mutex is fairly close (factor of two) on any platform I tried
>
> Interesting to know.

Yup.  I was wondering if they were using some clever lockless  
structure, but if we just use an inline version of the little mutex I  
wrote using the GCC atomic ops (on OS X I had to use some inline asm  
because Apple's GCC 4.0 doesn't support atomic ops) then we should get  
the same performance.  Of course, a full implementation of GCD needs  
feedback from the kernel to let each process determine the optimum  
number of threads to spawn.  This can possibly be done using the  
existing code we have for counting the number of CPUs and a small  
amount extra for the amount of load.

>> except  OS X; maybe they'd shout about it a bit less if their pthread
>> implementation[1] were not an embarrassment, as the performance gain
>> relative to Apple's mutex implementation is a factor of 100 or so,
>> while relative to anyone else's (including one I wrote in five
>> minutes) is closer to a factor of 2.
>
> The result could be completly different in Mac OS X 10.6 though.
> In fact, I'd be suprised if they didn't rewrite their mutex
> implementation given the result you report.

Looking a bit more carefully, it appears that the problem is that they  
use a Mach semaphore rather than sched_yield() to make the mutex  
sleep.  This is an optimisation in cases where you expect threads to  
block for a long time but generally this is not how you use a pthread  
mutex (that's what a condition variable is for).  A better  
implementation would spin a few times without sleeping, a few more  
times calling sched_yield(), and then wait on a semaphore atomically  
setting a flag in the mutex to instruct the corresponding unlock call  
to signal the semaphore.  They may be doing this and just have really  
badly-tuned heuristics, in which case it's quite easy to fix.  If they  
haven't already fixed it though, they probably won't before 10.6 is  
released; changes to mutex implementation are not things that you  
should rush.

A perfect implementation of a mutex would encode the average waiting  
time for each {thread, mutex} pair and use this to determine the  
waiting policy.  In most cases, however, sched_yield() is sufficient  
because good code doesn't hold mutexes for very long (Amdahl's law)  
and so just kicking the other threads to run for a bit is usually  
sufficient to let the one that holds it give it up.  And, of course,  
you should never be using a mutex in a situation where you need  
latency guarantees.

In short, it looks like someone just copied the SysV semaphore code  
into the pthreads implementation without noting that SysV semaphores  
and pthread mutexes are used in very different ways.

David
_______________________________________________
Etoile-dev mailing list
Etoile-dev@...
https://mail.gna.org/listinfo/etoile-dev