3.9.5: much improved threading

by Clint Whaley

Guys,

I have finally gotten 3.9.5 out the door.  It's been several months, but I've
actually been busy the whole time.  The big news is that I finally finished
a complete rewrite of ATLAS's threading system.  You will see only a small
difference for 2-processor machines, but on 4 and 8 processor machines,
the new threaded code can more than double your performance (assuming you
aren't on a loser OS like MacOS X or FreeBSD, that don't possess processor
affinity).  I posted some factorization timings at:
   http://math-atlas.sourceforge.net/timing/3_9_5/index.html

This code is all pretty new, and since I rewrote everything, we can probably
expect a segfault-fest, but it at least passes the sanity tests on Windows
and Linux.  It is still fairly rough, and as the timing page mentions, I need
to do some tuning for small-case factorizations.

Let me know if you use it,
Clint

ATLAS 3.9.5 released 12/11/08, Changes from 3.9.4:
   * Complete rewrite of ATLAS threading system:
     - Now supports native Windows threads in addition to pthreads
     - Use of master-last and affinity increases threaded performance, with
       an advantage that grows with P (almost no advantage for P=2, but for
       instance LU is more than 60% faster asymptotically on a P=8 Core2)
       + OS X and FreeBSD don't support processor affinity, and so their
         performance is still bad
   * Changed emit_buildinfo so that it replaces all control characters with
     spaces (prevents errors under Windows).
   * Added dependency info for ATL_ilaenv so that it is recompiled once
     LAPACK tuning is complete
   * Fixed error in configure where it issued commands in the wrong directory
     when the user builds LAPACK directly from a tarfile
   * Fixed typos in config.c where I used 'comp' rather than 'comps'.
   * Added mmtime_pt.c, which can allow us to find kernels that do well
     in parallel operation.
   * Various small configure fixes for Windows
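
The emit_buildinfo item above can be illustrated with a short sketch. This is
not the ATLAS source, which is not shown in this thread; the function name
blank_control_chars is hypothetical, and the only thing taken from the
changelog is the behavior (every control character becomes a space).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: replace every control character (tab, CR, LF, ...)
   in a NUL-terminated string with a space, in place.
   Returns the number of characters that were replaced. */
size_t blank_control_chars(char *s)
{
    size_t replaced = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* the cast avoids undefined behavior for negative char values */
        if (iscntrl((unsigned char)*s)) {
            *s = ' ';
            replaced++;
        }
    }
    return replaced;
}
```

Run over a string such as a compiler command line captured with a trailing
CR/LF pair, this leaves only printable characters and spaces behind.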

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************

------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you.  Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
_______________________________________________
Math-atlas-devel mailing list
Math-atlas-devel@...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel

Re: 3.9.5: much improved threading

by Ian Ollmann


On Dec 11, 2008, at 10:21 PM, Clint Whaley wrote:

> You will see only a small difference for 2-processor machines, but  
> on 4 and 8 processor machines, the new threaded code can more than  
> double your performance (assuming you aren't on a loser OS like
> MacOS X or FreeBSD, that don't possess processor affinity).

http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/



Re: 3.9.5: much improved threading

by Clint Whaley

Ian,

>> You will see only a small difference for 2-processor machines, but  
>> on 4 and 8 processor machines, the new threaded code can more than  
>> double your performance (assuming you aren't on a loser OS like
>> MacOS X or FreeBSD, that don't possess processor affinity).
>
>http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/

Thanks for the link!  I asked my contacts at Apple about this, and was told
you guys do not support affinity, so it is a relief to see you guys doing
something here.

Do you have a link to more documentation?  I'm finding this page a little
sparse on usage details.

I did some digging, and found that I can translate a pthread_t into a thread_t
using the pthread_mach_thread_np interface.  However, since pthreads start
executing when the pthread_create function is called, I guess I have to
start them up, and then change their affinity after they have started running
(the page suggests you create the threads, and then set their affinity before
they start running, and I don't see how this can be done in pthreads).  This
will blunt a lot of the advantage of affinity.

The docs I found suggest that you should not use thread_t threads directly, so
how are you supposed to start the thread up w/o starting it running?

I don't suppose you have, or are considering supporting, something like Linux
has, where you can modify the thread attribute using pthread_attr_setaffinity_np?
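
For reference, a minimal sketch of the Linux approach mentioned above,
assuming glibc (pthread_attr_setaffinity_np and the CPU_* macros are GNU
extensions and need _GNU_SOURCE).  The names worker_t, nth_allowed_cpu and
spawn_pinned are hypothetical and are not taken from ATLAS.  The affinity
mask is placed in the thread attribute, so each worker is already bound to
its CPU when its start routine begins to run.

```c
#define _GNU_SOURCE           /* exposes CPU_* macros and the _np calls */
#include <assert.h>
#include <pthread.h>
#include <sched.h>

#define MAX_THREADS 64

typedef struct {
    int cpu;     /* CPU this worker should be bound to */
    int bound;   /* set to 1 by the worker if its mask is exactly {cpu} */
} worker_t;

/* Worker: check that it is already restricted to its CPU when it starts. */
static void *worker(void *arg)
{
    worker_t *w = arg;
    cpu_set_t mine;

    assert(w != NULL);
    if (pthread_getaffinity_np(pthread_self(), sizeof(mine), &mine) == 0)
        w->bound = (CPU_COUNT(&mine) == 1 && CPU_ISSET(w->cpu, &mine));
    return NULL;
}

/* Return the id of the (k mod n)-th CPU in `allowed`, or -1 if it is empty. */
static int nth_allowed_cpu(const cpu_set_t *allowed, int k)
{
    int n = CPU_COUNT(allowed);

    if (n <= 0)
        return -1;
    k %= n;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, allowed) && k-- == 0)
            return cpu;
    return -1;
}

/* Create nthreads workers, each pinned to one CPU before it runs.
   Returns the number of workers that were NOT bound as requested,
   or -1 if setup or thread creation failed. */
int spawn_pinned(int nthreads)
{
    pthread_t tid[MAX_THREADS];
    worker_t w[MAX_THREADS];
    cpu_set_t allowed;
    int created = 0, failed = 0, unbound = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    /* Only pick CPUs this process is actually allowed to run on. */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int i = 0; i < nthreads && !failed; i++) {
        pthread_attr_t attr;
        cpu_set_t one;

        w[i].cpu = nth_allowed_cpu(&allowed, i);
        w[i].bound = 0;
        if (w[i].cpu < 0) {
            failed = 1;
            break;
        }
        CPU_ZERO(&one);
        CPU_SET(w[i].cpu, &one);
        pthread_attr_init(&attr);
        if (pthread_attr_setaffinity_np(&attr, sizeof(one), &one) != 0 ||
            pthread_create(&tid[i], &attr, worker, &w[i]) != 0)
            failed = 1;
        else
            created++;
        pthread_attr_destroy(&attr);
    }

    /* Join whatever was started, even if a later creation failed. */
    for (int i = 0; i < created; i++) {
        pthread_join(tid[i], NULL);
        unbound += !w[i].bound;
    }
    return failed ? -1 : unbound;
}
```

Calling spawn_pinned(4) from main (built with -pthread) should return 0 when
every worker came up bound to its own CPU.  When fewer CPUs are available
than threads, the sketch simply wraps around and shares CPUs.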

Can you supply some example code of using affinity with pthreads?  The page
has no example calls, and no mention of what values the thread affinity tag
can take, or what a given value would mean.
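
One way the create-then-tag sequence could look with pthreads, offered as a
sketch rather than as Apple's recommended usage.  It assumes the
non-portable pthread_create_suspended_np call from the Mac OS X <pthread.h>,
plus the Mach thread_policy_set interface with THREAD_AFFINITY_POLICY.  The
helper name start_tagged and the non-Apple fallback branch are hypothetical.
As the release note reads, the tag is a scheduling hint: threads sharing a
non-zero tag are grouped around a shared L2 cache, threads with different
tags are kept apart where possible, and nothing binds a thread to a
particular core.

```c
#include <assert.h>
#include <pthread.h>

#ifdef __APPLE__
#include <mach/mach.h>
#include <mach/thread_policy.h>
#endif

/* Worker: just record that it ran. */
static void *worker(void *arg)
{
    assert(arg != NULL);
    *(int *)arg = 1;
    return NULL;
}

/* Start a worker carrying the given affinity tag.
   Returns  1 if the thread started with the tag applied,
            0 if it started without it (tag rejected, or not Mac OS X),
           -1 if the thread could not be created. */
int start_tagged(pthread_t *t, int tag, int *ran)
{
#ifdef __APPLE__
    thread_affinity_policy_data_t policy = { tag };
    mach_port_t mt;
    kern_return_t kr;

    /* Create the pthread suspended so that it cannot run yet. */
    if (pthread_create_suspended_np(t, NULL, worker, ran) != 0)
        return -1;
    mt = pthread_mach_thread_np(*t);   /* pthread_t -> Mach thread port */
    kr = thread_policy_set(mt, THREAD_AFFINITY_POLICY,
                           (thread_policy_t)&policy,
                           THREAD_AFFINITY_POLICY_COUNT);
    thread_resume(mt);                 /* let it run, tagged or not */
    return kr == KERN_SUCCESS;
#else
    (void)tag;                         /* no affinity tags on this platform */
    return pthread_create(t, NULL, worker, ran) != 0 ? -1 : 0;
#endif
}
```

The caller joins the thread as usual with pthread_join.  A return value of 0
on Mac OS X means the kernel rejected the policy, which is worth checking
for, since not every machine supports it.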

I notice that it does not provide processor affinity, but rather something it
describes as L2 affinity.  I take this to mean that a thread will therefore
be allowed to migrate between processors that share a cache.  Our timings
indicate you will lose a lot of the performance gain in this case.  Do you
presently have any way to get, or plans to add support for, true processor
affinity?

Why did you not add processor affinity when you mucked about with this
L2 affinity?

Thanks,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************


Re: 3.9.5: much improved threading

by j murphy


--- On Fri, 12/12/08, Clint Whaley <whaley@...> wrote:

> From: Clint Whaley <whaley@...>
> Subject: [atlas-devel] 3.9.5: much improved threading
> To: math-atlas-devel@...
> Date: Friday, December 12, 2008, 1:21 AM
> Guys,
>
> I have finally gotten 3.9.5 out the door.  It's been several months, but
> I've actually been busy the whole time.  The big news is that I finally
> finished a complete rewrite of ATLAS's threading system.  You will see
> only a small difference for 2-processor machines, but on 4 and 8
> processor machines, the new threaded code can more than double your
> performance (assuming you aren't on a loser OS like MacOS X or FreeBSD,
> that don't possess processor affinity).

Clint:

I'm glad to see that you have a new version of ATLAS, and I'm looking
forward to trying it out.  Thank you for your efforts.

Regarding your comments about processor affinity in FreeBSD, from:

http://svn.freebsd.org/viewvc/base?view=revision&revision=176730

"Author: jeff
Date: Sun Mar 2 07:39:22 2008 UTC (9 months, 1 week ago)
Log Message: Add cpuset, an api for thread to cpu binding and cpu resource grouping
and assignment.
 - Add a reference to a struct cpuset in each thread that is inherited from
   the thread that created it.
 - Release the reference when the thread is destroyed.
 - Add prototypes for syscalls and macros for manipulating cpusets in
   sys/cpuset.h
 - Add syscalls to create, get, and set new numbered cpusets:
   cpuset(), cpuset_{get,set}id()
 - Add syscalls for getting and setting affinity masks for cpusets or
   individual threads: cpuid_{get,set}affinity()
 - Add types for the 'level' and 'which' parameters for the cpuset.  This
   will permit expansion of the api to cover cpu masks for other objects
   identifiable with an id_t integer.  For example, IRQs and Jails may be
   coming soon.
 - The root set 0 contains all valid cpus.  All thread initially belong to
   cpuset 1.  This permits migrating all threads off of certain cpus to
   reserve them for special applications.

Sponsored by:   Nokia
Discussed with: arch, rwatson, brooks, davidxu, deischen
Reviewed by:    antoine"

and from:

http://svn.freebsd.org/viewvc/base?view=revision&revision=180808

"Author: jhb
Date: Fri Jul 25 17:46:01 2008 UTC (4 months, 2 weeks ago)
Log Message: MFC: Add cpuset, an api for thread to cpu binding and cpu resource grouping
and assignment.  This is mostly synched up with what is in HEAD with the
following exceptions:
- I didn't MFC any of the interrupt binding stuff as it requires other
  changes and I figured this change was large enough as it is.
- The sched_affinity() implementation for ULE in HEAD depends on the newer
  CPU topology stuff as well as other changes in ULE.  Rather than
  backport all of that, I implemented sched_affinity() using the existing
  CPU topology and ULE code in 7.x.  Thus, any bugs in the ULE affinity
  stuff in 7 are purely my fault and not Jeff's.

Note that, just as in HEAD, cpusets currently don't work on SCHED_4BSD (the
syscalls will succeed, but they don't have any effect).

Tested by:      brooks, ps"

So I think that the functionality that you want, or something close to it,
has been present for some time in the development branches 7.1-STABLE,
7-STABLE, and 8-CURRENT of FreeBSD (which many people have been running,
not just active FreeBSD developers), and will be present in all releases
starting with 7.1, which should be out in a matter of weeks.

As you can see from the above, the authors of the slightly different
implementations in 8-CURRENT and 7*-STABLE are, respectively, Jeff
Roberson (jeff@...) and John Baldwin (jhb@...).  They may
be willing to help you with questions about how to best use their work.
(Jeff is also the primary author of the new ULE scheduler which is now the
default on the FreeBSD branches I mentioned above, and which must be used
with these tools.)

Or you can look at:

http://www.FreeBSD.org/cgi/man.cgi?query=cpuset&sektion=1&apropos=0&manpath=FreeBSD+8-current
http://www.freebsd.org/cgi/man.cgi?query=cpuset_getaffinity&sektion=2&manpath=FreeBSD+8-current
http://www.freebsd.org/cgi/man.cgi?query=cpuset_getid&sektion=2&manpath=FreeBSD+8-current

and the code in the FreeBSD Subversion or CVS repositories.
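
As a rough sketch of what the cpuset_getaffinity(2) man page linked above
describes, binding the calling thread to one CPU might look like the
following.  The call is spelled cpuset_setaffinity in that man page; the
wrapper name bind_self_to_cpu and the non-FreeBSD fallback are hypothetical,
and per the commit log quoted above the mask only takes effect under the ULE
scheduler.

```c
#include <assert.h>
#include <errno.h>

#ifdef __FreeBSD__
#include <sys/param.h>
#include <sys/cpuset.h>
#endif

/* Bind the calling thread to a single CPU.
   Returns 0 on success, -1 with errno set on failure. */
int bind_self_to_cpu(int cpu)
{
    assert(cpu >= 0);
#ifdef __FreeBSD__
    cpuset_t mask;

    if (cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* An id of -1 with CPU_WHICH_TID means the current thread. */
    return cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                              sizeof(mask), &mask);
#else
    errno = ENOSYS;       /* cpuset(2) is FreeBSD-specific */
    return -1;
#endif
}
```

Each worker would call bind_self_to_cpu(rank) as the first thing in its start
routine, so the binding happens after the thread is already running.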

Few people know more than you do how difficult it is to keep up with the
developments in many different operating systems, and to write code that
works well on all of them. Knowing this, I am surprised that you are not
more chary of describing FreeBSD as a "loser OS" just because you believe
(erroneously, as it turns out) that it lacks one feature that you want to
use.  You may point to earlier implementations in other operating systems,
and say that FreeBSD came late to the table, but then this is true of your
own code, isn't it?

By the way, NetBSD has had similar functionality in their 5_BETA and
-current branches since the end of June of this year:

http://netbsd.gw.com/cgi-bin/man-cgi?affinity++NetBSD-current
http://cvsweb.netbsd.org/bsdweb.cgi/src/lib/libpthread/?only_with_tag=MAIN

Regards,
         J. Murphy


     


Re: 3.9.5: much improved threading

by Clint Whaley

Ian,

>> You will see only a small difference for 2-processor machines, but  
>> on 4 and 8 processor machines, the new threaded code can more than  
>> double your performance (assuming you aren't on a loser OS like
>> MacOS X or FreeBSD, that don't possess processor affinity).
>
>http://developer.apple.com/releasenotes/Performance/RN-AffinityAPI/

I gave this a quick scope, and it appears to be inadequate for what we
need, if I am reading this page correctly.  It appears you guys are
following the horrible convention of calling one package a processor,
and that this page is then describing how you can use your affinity API to
ensure one thread per package, but not one thread per core.  Is
this the case?

If so, you definitely cannot do master last, and even persistent worker
will be messed up due to the scheduler moving things around within a
package.  Master last and processor affinity (processor == core,
processor != package) can make a huge difference, as you can see:
   http://math-atlas.sourceforge.net/timing/newThr395/index.html

You can read about the techniques themselves in our IPDPS paper:
   http://www.cs.utsa.edu/~whaley/papers/ettIEEE.pdf

Is there any chance Apple is going to provide core-level affinity sometime
soon?

Thanks,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************
