New libc malloc patch

View: New views
20 Messages — Rating Filter:   Alert me  
< Prev | 1 - 2 - 3 - 4 | Next >

Parent Message unknown Re: New libc malloc patch

by Johan Bucht :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Ugh, forgot that I used another smtp server than the one subscribed to
the list. So the mail shouldn't have made it to the list, if it has I
apologize. Remembered one more thing about locking granularity too.

Just a few comments, not sure if all of them are valid since I don't
have a recent enough system to apply the patch and test it, so my
comments are just based on reading the patch. I might have missed some
thing hidden between all the lines differing. Can you put up a patched
malloc.c somewhere?

I made a proof-of-concept parallel allocator for my Master's Thesis
during the spring and some of the comments are about differences and
similarities between our implementations.

* Fork safety functions
Nice to have for all allocators and is something I missed having. Would
like to have them regardless if your malloc becomes standard or not.

* magazines, per-arena caches
If I didn't miss something you don't seem to have it, pardon if I missed
them in my quick readings. They can really help the scalability and is
something that is implemented in Solaris libumem allocator with great
success. I implemented them and they helped bring the scalability up
quite a bit on the system I tested my allocator on aswell as bringing
the allocation latency down.

* Saw you used a tree to look up the allocation sizes at free, this
* differs from how I and Solaris libumem does it. Instead I went for
* prepending a 8/16 byte tag containing the size of the allocation for
* lockless size lookup.

* Bucket sizes
If I didn't miss something you are using power of two sizes. Might be
better to use smaller sizes to avoid internal fragmentation, aswell as
improving locking granularity.

* Locking granularity
You use a single lock for the chunk allocation instead of a lock per
bucket size or something similar. This means that you limit access
unless threads allocate from their private arenas. Since you hash to get
an arena there might be multiple threads accessing the same arena at the
same time introducing contention. Might be better to have a lock per
allocation size to somewhat improve the situation.

* Locking primitive
The biggest issue and as David Xu pointed out is probably the locking
primitives. The SPINLOCK use has a limit in the threading library and
makes is hard to have a lot of mutexes. I ended up using a wrapper
around the umtx_lock function to get recursive mutexes and it would
probably be better to extend the umtx functions to handle recursion.
This would probably also be appreciated by other malloc implementations.
Might be interesting to implement some of the ideas from the Linux futex
implementation to help umtx.

* sbrk vs mmap
See you added the ability to try brk first and try mmap if brk fails.
This is similar to my implementation aswell as ptmalloc. Looks good.

* Big allocations
It might be preferable to use mmap/munmap directly for big allocations
(bigger than a few pages) to avoid the overhead of trying to be smart.
These big allocations should be pretty few and shouldn't be that
pessimized by this. It also means some memory savings as you can
directly free memory from big allocations.

* Ownership
Didn't look that much for it so might have missed it. Basicly what
happens if you free memory allocated in another arena, do you return
memory to the arena the memory was allocated from or to the current
threads arena? Not returning it to the original arena leads to cache
line sharing. Returning it to the original adds some overhead and it
might be argued that applications doing this might not be considered
scalable and should be fixed instead.

* Standalone version
Would be nice to have a standalone version to test the allocator on
solaris and linux to compare against their allocators. Would be nice for
all the people with only access to SMP machines running another OS. I
tested my allocator on both Solaris 9 and Freebsd 5/6 and it was both
helpful in testing scalability and impact on small machines aswell as
finding bugs in the implementation. I tested my allocator against
Solaris libc, Hoard and libumem on a 4-cpu Solaris 9 machine (4x
UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university and tested my
allocator vs the FreeBSD libc on my 400MHz laptop with too much beer
and almost non working cpu fan. Beer as in standing in 10cm of beer in
my backpack =).
Not sure how much of the implementation is FreeBSD dependant, RB-trees?,
Solaris 9 has the benefit of using a local malloc inside libc which
makes the pthread library functions accessible from the malloc
implementation. I avoided touching the libc malloc as I really didn't
care at all about single threaded apps and simply used LD_PRELOAD to
override libc malloc. Basicly I throw away memory if you allocate a
multiple of the page size since I add a tag prepending the memory and I
allocate whole pages. So a 8192 byte allocation becomes a 8200 byte
allocation, wasting a page.

* Focus - Single thread vs multithread
What are the focus of the allocator, how much memory is it worth to
waste to implement a good scalable implementation. This is the reason I
think Solaris still keeps a serial allocator in libc and makes libumem
accessible through LD_PRELOAD and friends.

* Benchmarks
Would be really nice with a few benchmarks against phkmalloc, ptmalloc,
hoard and libumem to see how well it stands. Pure local arena allocation
apps, Larson benchmarks and some other. Might also be good to see how
much more memory it consumes compared to phkmalloc.

* Comment
Might be wise to fix the comment about always using mmap since brk can
be used.

* Conclussion
Despite not having tested it it looks good and would be nice to have it
in FreeBSD. Some parts are really independent on the allocator and would
be nice to have, regardless of the general consensus.

--
/Johan Bucht
bucht@...
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Dan Nelson :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

In the last episode (Dec 11), Jason Evans said:

> On Dec 11, 2005, at 4:35 PM, Julian Elischer wrote:
> > Is there no way to make it an option for a while? that would get
> > good testing AND a fallback for people.
>
> Unfortunately, there are some low level issues that make the two
> malloc implementations incompatible, and they both need access to
> libc internals in order to work correctly in a multi-threaded
> program.  The way I have been comparing the two implementations is
> via chroot installations.  It might be possible to make the two
> compatible (would require extra coding), but since both of them need
> to be part of libc, we would need a way of building separate libc
> libraries for the two mallocs.  This all seems uglier than it's worth
> to me.  Maybe there's another way...

I have had good results by simply compiling malloc.c into a shared
library and loading it via LD_PRELOAD.  Works enough to run mozilla at
least.


--
        Dan Nelson
        dnelson@...
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Jason Evans :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Johan,

Thank you for your code review!  Specific responses follow.

On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:
>
> Just a few comments, not sure if all of them are valid since I don't
> have a recent enough system to apply the patch and test it, so my
> comments are just based on reading the patch. I might have missed some
> thing hidden between all the lines differing. Can you put up a patched
> malloc.c somewhere?

http://www.canonware.com/~jasone/jemalloc/malloc.c

> * Fork safety functions
> Nice to have for all allocators and is something I missed having.  
> Would
> like to have them regardless if your malloc becomes standard or not.

I think that the implementation is currently fork-safe.  The threads  
libraries call _malloc_prefork() and _malloc_postfork() in order to  
avoid locking issues.

> * magazines, per-arena caches
> If I didn't miss something you don't seem to have it, pardon if I  
> missed
> them in my quick readings. They can really help the scalability and is
> something that is implemented in Solaris libumem allocator with great
> success. I implemented them and they helped bring the scalability up
> quite a bit on the system I tested my allocator on aswell as bringing
> the allocation latency down.

Arenas do use caches for objects below a size threshold (currently  
~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems).  
The caching mechanism is quite different than magazines, but still  
very effective.

> * Saw you used a tree to look up the allocation sizes at free, this
> * differs from how I and Solaris libumem does it. Instead I went for
> * prepending a 8/16 byte tag containing the size of the allocation for
> * lockless size lookup.

Actually, my implementation prepends a 4 byte tag that contains  
allocated/free flags, and the size of the allocation.  The trees are  
only used to look up the sizes of huge allocations (> 8 MB on 32-bit  
platforms, and > 256 MB on 64-bit platforms).  Huge allocations start  
at chunk boundaries, so prepending a tag would require an entire  
extra page (and would cause potentially serious external  
fragmentation on 32-bit systems).

> * Bucket sizes
> If I didn't miss something you are using power of two sizes. Might be
> better to use smaller sizes to avoid internal fragmentation, aswell as
> improving locking granularity.

My implementation uses quantum-sized buckets (not power of two size  
classes).  The quantum size is either 8 or 16 bytes, depending on the  
machine architecture.

> * Locking granularity
> You use a single lock for the chunk allocation instead of a lock per
> bucket size or something similar. This means that you limit access
> unless threads allocate from their private arenas. Since you hash  
> to get
> an arena there might be multiple threads accessing the same arena  
> at the
> same time introducing contention. Might be better to have a lock per
> allocation size to somewhat improve the situation.

Using a lock per allocation size requires that slower algorithms be  
used in some places (and it certainly complicates certain other  
operations).  This didn't seem like a worthwhile tradeoff to me.  By  
making the code faster and simpler, less time is spent in the malloc  
code.  Unless an app does nothing but malloc/free, I don't expect  
this to be an issue.  Even microbenchmarks that do nothing but malloc/
free don't appear to suffer.

> * Locking primitive
> The biggest issue and as David Xu pointed out is probably the locking
> primitives. The SPINLOCK use has a limit in the threading library and
> makes is hard to have a lot of mutexes. I ended up using a wrapper
> around the umtx_lock function to get recursive mutexes and it would
> probably be better to extend the umtx functions to handle recursion.
> This would probably also be appreciated by other malloc  
> implementations.
> Might be interesting to implement some of the ideas from the Linux  
> futex
> implementation to help umtx.

I have been contemplating creating a separate spinlock API that  
doesn't require the threads library to track the spinlocks across  
fork.  This would (if I understand correctly) remove the current  
static spinlock limitations.

As for supporting recursive spinlocks, I doubt that the overhead  
would be acceptable in general.  If I could get rid of the need for  
the one recursive lock in malloc.c, I certainly would. =)

> * Big allocations
> It might be preferable to use mmap/munmap directly for big allocations
> (bigger than a few pages) to avoid the overhead of trying to be smart.
> These big allocations should be pretty few and shouldn't be that
> pessimized by this. It also means some memory savings as you can
> directly free memory from big allocations.

I don't really see the approach I'm taking as "trying to be smart",  
so much as "making it simple by using the same algorithm for almost  
everything".  Many other malloc implementations I've examined use  
three or more allocation approaches, depending on the allocation  
size.  jemalloc uses only two, and huge allocations really are huge,  
such that they rarely if ever happen.  In practice, I don't think  
there is a real memory savings to be had.  Besides, madvise() calls  
in strategic locations would achieve approximately the same result.

> * Ownership
> Didn't look that much for it so might have missed it. Basicly what
> happens if you free memory allocated in another arena, do you return
> memory to the arena the memory was allocated from or to the current
> threads arena? Not returning it to the original arena leads to cache
> line sharing. Returning it to the original adds some overhead and it
> might be argued that applications doing this might not be considered
> scalable and should be fixed instead.

Allocated objects can be associated with their parent arenas in  
constant time (by masking bits in the object addresses).  This makes  
it possible to always free memory back to the same arena it came from.

> * Standalone version
> Would be nice to have a standalone version to test the allocator on
> solaris and linux to compare against their allocators. Would be  
> nice for
> all the people with only access to SMP machines running another OS. I
> tested my allocator on both Solaris 9 and Freebsd 5/6 and it was both
> helpful in testing scalability and impact on small machines aswell as
> finding bugs in the implementation. I tested my allocator against
> Solaris libc, Hoard and libumem on a 4-cpu Solaris 9 machine (4x
> UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university and tested my
> allocator vs the FreeBSD libc on my 400MHz laptop with too much beer
> and almost non working cpu fan. Beer as in standing in 10cm of beer in
> my backpack =).
> Not sure how much of the implementation is FreeBSD dependant, RB-
> trees?,

I actually have a separate more portable implementation of jemalloc  
that is part of a programming language runtime library (jemalloc is  
simply a FreeBSD-ized stand-alone version of that allocator).  I've  
tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris.  On Linux I've  
compared against ptmalloc2, dlmalloc, and hoard.  Unfortunately, that  
version of the allocator has to make some performance sacrifices  
regarding locking, since it doesn't have access to the pthreads  
internals.  As such, it requires a lot of care to interpret  
performance results, so I haven't felt it to be worthwhile providing  
that version here.

> * Focus - Single thread vs multithread
> What are the focus of the allocator, how much memory is it worth to
> waste to implement a good scalable implementation. This is the  
> reason I
> think Solaris still keeps a serial allocator in libc and makes libumem
> accessible through LD_PRELOAD and friends.

jemalloc only creates two arenas for single-threaded applications  
(one for internal memory management, and one for user requests), so  
it doesn't really sacrifice much in the case of single-threaded  
programs -- a few pages maybe.  In the case of multi-threaded  
programs, it potentially sacrifices memory compactness in order to  
reduce contention and cache line sharing.

> * Comment
> Might be wise to fix the comment about always using mmap since brk can
> be used.

Thanks, done.

Jason
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Doug Barton-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Jason Evans wrote:

> It would be very informative to run benchmarks with real world
> multithreaded apps.  bind9 (built with threading support) would be a
> great candidate,

Maybe someday, but not at the moment. I've been told by the folks at ISC
that in it's current form, threading is a pessimization on all versions of
BIND 9 prior to the 9.4 code that they have in CVS (which has not been
generally released). Thus, I would not expect performance of BIND 9 with
threads to be an accurate reflection of the effectiveness of the library.

hth,

Doug

--

        If you're never wrong, you're not trying hard enough
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Parent Message unknown Re: New libc malloc patch

by dougb-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Jason Evans wrote:

> It would be very informative to run benchmarks with real world
> multithreaded apps.  bind9 (built with threading support) would be a
> great candidate,

Maybe someday, but not at the moment. I've been told by the folks at ISC
that in it's current form, threading is a pessimization on all versions of
BIND 9 prior to the 9.4 code that they have in CVS (which has not been
generally released). Thus, I would not expect performance of BIND 9 with
threads to be an accurate reflection of the effectiveness of the library.

hth,

Doug

--

     This .signature sanitized for your protection


_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Johan Bucht :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Jason Evans wrote:

> Johan,
>
> Thank you for your code review!  Specific responses follow.

Wouldn't exactly call scrolling through a diff at 3 AM review =).

>
> On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:
>
>>
>> Just a few comments, not sure if all of them are valid since I don't
>> have a recent enough system to apply the patch and test it, so my
>> comments are just based on reading the patch. I might have missed some
>> thing hidden between all the lines differing. Can you put up a patched
>> malloc.c somewhere?
>
>
> http://www.canonware.com/~jasone/jemalloc/malloc.c
>
Thanks!

>> * Fork safety functions
>> Nice to have for all allocators and is something I missed having.  Would
>> like to have them regardless if your malloc becomes standard or not.
>
>
> I think that the implementation is currently fork-safe.  The threads  
> libraries call _malloc_prefork() and _malloc_postfork() in order to  
> avoid locking issues.
>
Hmm, I meant that the _malloc_prefork() functions are independent from
your malloc allocation and I would like them committed regardless as
they make the life easier for other malloc implementation.

>> * magazines, per-arena caches
>> If I didn't miss something you don't seem to have it, pardon if I  
>> missed
>> them in my quick readings. They can really help the scalability and is
>> something that is implemented in Solaris libumem allocator with great
>> success. I implemented them and they helped bring the scalability up
>> quite a bit on the system I tested my allocator on aswell as bringing
>> the allocation latency down.
>
>
> Arenas do use caches for objects below a size threshold (currently  
> ~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems).  
> The caching mechanism is quite different than magazines, but still  
> very effective.
>
Ah nice, shows how little review I did. =)
Will look a bit more at this then.

>> * Saw you used a tree to look up the allocation sizes at free, this
>> * differs from how I and Solaris libumem does it. Instead I went for
>> * prepending a 8/16 byte tag containing the size of the allocation for
>> * lockless size lookup.
>
>
> Actually, my implementation prepends a 4 byte tag that contains  
> allocated/free flags, and the size of the allocation.  The trees are  
> only used to look up the sizes of huge allocations (> 8 MB on 32-bit  
> platforms, and > 256 MB on 64-bit platforms).  Huge allocations start  
> at chunk boundaries, so prepending a tag would require an entire  
> extra page (and would cause potentially serious external  
> fragmentation on 32-bit systems).
>
Isn't 8 byte alignment expected by some applications?
How do you know if a allocation is huge if you don't have a tag?
Something more to read up on i guess. =)


>> * Bucket sizes
>> If I didn't miss something you are using power of two sizes. Might be
>> better to use smaller sizes to avoid internal fragmentation, aswell as
>> improving locking granularity.
>
>
> My implementation uses quantum-sized buckets (not power of two size  
> classes).  The quantum size is either 8 or 16 bytes, depending on the  
> machine architecture.
>
Ahh sorry, just saw the pow stuff and assumed it was power of two for
the buckets.

>> * Locking granularity
>> You use a single lock for the chunk allocation instead of a lock per
>> bucket size or something similar. This means that you limit access
>> unless threads allocate from their private arenas. Since you hash  to
>> get
>> an arena there might be multiple threads accessing the same arena  at
>> the
>> same time introducing contention. Might be better to have a lock per
>> allocation size to somewhat improve the situation.
>
>
> Using a lock per allocation size requires that slower algorithms be  
> used in some places (and it certainly complicates certain other  
> operations).  This didn't seem like a worthwhile tradeoff to me.  By  
> making the code faster and simpler, less time is spent in the malloc  
> code.  Unless an app does nothing but malloc/free, I don't expect  
> this to be an issue.  Even microbenchmarks that do nothing but malloc/
> free don't appear to suffer.
>
Yeah, not sure how much of an impact it has on your allocator it has,
just something that improved my allocator as I move magazines between
local arenas and the global.

>> * Locking primitive
>> The biggest issue and as David Xu pointed out is probably the locking
>> primitives. The SPINLOCK use has a limit in the threading library and
>> makes is hard to have a lot of mutexes. I ended up using a wrapper
>> around the umtx_lock function to get recursive mutexes and it would
>> probably be better to extend the umtx functions to handle recursion.
>> This would probably also be appreciated by other malloc  
>> implementations.
>> Might be interesting to implement some of the ideas from the Linux  
>> futex
>> implementation to help umtx.
>
>
> I have been contemplating creating a separate spinlock API that  
> doesn't require the threads library to track the spinlocks across  
> fork.  This would (if I understand correctly) remove the current  
> static spinlock limitations.
>
> As for supporting recursive spinlocks, I doubt that the overhead  
> would be acceptable in general.  If I could get rid of the need for  
> the one recursive lock in malloc.c, I certainly would. =)
>
Yeah, made some attempts to remove my recursive locks in the global
arena but it made the code more complicated and didn't really do much.

>> * Big allocations
>> It might be preferable to use mmap/munmap directly for big allocations
>> (bigger than a few pages) to avoid the overhead of trying to be smart.
>> These big allocations should be pretty few and shouldn't be that
>> pessimized by this. It also means some memory savings as you can
>> directly free memory from big allocations.
>
>
> I don't really see the approach I'm taking as "trying to be smart",  
> so much as "making it simple by using the same algorithm for almost  
> everything".  Many other malloc implementations I've examined use  
> three or more allocation approaches, depending on the allocation  
> size.  jemalloc uses only two, and huge allocations really are huge,  
> such that they rarely if ever happen.  In practice, I don't think  
> there is a real memory savings to be had.  Besides, madvise() calls  
> in strategic locations would achieve approximately the same result.
>
Yeah, saving them for future use through madvise might be better. I
chose the route to keep the big allocations simple and avoid caching and
looking up stuff for.
Basicly read the size and if it's big munmap it directly.

>> * Ownership
>> Didn't look that much for it so might have missed it. Basicly what
>> happens if you free memory allocated in another arena, do you return
>> memory to the arena the memory was allocated from or to the current
>> threads arena? Not returning it to the original arena leads to cache
>> line sharing. Returning it to the original adds some overhead and it
>> might be argued that applications doing this might not be considered
>> scalable and should be fixed instead.
>
>
> Allocated objects can be associated with their parent arenas in  
> constant time (by masking bits in the object addresses).  This makes  
> it possible to always free memory back to the same arena it came from.
>
Ok, good.

>> * Standalone version
>> Would be nice to have a standalone version to test the allocator on
>> solaris and linux to compare against their allocators. Would be  nice
>> for
>> all the people with only access to SMP machines running another OS. I
>> tested my allocator on both Solaris 9 and Freebsd 5/6 and it was both
>> helpful in testing scalability and impact on small machines aswell as
>> finding bugs in the implementation. I tested my allocator against
>> Solaris libc, Hoard and libumem on a 4-cpu Solaris 9 machine (4x
>> UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university and tested my
>> allocator vs the FreeBSD libc on my 400MHz laptop with too much beer
>> and almost non working cpu fan. Beer as in standing in 10cm of beer in
>> my backpack =).
>> Not sure how much of the implementation is FreeBSD dependant, RB-
>> trees?,
>
>
> I actually have a separate more portable implementation of jemalloc  
> that is part of a programming language runtime library (jemalloc is  
> simply a FreeBSD-ized stand-alone version of that allocator).  I've  
> tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris.  On Linux I've  
> compared against ptmalloc2, dlmalloc, and hoard.  Unfortunately, that  
> version of the allocator has to make some performance sacrifices  
> regarding locking, since it doesn't have access to the pthreads  
> internals.  As such, it requires a lot of care to interpret  
> performance results, so I haven't felt it to be worthwhile providing  
> that version here.
>
ok

>> * Focus - Single thread vs multithread
>> What are the focus of the allocator, how much memory is it worth to
>> waste to implement a good scalable implementation. This is the  reason I
>> think Solaris still keeps a serial allocator in libc and makes libumem
>> accessible through LD_PRELOAD and friends.
>
>
> jemalloc only creates two arenas for single-threaded applications  
> (one for internal memory management, and one for user requests), so  
> it doesn't really sacrifice much in the case of single-threaded  
> programs -- a few pages maybe.  In the case of multi-threaded  
> programs, it potentially sacrifices memory compactness in order to  
> reduce contention and cache line sharing.
>
As I have not tested it I take your word. Mostly wanted to know where
you stand in the issue of pessimizing single-threaded apps vs MT. And if
it's worth keeping phkmalloc for some apps.

>> * Comment
>> Might be wise to fix the comment about always using mmap since brk can
>> be used.
>
>
> Thanks, done.
>
> Jason


No problem, always fun to have some code that you actually understand
the idea behind and see how it's implemented.

/Johan Bucht
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Johan Bucht :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Johan Bucht wrote:

>>> * Fork safety functions
>>> Nice to have for all allocators and is something I missed having.  
>>> Would
>>> like to have them regardless if your malloc becomes standard or not.
>>
>>
>>
>> I think that the implementation is currently fork-safe.  The threads  
>> libraries call _malloc_prefork() and _malloc_postfork() in order to  
>> avoid locking issues.
>>
> Hmm, I meant that the _malloc_prefork() functions are independent from
> your malloc allocation and I would like them committed regardless as
> they make the life easier for other malloc implementation.


Bah, the libc bits not the actual functions of course. =)
Should probably go to sleep instead of writing semi-understandable
sentences.

/Johan Bucht
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Jason Evans :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Dec 11, 2005, at 7:16 PM, Johan Bucht wrote:

> Jason Evans wrote:
>>
>> On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:
>>> * Saw you used a tree to look up the allocation sizes at free, this
>>> * differs from how I and Solaris libumem does it. Instead I went for
>>> * prepending a 8/16 byte tag containing the size of the  
>>> allocation for
>>> * lockless size lookup.
>>
>> Actually, my implementation prepends a 4 byte tag that contains  
>> allocated/free flags, and the size of the allocation.  The trees  
>> are  only used to look up the sizes of huge allocations (> 8 MB on  
>> 32-bit  platforms, and > 256 MB on 64-bit platforms).  Huge  
>> allocations start  at chunk boundaries, so prepending a tag would  
>> require an entire  extra page (and would cause potentially serious  
>> external  fragmentation on 32-bit systems).
>>
> Isn't 8 byte alignment expected by some applications?

Yes, 8 or 16 byte alignment is expected (in fact I'm of the opinion  
that we should be using 16 byte alignment on i386 due to SSE2/SSE3).  
If I understand your question correctly, you're asking how I get away  
with 4 byte tags.  Consider that (assuming 8-byte quantum) a malloc
(16) call must actually be serviced by at least 24 bytes internally,  
in order to pad to the next quantum size.  If I used 8 byte tags,  
then malloc(17) would require 32 bytes internally.  By using 4 byte  
tags, malloc(13)..malloc(20) can be serviced by 24 bytes internally.  
At least one implementation (the OS X malloc implementation) uses 2  
byte tags, but this has two problems: 1) unaligned memory access is  
slow on some architectures, and 2) it's too small to handle large  
object sizes, so a different allocation strategy has to be used  
starting at ~12 KB.

> How do you know if a allocation is huge if you don't have a tag?

I know an allocation is huge if its base address is chunk-aligned.  
The actual size is stored in a red-black tree node, which is  
allocated separately.

Jason
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Johan Bucht :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Jason Evans wrote:

>> Isn't 8 byte alignment expected by some applications?
>
>
> Yes, 8 or 16 byte alignment is expected (in fact I'm of the opinion  
> that we should be using 16 byte alignment on i386 due to SSE2/SSE3).  
> If I understand your question correctly, you're asking how I get away  
> with 4 byte tags.  Consider that (assuming 8-byte quantum) a malloc
> (16) call must actually be serviced by at least 24 bytes internally,  
> in order to pad to the next quantum size.  If I used 8 byte tags,  
> then malloc(17) would require 32 bytes internally.  By using 4 byte  
> tags, malloc(13)..malloc(20) can be serviced by 24 bytes internally.  
> At least one implementation (the OS X malloc implementation) uses 2  
> byte tags, but this has two problems: 1) unaligned memory access is  
> slow on some architectures, and 2) it's too small to handle large  
> object sizes, so a different allocation strategy has to be used  
> starting at ~12 KB.
>
Well, I just wondered how you avoided unaligned accesses with a 4 byte tag.

>> How do you know if a allocation is huge if you don't have a tag?
>
>
> I know an allocation is huge if its base address is chunk-aligned.  
> The actual size is stored in a red-black tree node, which is  
> allocated separately.

Ok, expected it was through the address, thanks for answering anyway.  
Gonna take some time reading through the code before asking more stupid
questions. =)

/Johan Bucht
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Kris Kennaway :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Sun, Dec 11, 2005 at 08:29:07PM -0500, Kris Kennaway wrote:

> I'll try to test this on a 4 CPU amd64 machine next.

phkmalloc:

# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
 Thread 5298176 adjusted timing: 4.173052 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
 Thread 5299200 adjusted timing: 325.108643 seconds for 10000000 requests of 1024 bytes.
 Thread 5298176 adjusted timing: 325.202485 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
 Thread 5414912 adjusted timing: 1133.238459 seconds for 10000000 requests of 1024 bytes.
 Thread 5299200 adjusted timing: 1134.525255 seconds for 10000000 requests of 1024 bytes.
 Thread 5298176 adjusted timing: 1134.539555 seconds for 10000000 requests of 1024 bytes.

jemalloc:

# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
 Thread 1073760528 adjusted timing: 3.777175 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
 Thread 1073760560 adjusted timing: 3.851702 seconds for 10000000 requests of 1024 bytes.
 Thread 1073761584 adjusted timing: 3.887943 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
 Thread 1073760528 adjusted timing: 3.866206 seconds for 10000000 requests of 1024 bytes.
 Thread 1073761552 adjusted timing: 13.382795 seconds for 10000000 requests of 1024 bytes.
 Thread 1073762688 adjusted timing: 14.407229 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 4
Starting test with 4 threads...
 Thread 1073760528 adjusted timing: 3.782923 seconds for 10000000 requests of 1024 bytes.
 Thread 1073763792 adjusted timing: 6.668655 seconds for 10000000 requests of 1024 bytes.
 Thread 1073762688 adjusted timing: 14.346569 seconds for 10000000 requests of 1024 bytes.
 Thread 1073761584 adjusted timing: 14.680211 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 5
Starting test with 5 threads...
 Thread 1073760560 adjusted timing: 4.748248 seconds for 10000000 requests of 1024 bytes.
 Thread 1073761584 adjusted timing: 9.898153 seconds for 10000000 requests of 1024 bytes.
 Thread 1073764896 adjusted timing: 13.019884 seconds for 10000000 requests of 1024 bytes.
 Thread 1073762688 adjusted timing: 15.326908 seconds for 10000000 requests of 1024 bytes.
 Thread 1073763792 adjusted timing: 15.442164 seconds for 10000000 requests of 1024 bytes.

So it's 1.1 times faster for single-threaded, and 107 times faster
with 3 threads.

With libthr instead of libpthread:

phkmalloc:

# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
 Thread 5255680 adjusted timing: 2.357247 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
 Thread 5256192 adjusted timing: 10.964918 seconds for 10000000 requests of 1024 bytes.
 Thread 5255680 adjusted timing: 11.001288 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
 Thread 5255680 adjusted timing: 17.467754 seconds for 10000000 requests of 1024 bytes.
 Thread 5256704 adjusted timing: 17.724583 seconds for 10000000 requests of 1024 bytes.
 Thread 5256192 adjusted timing: 17.913381 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 4
Starting test with 4 threads...
 Thread 5255680 adjusted timing: 42.715420 seconds for 10000000 requests of 1024 bytes.
 Thread 5256192 adjusted timing: 43.481252 seconds for 10000000 requests of 1024 bytes.
 Thread 5256704 adjusted timing: 43.871452 seconds for 10000000 requests of 1024 bytes.
 Thread 5257216 adjusted timing: 43.887820 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 5
Starting test with 5 threads...
 Thread 5255680 adjusted timing: 139.316332 seconds for 10000000 requests of 1024 bytes.
 Thread 5257216 adjusted timing: 140.117720 seconds for 10000000 requests of 1024 bytes.
 Thread 5256192 adjusted timing: 140.134057 seconds for 10000000 requests of 1024 bytes.
 Thread 5256704 adjusted timing: 140.855289 seconds for 10000000 requests of 1024 bytes.
 Thread 5257728 adjusted timing: 140.865934 seconds for 10000000 requests of 1024 bytes.

jemalloc:

# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
 Thread 1073742416 adjusted timing: 1.366353 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
 Thread 1073742416 adjusted timing: 1.429485 seconds for 10000000 requests of 1024 bytes.
 Thread 1073742896 adjusted timing: 1.530733 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
 Thread 1073742416 adjusted timing: 1.419813 seconds for 10000000 requests of 1024 bytes.
 Thread 1073743376 adjusted timing: 1.432790 seconds for 10000000 requests of 1024 bytes.
 Thread 1073742896 adjusted timing: 1.490218 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 4
Starting test with 4 threads...
 Thread 1073743376 adjusted timing: 1.447554 seconds for 10000000 requests of 1024 bytes.
 Thread 1073742416 adjusted timing: 1.503659 seconds for 10000000 requests of 1024 bytes.
 Thread 1073743856 adjusted timing: 1.503937 seconds for 10000000 requests of 1024 bytes.
 Thread 1073742896 adjusted timing: 1.504926 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 5
Starting test with 5 threads...
 Thread 1073743376 adjusted timing: 1.595239 seconds for 10000000 requests of 1024 bytes.
 Thread 1073742896 adjusted timing: 1.689753 seconds for 10000000 requests of 1024 bytes.
 Thread 1073742416 adjusted timing: 1.750115 seconds for 10000000 requests of 1024 bytes.
 Thread 1073744336 adjusted timing: 1.744271 seconds for 10000000 requests of 1024 bytes.
 Thread 1073743856 adjusted timing: 1.890269 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 6
Starting test with 6 threads...
 Thread 1073743856 adjusted timing: 1.847653 seconds for 10000000 requests of 1024 bytes.
 Thread 1073742416 adjusted timing: 2.018481 seconds for 10000000 requests of 1024 bytes.
 Thread 1073743376 adjusted timing: 2.059817 seconds for 10000000 requests of 1024 bytes.
 Thread 1073742896 adjusted timing: 2.129204 seconds for 10000000 requests of 1024 bytes.
 Thread 1073744336 adjusted timing: 2.223751 seconds for 10000000 requests of 1024 bytes.
 Thread 1073744816 adjusted timing: 2.293809 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 20
Starting test with 20 threads...
 Thread 1073744816 adjusted timing: 5.113769 seconds for 10000000 requests of 1024 bytes.
 Thread 1073751136 adjusted timing: 4.973369 seconds for 10000000 requests of 1024 bytes.
 Thread 1073750176 adjusted timing: 5.295912 seconds for 10000000 requests of 1024 bytes.
 Thread 1073745296 adjusted timing: 5.502331 seconds for 10000000 requests of 1024 bytes.
 Thread 1073743856 adjusted timing: 5.614890 seconds for 10000000 requests of 1024 bytes.
 Thread 1073744336 adjusted timing: 5.608690 seconds for 10000000 requests of 1024 bytes.
 Thread 1073752096 adjusted timing: 5.555465 seconds for 10000000 requests of 1024 bytes.
 Thread 1073748736 adjusted timing: 5.650922 seconds for 10000000 requests of 1024 bytes.
 Thread 1073748256 adjusted timing: 6.608054 seconds for 10000000 requests of 1024 bytes.
 Thread 1073750656 adjusted timing: 7.144998 seconds for 10000000 requests of 1024 bytes.
 Thread 1073742896 adjusted timing: 7.390905 seconds for 10000000 requests of 1024 bytes.
 Thread 1073746256 adjusted timing: 7.364728 seconds for 10000000 requests of 1024 bytes.
 Thread 1073742416 adjusted timing: 7.556064 seconds for 10000000 requests of 1024 bytes.
 Thread 1073749216 adjusted timing: 7.357179 seconds for 10000000 requests of 1024 bytes.
 Thread 1073752576 adjusted timing: 7.349483 seconds for 10000000 requests of 1024 bytes.
c Thread 1073747776 adjusted timing: 7.375179 seconds for 10000000 requests of 1024 bytes.
 Thread 1073751616 adjusted timing: 7.557854 seconds for 10000000 requests of 1024 bytes.
 Thread 1073743376 adjusted timing: 7.915978 seconds for 10000000 requests of 1024 bytes.
 Thread 1073749696 adjusted timing: 7.795219 seconds for 10000000 requests of 1024 bytes.
 Thread 1073745776 adjusted timing: 8.007392 seconds for 10000000 requests of 1024 bytes.

So libthr is *much* faster than libpthread with both malloc
implementations, but jemalloc is still 1.7 times faster for 1 thread
and 80 times faster for 5 threads than phkmalloc.

Kris

P.S. Holy crap :)


attachment0 (194 bytes) Download Attachment

Re: New libc malloc patch

by Johan Bucht :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Kris Kennaway wrote:

>On Sun, Dec 11, 2005 at 08:29:07PM -0500, Kris Kennaway wrote:
>
>  
>
>>I'll try to test this on a 4 CPU amd64 machine next.
>>    
>>
>
>  
>
Thanks for the time.

>phkmalloc:
>
># ./malloc-test 1024 10000000 1
>Starting test with 1 thread...
> Thread 5298176 adjusted timing: 4.173052 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 2
>Starting test with 2 threads...
> Thread 5299200 adjusted timing: 325.108643 seconds for 10000000 requests of 1024 bytes.
> Thread 5298176 adjusted timing: 325.202485 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 3
>Starting test with 3 threads...
> Thread 5414912 adjusted timing: 1133.238459 seconds for 10000000 requests of 1024 bytes.
> Thread 5299200 adjusted timing: 1134.525255 seconds for 10000000 requests of 1024 bytes.
> Thread 5298176 adjusted timing: 1134.539555 seconds for 10000000 requests of 1024 bytes.
>
>  
>
Those times seems way too high even for a serial allocator, is
libpthread performance really this bad on amd64 or is it broken?

>jemalloc:
>
># ./malloc-test 1024 10000000 1
>Starting test with 1 thread...
> Thread 1073760528 adjusted timing: 3.777175 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 2
>Starting test with 2 threads...
> Thread 1073760560 adjusted timing: 3.851702 seconds for 10000000 requests of 1024 bytes.
> Thread 1073761584 adjusted timing: 3.887943 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 3
>Starting test with 3 threads...
> Thread 1073760528 adjusted timing: 3.866206 seconds for 10000000 requests of 1024 bytes.
> Thread 1073761552 adjusted timing: 13.382795 seconds for 10000000 requests of 1024 bytes.
> Thread 1073762688 adjusted timing: 14.407229 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 4
>Starting test with 4 threads...
> Thread 1073760528 adjusted timing: 3.782923 seconds for 10000000 requests of 1024 bytes.
> Thread 1073763792 adjusted timing: 6.668655 seconds for 10000000 requests of 1024 bytes.
> Thread 1073762688 adjusted timing: 14.346569 seconds for 10000000 requests of 1024 bytes.
> Thread 1073761584 adjusted timing: 14.680211 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 5
>Starting test with 5 threads...
> Thread 1073760560 adjusted timing: 4.748248 seconds for 10000000 requests of 1024 bytes.
> Thread 1073761584 adjusted timing: 9.898153 seconds for 10000000 requests of 1024 bytes.
> Thread 1073764896 adjusted timing: 13.019884 seconds for 10000000 requests of 1024 bytes.
> Thread 1073762688 adjusted timing: 15.326908 seconds for 10000000 requests of 1024 bytes.
> Thread 1073763792 adjusted timing: 15.442164 seconds for 10000000 requests of 1024 bytes.
>
>So it's 1.1 times faster for single-threaded, and 107 times faster
>with 3 threads.
>
>  
>
The problem with thread scheduling under 4bsd as reported earlier making
the first thread getting higher priority than the later threads, makes
these numbers a bit strange.

>With libthr instead of libpthread:
>
>phkmalloc:
>
># ./malloc-test 1024 10000000 1
>Starting test with 1 thread...
> Thread 5255680 adjusted timing: 2.357247 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 2
>Starting test with 2 threads...
> Thread 5256192 adjusted timing: 10.964918 seconds for 10000000 requests of 1024 bytes.
> Thread 5255680 adjusted timing: 11.001288 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 3
>Starting test with 3 threads...
> Thread 5255680 adjusted timing: 17.467754 seconds for 10000000 requests of 1024 bytes.
> Thread 5256704 adjusted timing: 17.724583 seconds for 10000000 requests of 1024 bytes.
> Thread 5256192 adjusted timing: 17.913381 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 4
>Starting test with 4 threads...
> Thread 5255680 adjusted timing: 42.715420 seconds for 10000000 requests of 1024 bytes.
> Thread 5256192 adjusted timing: 43.481252 seconds for 10000000 requests of 1024 bytes.
> Thread 5256704 adjusted timing: 43.871452 seconds for 10000000 requests of 1024 bytes.
> Thread 5257216 adjusted timing: 43.887820 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 5
>Starting test with 5 threads...
> Thread 5255680 adjusted timing: 139.316332 seconds for 10000000 requests of 1024 bytes.
> Thread 5257216 adjusted timing: 140.117720 seconds for 10000000 requests of 1024 bytes.
> Thread 5256192 adjusted timing: 140.134057 seconds for 10000000 requests of 1024 bytes.
> Thread 5256704 adjusted timing: 140.855289 seconds for 10000000 requests of 1024 bytes.
> Thread 5257728 adjusted timing: 140.865934 seconds for 10000000 requests of 1024 bytes.
>
>  
>
Looks reasonable for a serial allocator.

>jemalloc:
>
># ./malloc-test 1024 10000000 1
>Starting test with 1 thread...
> Thread 1073742416 adjusted timing: 1.366353 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 2
>Starting test with 2 threads...
> Thread 1073742416 adjusted timing: 1.429485 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 1.530733 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 3
>Starting test with 3 threads...
> Thread 1073742416 adjusted timing: 1.419813 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743376 adjusted timing: 1.432790 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 1.490218 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 4
>Starting test with 4 threads...
> Thread 1073743376 adjusted timing: 1.447554 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742416 adjusted timing: 1.503659 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743856 adjusted timing: 1.503937 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 1.504926 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 5
>Starting test with 5 threads...
> Thread 1073743376 adjusted timing: 1.595239 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 1.689753 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742416 adjusted timing: 1.750115 seconds for 10000000 requests of 1024 bytes.
> Thread 1073744336 adjusted timing: 1.744271 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743856 adjusted timing: 1.890269 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 6
>Starting test with 6 threads...
> Thread 1073743856 adjusted timing: 1.847653 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742416 adjusted timing: 2.018481 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743376 adjusted timing: 2.059817 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 2.129204 seconds for 10000000 requests of 1024 bytes.
> Thread 1073744336 adjusted timing: 2.223751 seconds for 10000000 requests of 1024 bytes.
> Thread 1073744816 adjusted timing: 2.293809 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 20
>Starting test with 20 threads...
> Thread 1073744816 adjusted timing: 5.113769 seconds for 10000000 requests of 1024 bytes.
> Thread 1073751136 adjusted timing: 4.973369 seconds for 10000000 requests of 1024 bytes.
> Thread 1073750176 adjusted timing: 5.295912 seconds for 10000000 requests of 1024 bytes.
> Thread 1073745296 adjusted timing: 5.502331 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743856 adjusted timing: 5.614890 seconds for 10000000 requests of 1024 bytes.
> Thread 1073744336 adjusted timing: 5.608690 seconds for 10000000 requests of 1024 bytes.
> Thread 1073752096 adjusted timing: 5.555465 seconds for 10000000 requests of 1024 bytes.
> Thread 1073748736 adjusted timing: 5.650922 seconds for 10000000 requests of 1024 bytes.
> Thread 1073748256 adjusted timing: 6.608054 seconds for 10000000 requests of 1024 bytes.
> Thread 1073750656 adjusted timing: 7.144998 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 7.390905 seconds for 10000000 requests of 1024 bytes.
> Thread 1073746256 adjusted timing: 7.364728 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742416 adjusted timing: 7.556064 seconds for 10000000 requests of 1024 bytes.
> Thread 1073749216 adjusted timing: 7.357179 seconds for 10000000 requests of 1024 bytes.
> Thread 1073752576 adjusted timing: 7.349483 seconds for 10000000 requests of 1024 bytes.
>c Thread 1073747776 adjusted timing: 7.375179 seconds for 10000000 requests of 1024 bytes.
> Thread 1073751616 adjusted timing: 7.557854 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743376 adjusted timing: 7.915978 seconds for 10000000 requests of 1024 bytes.
> Thread 1073749696 adjusted timing: 7.795219 seconds for 10000000 requests of 1024 bytes.
> Thread 1073745776 adjusted timing: 8.007392 seconds for 10000000 requests of 1024 bytes.
>
>  
>
Seems to experience the same scheduling issues but to a lesser extent.


>So libthr is *much* faster than libpthread with both malloc
>implementations, but jemalloc is still 1.7 times faster for 1 thread
>and 80 times faster for 5 threads than phkmalloc.
>
>  
>
This test simply tests the local arena performance making it the worst
case for serial allocators as all threads contend for the same lock.
At the same time this is the best case scenario for jemalloc as all
memory resides in the local arena. This means no contention at all
unless the threads get hashed into the same arena.
Basicly you are comparing worst case of phkmalloc vs best case of
jemalloc. =)

Would be nice if someone could run some supersmack benchmarks.

>Kris
>
>P.S. Holy crap :)
>  
>

_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by David Xu-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Johan Bucht wrote:
>
>>
> Those times seems way too high even for a serial allocator, is
> libpthread performance really this bad on amd64 or is it broken?
>
Not sure which thread library has been tested.


_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Kris Kennaway :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Mon, Dec 12, 2005 at 01:56:09PM +0800, David Xu wrote:
> Johan Bucht wrote:
> >
> >>
> >Those times seems way too high even for a serial allocator, is
> >libpthread performance really this bad on amd64 or is it broken?
> >
> Not sure which thread library has been tested.

The quoted text referred to libpthread, the default on amd64.

Kris


attachment0 (194 bytes) Download Attachment

Re: New libc malloc patch

by Daniel Eischen :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Sun, 11 Dec 2005, Jason Evans wrote:

> On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:
>
> > * Locking primitive
> > The biggest issue and as David Xu pointed out is probably the locking
> > primitives. The SPINLOCK use has a limit in the threading library and
> > makes is hard to have a lot of mutexes. I ended up using a wrapper
> > around the umtx_lock function to get recursive mutexes and it would
> > probably be better to extend the umtx functions to handle recursion.
> > This would probably also be appreciated by other malloc
> > implementations.
> > Might be interesting to implement some of the ideas from the Linux
> > futex
> > implementation to help umtx.
>
> I have been contemplating creating a separate spinlock API that
> doesn't require the threads library to track the spinlocks across
> fork.  This would (if I understand correctly) remove the current
> static spinlock limitations.

What about using pthread_atfork()?

> As for supporting recursive spinlocks, I doubt that the overhead
> would be acceptable in general.  If I could get rid of the need for
> the one recursive lock in malloc.c, I certainly would. =)

Why do we need a recursive mutex?  Can you not restructure the
code so that it is not needed?

--
DE

_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Jason Evans :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Dec 11, 2005, at 9:58 PM, Daniel Eischen wrote:
> On Sun, 11 Dec 2005, Jason Evans wrote:
>> I have been contemplating creating a separate spinlock API that
>> doesn't require the threads library to track the spinlocks across
>> fork.  This would (if I understand correctly) remove the current
>> static spinlock limitations.
>
> What about using pthread_atfork()?

Aren't there potential ordering issues for that?  It seems to me that  
the malloc pre-fork code would need to be run after any other pre-
fork functions, in order to avoid potential deadlock, and that the  
malloc post-fork code would need to be run before any other post-fork  
functions, again to avoid potential deadlock.

After looking at the spinlock code some more, it's no longer clear to  
me why the thread/thr_spinlock.c code uses a static array for  
spinlocks.  It seems to me that it would work fine to allow the  
program to provide space for a spinlock and manually initialize it.  
This would remove the limitation on the number of spinlocks.

>> As for supporting recursive spinlocks, I doubt that the overhead
>> would be acceptable in general.  If I could get rid of the need for
>> the one recursive lock in malloc.c, I certainly would. =)
>
> Why do we need a recursive mutex?  Can you not restructure the
> code so that it is not needed?

There is an internal arena that the malloc code uses for allocating  
internal data structures.  In some cases, the internal arena has to  
recursively allocate.  If there were no object caching, it might be  
possible to pre-allocate, such that recursion never happens, but  
given the object caching, it's difficult to reason about precisely  
what will occur internally for a single malloc/free operation.  There  
are some other possibilities, but nothing I've thought of so far is  
simple or elegant.

Fixing this would make all locking in the malloc code a bit cheaper,  
which is why it continues to bother me.

Jason
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Peter Jeremy :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Sun, 2005-Dec-11 16:04:09 -0800, Jason Evans wrote:
>early, during the load of the libbitmap module.  My best guess is  
>that it is stuck trying to acquire the Xlib lock, but cannot be sure,  
>since I don't know how to get debug symbols for the loaded X module.  

I've run into a related problem in the past.  Eric Anholt's advice was:
>You could compile all the modules into the server by adding "#define
>DoLoadableServer NO" to your host.conf when you compile XFree86-4-Server
>(see scripts/configure for other defines that are added).

Something similar probably works in X.org.  I found that I had to
remove elo2300 from XInputDrivers to correct a unresolved 'ELO2300'.

--
Peter Jeremy
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by David Xu-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Kris Kennaway wrote:

> On Mon, Dec 12, 2005 at 01:56:09PM +0800, David Xu wrote:
>
>>Johan Bucht wrote:
>>
>>>Those times seems way too high even for a serial allocator, is
>>>libpthread performance really this bad on amd64 or is it broken?
>>>
>>
>>Not sure which thread library has been tested.
>
>
> The quoted text referred to libpthread, the default on amd64.
>
> Kris
May try libthr too ? :-)

_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: New libc malloc patch

by Kris Kennaway :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Mon, Dec 12, 2005 at 03:18:01PM +0800, David Xu wrote:

> Kris Kennaway wrote:
> >On Mon, Dec 12, 2005 at 01:56:09PM +0800, David Xu wrote:
> >
> >>Johan Bucht wrote:
> >>
> >>>Those times seems way too high even for a serial allocator, is
> >>>libpthread performance really this bad on amd64 or is it broken?
> >>>
> >>
> >>Not sure which thread library has been tested.
> >
> >
> >The quoted text referred to libpthread, the default on amd64.
> >
> >Kris
> May try libthr too ? :-)
That was in the other half of the mail, I guess you didn't read it
thoroughly :)

Kris



attachment0 (194 bytes) Download Attachment

Re: New libc malloc patch

by Eric Anholt :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Mon, 2005-12-12 at 18:02 +1100, Peter Jeremy wrote:

> On Sun, 2005-Dec-11 16:04:09 -0800, Jason Evans wrote:
> >early, during the load of the libbitmap module.  My best guess is  
> >that it is stuck trying to acquire the Xlib lock, but cannot be sure,  
> >since I don't know how to get debug symbols for the loaded X module.  
>
> I've run into a related problem in the past.  Eric Anholt's advice was:
> >You could compile all the modules into the server by adding "#define
> >DoLoadableServer NO" to your host.conf when you compile XFree86-4-Server
> >(see scripts/configure for other defines that are added).
>
> Something similar probably works in X.org.  I found that I had to
> remove elo2300 from XInputDrivers to correct a unresolved 'ELO2300'.
DoLoadableServer NO may or may not work at any time, and frequently
changes bug reproducibility due to the lack of the loader process.  If
you can reproduce with xorg-server-snap, then WITH_DEBUG=1 during
compile will get you a server with debug symbols.

--
Eric Anholt                                     eta@...
http://people.freebsd.org/~anholt/              anholt@...


signature.asc (194 bytes) Download Attachment

Re: New libc malloc patch

by Doug Rabson :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Monday 12 December 2005 03:16, Johan Bucht wrote:
> Isn't 8 byte alignment expected by some applications?
> How do you know if a allocation is huge if you don't have a tag?
> Something more to read up on i guess. =)

Actually, I'm glad you pointed that out. My own applications use SSE2
primitives a lot and those guys need to be allocated on 16-byte
boundaries. I currently use phkmalloc which has the nice property that
small allocations are aligned to the next greater power-of-2 boundary
which is (for me) always at least 16.

Since you are targetting modern architectures with this allocator, could
you please set the minimum alignment on all i386 and amd64 processors
to 16. For what its worth, when gcc is using SSE instructions, it
assumes this for all memory, stack or malloc.
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."
< Prev | 1 - 2 - 3 - 4 | Next >