|
View:
New views
20 Messages
—
Rating Filter:
Alert me
|
| < Prev | 1 - 2 - 3 - 4 | Next > |
|
|
|
|
|
Re: New libc malloc patchIn the last episode (Dec 11), Jason Evans said:
> On Dec 11, 2005, at 4:35 PM, Julian Elischer wrote: > > Is there no way to make it an option for a while? that would get > > good testing AND a fallback for people. > > Unfortunately, there are some low level issues that make the two > malloc implementations incompatible, and they both need access to > libc internals in order to work correctly in a multi-threaded > program. The way I have been comparing the two implementations is > via chroot installations. It might be possible to make the two > compatible (would require extra coding), but since both of them need > to be part of libc, we would need a way of building separate libc > libraries for the two mallocs. This all seems uglier than it's worth > to me. Maybe there's another way... I have had good results by simply compiling malloc.c into a shared library and loading it via LD_PRELOAD. Works enough to run mozilla at least. -- Dan Nelson dnelson@... _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchJohan,
Thank you for your code review! Specific responses follow. On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote: > > Just a few comments, not sure if all of them are valid since I don't > have a recent enough system to apply the patch and test it, so my > comments are just based on reading the patch. I might have missed some > thing hidden between all the lines differing. Can you put up a patched > malloc.c somewhere? http://www.canonware.com/~jasone/jemalloc/malloc.c > * Fork safety functions > Nice to have for all allocators and is something I missed having. > Would > like to have them regardless if your malloc becomes standard or not. I think that the implementation is currently fork-safe. The threads libraries call _malloc_prefork() and _malloc_postfork() in order to avoid locking issues. > * magazines, per-arena caches > If I didn't miss something you don't seem to have it, pardon if I > missed > them in my quick readings. They can really help the scalability and is > something that is implemented in Solaris libumem allocator with great > success. I implemented them and they helped bring the scalability up > quite a bit on the system I tested my allocator on aswell as bringing > the allocation latency down. Arenas do use caches for objects below a size threshold (currently ~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems). The caching mechanism is quite different than magazines, but still very effective. > * Saw you used a tree to look up the allocation sizes at free, this > * differs from how I and Solaris libumem does it. Instead I went for > * prepending a 8/16 byte tag containing the size of the allocation for > * lockless size lookup. Actually, my implementation prepends a 4 byte tag that contains allocated/free flags, and the size of the allocation. The trees are only used to look up the sizes of huge allocations (> 8 MB on 32-bit platforms, and > 256 MB on 64-bit platforms). Huge allocations start at chunk boundaries, so prepending a tag would require an entire extra page (and would cause potentially serious external fragmentation on 32-bit systems). > * Bucket sizes > If I didn't miss something you are using power of two sizes. Might be > better to use smaller sizes to avoid internal fragmentation, aswell as > improving locking granularity. My implementation uses quantum-sized buckets (not power of two size classes). The quantum size is either 8 or 16 bytes, depending on the machine architecture. > * Locking granularity > You use a single lock for the chunk allocation instead of a lock per > bucket size or something similar. This means that you limit access > unless threads allocate from their private arenas. Since you hash > to get > an arena there might be multiple threads accessing the same arena > at the > same time introducing contention. Might be better to have a lock per > allocation size to somewhat improve the situation. Using a lock per allocation size requires that slower algorithms be used in some places (and it certainly complicates certain other operations). This didn't seem like a worthwhile tradeoff to me. By making the code faster and simpler, less time is spent in the malloc code. Unless an app does nothing but malloc/free, I don't expect this to be an issue. Even microbenchmarks that do nothing but malloc/ free don't appear to suffer. > * Locking primitive > The biggest issue and as David Xu pointed out is probably the locking > primitives. The SPINLOCK use has a limit in the threading library and > makes is hard to have a lot of mutexes. I ended up using a wrapper > around the umtx_lock function to get recursive mutexes and it would > probably be better to extend the umtx functions to handle recursion. > This would probably also be appreciated by other malloc > implementations. > Might be interesting to implement some of the ideas from the Linux > futex > implementation to help umtx. I have been contemplating creating a separate spinlock API that doesn't require the threads library to track the spinlocks across fork. This would (if I understand correctly) remove the current static spinlock limitations. As for supporting recursive spinlocks, I doubt that the overhead would be acceptable in general. If I could get rid of the need for the one recursive lock in malloc.c, I certainly would. =) > * Big allocations > It might be preferable to use mmap/munmap directly for big allocations > (bigger than a few pages) to avoid the overhead of trying to be smart. > These big allocations should be pretty few and shouldn't be that > pessimized by this. It also means some memory savings as you can > directly free memory from big allocations. I don't really see the approach I'm taking as "trying to be smart", so much as "making it simple by using the same algorithm for almost everything". Many other malloc implementations I've examined use three or more allocation approaches, depending on the allocation size. jemalloc uses only two, and huge allocations really are huge, such that they rarely if ever happen. In practice, I don't think there is a real memory savings to be had. Besides, madvise() calls in strategic locations would achieve approximately the same result. > * Ownership > Didn't look that much for it so might have missed it. Basicly what > happens if you free memory allocated in another arena, do you return > memory to the arena the memory was allocated from or to the current > threads arena? Not returning it to the original arena leads to cache > line sharing. Returning it to the original adds some overhead and it > might be argued that applications doing this might not be considered > scalable and should be fixed instead. Allocated objects can be associated with their parent arenas in constant time (by masking bits in the object addresses). This makes it possible to always free memory back to the same arena it came from. > * Standalone version > Would be nice to have a standalone version to test the allocator on > solaris and linux to compare against their allocators. Would be > nice for > all the people with only access to SMP machines running another OS. I > tested my allocator on both Solaris 9 and Freebsd 5/6 and it was both > helpful in testing scalability and impact on small machines aswell as > finding bugs in the implementation. I tested my allocator against > Solaris libc, Hoard and libumem on a 4-cpu Solaris 9 machine (4x > UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university and tested my > allocator vs the FreeBSD libc on my 400MHz laptop with too much beer > and almost non working cpu fan. Beer as in standing in 10cm of beer in > my backpack =). > Not sure how much of the implementation is FreeBSD dependant, RB- > trees?, I actually have a separate more portable implementation of jemalloc that is part of a programming language runtime library (jemalloc is simply a FreeBSD-ized stand-alone version of that allocator). I've tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris. On Linux I've compared against ptmalloc2, dlmalloc, and hoard. Unfortunately, that version of the allocator has to make some performance sacrifices regarding locking, since it doesn't have access to the pthreads internals. As such, it requires a lot of care to interpret performance results, so I haven't felt it to be worthwhile providing that version here. > * Focus - Single thread vs multithread > What are the focus of the allocator, how much memory is it worth to > waste to implement a good scalable implementation. This is the > reason I > think Solaris still keeps a serial allocator in libc and makes libumem > accessible through LD_PRELOAD and friends. jemalloc only creates two arenas for single-threaded applications (one for internal memory management, and one for user requests), so it doesn't really sacrifice much in the case of single-threaded programs -- a few pages maybe. In the case of multi-threaded programs, it potentially sacrifices memory compactness in order to reduce contention and cache line sharing. > * Comment > Might be wise to fix the comment about always using mmap since brk can > be used. Thanks, done. Jason _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchJason Evans wrote:
> It would be very informative to run benchmarks with real world > multithreaded apps. bind9 (built with threading support) would be a > great candidate, Maybe someday, but not at the moment. I've been told by the folks at ISC that in it's current form, threading is a pessimization on all versions of BIND 9 prior to the 9.4 code that they have in CVS (which has not been generally released). Thus, I would not expect performance of BIND 9 with threads to be an accurate reflection of the effectiveness of the library. hth, Doug -- If you're never wrong, you're not trying hard enough _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
|
|
|
Re: New libc malloc patchJason Evans wrote:
> Johan, > > Thank you for your code review! Specific responses follow. Wouldn't exactly call scrolling through a diff at 3 AM review =). > > On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote: > >> >> Just a few comments, not sure if all of them are valid since I don't >> have a recent enough system to apply the patch and test it, so my >> comments are just based on reading the patch. I might have missed some >> thing hidden between all the lines differing. Can you put up a patched >> malloc.c somewhere? > > > http://www.canonware.com/~jasone/jemalloc/malloc.c > >> * Fork safety functions >> Nice to have for all allocators and is something I missed having. Would >> like to have them regardless if your malloc becomes standard or not. > > > I think that the implementation is currently fork-safe. The threads > libraries call _malloc_prefork() and _malloc_postfork() in order to > avoid locking issues. > Hmm, I meant that the _malloc_prefork() functions are independent from your malloc allocation and I would like them committed regardless as they make the life easier for other malloc implementation. >> * magazines, per-arena caches >> If I didn't miss something you don't seem to have it, pardon if I >> missed >> them in my quick readings. They can really help the scalability and is >> something that is implemented in Solaris libumem allocator with great >> success. I implemented them and they helped bring the scalability up >> quite a bit on the system I tested my allocator on aswell as bringing >> the allocation latency down. > > > Arenas do use caches for objects below a size threshold (currently > ~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems). > The caching mechanism is quite different than magazines, but still > very effective. > Will look a bit more at this then. >> * Saw you used a tree to look up the allocation sizes at free, this >> * differs from how I and Solaris libumem does it. Instead I went for >> * prepending a 8/16 byte tag containing the size of the allocation for >> * lockless size lookup. > > > Actually, my implementation prepends a 4 byte tag that contains > allocated/free flags, and the size of the allocation. The trees are > only used to look up the sizes of huge allocations (> 8 MB on 32-bit > platforms, and > 256 MB on 64-bit platforms). Huge allocations start > at chunk boundaries, so prepending a tag would require an entire > extra page (and would cause potentially serious external > fragmentation on 32-bit systems). > How do you know if a allocation is huge if you don't have a tag? Something more to read up on i guess. =) >> * Bucket sizes >> If I didn't miss something you are using power of two sizes. Might be >> better to use smaller sizes to avoid internal fragmentation, aswell as >> improving locking granularity. > > > My implementation uses quantum-sized buckets (not power of two size > classes). The quantum size is either 8 or 16 bytes, depending on the > machine architecture. > the buckets. >> * Locking granularity >> You use a single lock for the chunk allocation instead of a lock per >> bucket size or something similar. This means that you limit access >> unless threads allocate from their private arenas. Since you hash to >> get >> an arena there might be multiple threads accessing the same arena at >> the >> same time introducing contention. Might be better to have a lock per >> allocation size to somewhat improve the situation. > > > Using a lock per allocation size requires that slower algorithms be > used in some places (and it certainly complicates certain other > operations). This didn't seem like a worthwhile tradeoff to me. By > making the code faster and simpler, less time is spent in the malloc > code. Unless an app does nothing but malloc/free, I don't expect > this to be an issue. Even microbenchmarks that do nothing but malloc/ > free don't appear to suffer. > just something that improved my allocator as I move magazines between local arenas and the global. >> * Locking primitive >> The biggest issue and as David Xu pointed out is probably the locking >> primitives. The SPINLOCK use has a limit in the threading library and >> makes is hard to have a lot of mutexes. I ended up using a wrapper >> around the umtx_lock function to get recursive mutexes and it would >> probably be better to extend the umtx functions to handle recursion. >> This would probably also be appreciated by other malloc >> implementations. >> Might be interesting to implement some of the ideas from the Linux >> futex >> implementation to help umtx. > > > I have been contemplating creating a separate spinlock API that > doesn't require the threads library to track the spinlocks across > fork. This would (if I understand correctly) remove the current > static spinlock limitations. > > As for supporting recursive spinlocks, I doubt that the overhead > would be acceptable in general. If I could get rid of the need for > the one recursive lock in malloc.c, I certainly would. =) > arena but it made the code more complicated and didn't really do much. >> * Big allocations >> It might be preferable to use mmap/munmap directly for big allocations >> (bigger than a few pages) to avoid the overhead of trying to be smart. >> These big allocations should be pretty few and shouldn't be that >> pessimized by this. It also means some memory savings as you can >> directly free memory from big allocations. > > > I don't really see the approach I'm taking as "trying to be smart", > so much as "making it simple by using the same algorithm for almost > everything". Many other malloc implementations I've examined use > three or more allocation approaches, depending on the allocation > size. jemalloc uses only two, and huge allocations really are huge, > such that they rarely if ever happen. In practice, I don't think > there is a real memory savings to be had. Besides, madvise() calls > in strategic locations would achieve approximately the same result. > chose the route to keep the big allocations simple and avoid caching and looking up stuff for. Basicly read the size and if it's big munmap it directly. >> * Ownership >> Didn't look that much for it so might have missed it. Basicly what >> happens if you free memory allocated in another arena, do you return >> memory to the arena the memory was allocated from or to the current >> threads arena? Not returning it to the original arena leads to cache >> line sharing. Returning it to the original adds some overhead and it >> might be argued that applications doing this might not be considered >> scalable and should be fixed instead. > > > Allocated objects can be associated with their parent arenas in > constant time (by masking bits in the object addresses). This makes > it possible to always free memory back to the same arena it came from. > >> * Standalone version >> Would be nice to have a standalone version to test the allocator on >> solaris and linux to compare against their allocators. Would be nice >> for >> all the people with only access to SMP machines running another OS. I >> tested my allocator on both Solaris 9 and Freebsd 5/6 and it was both >> helpful in testing scalability and impact on small machines aswell as >> finding bugs in the implementation. I tested my allocator against >> Solaris libc, Hoard and libumem on a 4-cpu Solaris 9 machine (4x >> UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university and tested my >> allocator vs the FreeBSD libc on my 400MHz laptop with too much beer >> and almost non working cpu fan. Beer as in standing in 10cm of beer in >> my backpack =). >> Not sure how much of the implementation is FreeBSD dependant, RB- >> trees?, > > > I actually have a separate more portable implementation of jemalloc > that is part of a programming language runtime library (jemalloc is > simply a FreeBSD-ized stand-alone version of that allocator). I've > tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris. On Linux I've > compared against ptmalloc2, dlmalloc, and hoard. Unfortunately, that > version of the allocator has to make some performance sacrifices > regarding locking, since it doesn't have access to the pthreads > internals. As such, it requires a lot of care to interpret > performance results, so I haven't felt it to be worthwhile providing > that version here. > >> * Focus - Single thread vs multithread >> What are the focus of the allocator, how much memory is it worth to >> waste to implement a good scalable implementation. This is the reason I >> think Solaris still keeps a serial allocator in libc and makes libumem >> accessible through LD_PRELOAD and friends. > > > jemalloc only creates two arenas for single-threaded applications > (one for internal memory management, and one for user requests), so > it doesn't really sacrifice much in the case of single-threaded > programs -- a few pages maybe. In the case of multi-threaded > programs, it potentially sacrifices memory compactness in order to > reduce contention and cache line sharing. > you stand in the issue of pessimizing single-threaded apps vs MT. And if it's worth keeping phkmalloc for some apps. >> * Comment >> Might be wise to fix the comment about always using mmap since brk can >> be used. > > > Thanks, done. > > Jason No problem, always fun to have some code that you actually understand the idea behind and see how it's implemented. /Johan Bucht _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchJohan Bucht wrote:
>>> * Fork safety functions >>> Nice to have for all allocators and is something I missed having. >>> Would >>> like to have them regardless if your malloc becomes standard or not. >> >> >> >> I think that the implementation is currently fork-safe. The threads >> libraries call _malloc_prefork() and _malloc_postfork() in order to >> avoid locking issues. >> > Hmm, I meant that the _malloc_prefork() functions are independent from > your malloc allocation and I would like them committed regardless as > they make the life easier for other malloc implementation. Bah, the libc bits not the actual functions of course. =) Should probably go to sleep instead of writing semi-understandable sentences. /Johan Bucht _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchOn Dec 11, 2005, at 7:16 PM, Johan Bucht wrote:
> Jason Evans wrote: >> >> On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote: >>> * Saw you used a tree to look up the allocation sizes at free, this >>> * differs from how I and Solaris libumem does it. Instead I went for >>> * prepending a 8/16 byte tag containing the size of the >>> allocation for >>> * lockless size lookup. >> >> Actually, my implementation prepends a 4 byte tag that contains >> allocated/free flags, and the size of the allocation. The trees >> are only used to look up the sizes of huge allocations (> 8 MB on >> 32-bit platforms, and > 256 MB on 64-bit platforms). Huge >> allocations start at chunk boundaries, so prepending a tag would >> require an entire extra page (and would cause potentially serious >> external fragmentation on 32-bit systems). >> > Isn't 8 byte alignment expected by some applications? Yes, 8 or 16 byte alignment is expected (in fact I'm of the opinion that we should be using 16 byte alignment on i386 due to SSE2/SSE3). If I understand your question correctly, you're asking how I get away with 4 byte tags. Consider that (assuming 8-byte quantum) a malloc (16) call must actually be serviced by at least 24 bytes internally, in order to pad to the next quantum size. If I used 8 byte tags, then malloc(17) would require 32 bytes internally. By using 4 byte tags, malloc(13)..malloc(20) can be serviced by 24 bytes internally. At least one implementation (the OS X malloc implementation) uses 2 byte tags, but this has two problems: 1) unaligned memory access is slow on some architectures, and 2) it's too small to handle large object sizes, so a different allocation strategy has to be used starting at ~12 KB. > How do you know if a allocation is huge if you don't have a tag? I know an allocation is huge if its base address is chunk-aligned. The actual size is stored in a red-black tree node, which is allocated separately. Jason _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchJason Evans wrote:
>> Isn't 8 byte alignment expected by some applications? > > > Yes, 8 or 16 byte alignment is expected (in fact I'm of the opinion > that we should be using 16 byte alignment on i386 due to SSE2/SSE3). > If I understand your question correctly, you're asking how I get away > with 4 byte tags. Consider that (assuming 8-byte quantum) a malloc > (16) call must actually be serviced by at least 24 bytes internally, > in order to pad to the next quantum size. If I used 8 byte tags, > then malloc(17) would require 32 bytes internally. By using 4 byte > tags, malloc(13)..malloc(20) can be serviced by 24 bytes internally. > At least one implementation (the OS X malloc implementation) uses 2 > byte tags, but this has two problems: 1) unaligned memory access is > slow on some architectures, and 2) it's too small to handle large > object sizes, so a different allocation strategy has to be used > starting at ~12 KB. > >> How do you know if a allocation is huge if you don't have a tag? > > > I know an allocation is huge if its base address is chunk-aligned. > The actual size is stored in a red-black tree node, which is > allocated separately. Ok, expected it was through the address, thanks for answering anyway. Gonna take some time reading through the code before asking more stupid questions. =) /Johan Bucht _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchOn Sun, Dec 11, 2005 at 08:29:07PM -0500, Kris Kennaway wrote:
> I'll try to test this on a 4 CPU amd64 machine next. phkmalloc: # ./malloc-test 1024 10000000 1 Starting test with 1 thread... Thread 5298176 adjusted timing: 4.173052 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 2 Starting test with 2 threads... Thread 5299200 adjusted timing: 325.108643 seconds for 10000000 requests of 1024 bytes. Thread 5298176 adjusted timing: 325.202485 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 3 Starting test with 3 threads... Thread 5414912 adjusted timing: 1133.238459 seconds for 10000000 requests of 1024 bytes. Thread 5299200 adjusted timing: 1134.525255 seconds for 10000000 requests of 1024 bytes. Thread 5298176 adjusted timing: 1134.539555 seconds for 10000000 requests of 1024 bytes. jemalloc: # ./malloc-test 1024 10000000 1 Starting test with 1 thread... Thread 1073760528 adjusted timing: 3.777175 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 2 Starting test with 2 threads... Thread 1073760560 adjusted timing: 3.851702 seconds for 10000000 requests of 1024 bytes. Thread 1073761584 adjusted timing: 3.887943 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 3 Starting test with 3 threads... Thread 1073760528 adjusted timing: 3.866206 seconds for 10000000 requests of 1024 bytes. Thread 1073761552 adjusted timing: 13.382795 seconds for 10000000 requests of 1024 bytes. Thread 1073762688 adjusted timing: 14.407229 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 4 Starting test with 4 threads... Thread 1073760528 adjusted timing: 3.782923 seconds for 10000000 requests of 1024 bytes. Thread 1073763792 adjusted timing: 6.668655 seconds for 10000000 requests of 1024 bytes. Thread 1073762688 adjusted timing: 14.346569 seconds for 10000000 requests of 1024 bytes. Thread 1073761584 adjusted timing: 14.680211 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 5 Starting test with 5 threads... Thread 1073760560 adjusted timing: 4.748248 seconds for 10000000 requests of 1024 bytes. Thread 1073761584 adjusted timing: 9.898153 seconds for 10000000 requests of 1024 bytes. Thread 1073764896 adjusted timing: 13.019884 seconds for 10000000 requests of 1024 bytes. Thread 1073762688 adjusted timing: 15.326908 seconds for 10000000 requests of 1024 bytes. Thread 1073763792 adjusted timing: 15.442164 seconds for 10000000 requests of 1024 bytes. So it's 1.1 times faster for single-threaded, and 107 times faster with 3 threads. With libthr instead of libpthread: phkmalloc: # ./malloc-test 1024 10000000 1 Starting test with 1 thread... Thread 5255680 adjusted timing: 2.357247 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 2 Starting test with 2 threads... Thread 5256192 adjusted timing: 10.964918 seconds for 10000000 requests of 1024 bytes. Thread 5255680 adjusted timing: 11.001288 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 3 Starting test with 3 threads... Thread 5255680 adjusted timing: 17.467754 seconds for 10000000 requests of 1024 bytes. Thread 5256704 adjusted timing: 17.724583 seconds for 10000000 requests of 1024 bytes. Thread 5256192 adjusted timing: 17.913381 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 4 Starting test with 4 threads... Thread 5255680 adjusted timing: 42.715420 seconds for 10000000 requests of 1024 bytes. Thread 5256192 adjusted timing: 43.481252 seconds for 10000000 requests of 1024 bytes. Thread 5256704 adjusted timing: 43.871452 seconds for 10000000 requests of 1024 bytes. Thread 5257216 adjusted timing: 43.887820 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 5 Starting test with 5 threads... Thread 5255680 adjusted timing: 139.316332 seconds for 10000000 requests of 1024 bytes. Thread 5257216 adjusted timing: 140.117720 seconds for 10000000 requests of 1024 bytes. Thread 5256192 adjusted timing: 140.134057 seconds for 10000000 requests of 1024 bytes. Thread 5256704 adjusted timing: 140.855289 seconds for 10000000 requests of 1024 bytes. Thread 5257728 adjusted timing: 140.865934 seconds for 10000000 requests of 1024 bytes. jemalloc: # ./malloc-test 1024 10000000 1 Starting test with 1 thread... Thread 1073742416 adjusted timing: 1.366353 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 2 Starting test with 2 threads... Thread 1073742416 adjusted timing: 1.429485 seconds for 10000000 requests of 1024 bytes. Thread 1073742896 adjusted timing: 1.530733 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 3 Starting test with 3 threads... Thread 1073742416 adjusted timing: 1.419813 seconds for 10000000 requests of 1024 bytes. Thread 1073743376 adjusted timing: 1.432790 seconds for 10000000 requests of 1024 bytes. Thread 1073742896 adjusted timing: 1.490218 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 4 Starting test with 4 threads... Thread 1073743376 adjusted timing: 1.447554 seconds for 10000000 requests of 1024 bytes. Thread 1073742416 adjusted timing: 1.503659 seconds for 10000000 requests of 1024 bytes. Thread 1073743856 adjusted timing: 1.503937 seconds for 10000000 requests of 1024 bytes. Thread 1073742896 adjusted timing: 1.504926 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 5 Starting test with 5 threads... Thread 1073743376 adjusted timing: 1.595239 seconds for 10000000 requests of 1024 bytes. Thread 1073742896 adjusted timing: 1.689753 seconds for 10000000 requests of 1024 bytes. Thread 1073742416 adjusted timing: 1.750115 seconds for 10000000 requests of 1024 bytes. Thread 1073744336 adjusted timing: 1.744271 seconds for 10000000 requests of 1024 bytes. Thread 1073743856 adjusted timing: 1.890269 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 6 Starting test with 6 threads... Thread 1073743856 adjusted timing: 1.847653 seconds for 10000000 requests of 1024 bytes. Thread 1073742416 adjusted timing: 2.018481 seconds for 10000000 requests of 1024 bytes. Thread 1073743376 adjusted timing: 2.059817 seconds for 10000000 requests of 1024 bytes. Thread 1073742896 adjusted timing: 2.129204 seconds for 10000000 requests of 1024 bytes. Thread 1073744336 adjusted timing: 2.223751 seconds for 10000000 requests of 1024 bytes. Thread 1073744816 adjusted timing: 2.293809 seconds for 10000000 requests of 1024 bytes. # ./malloc-test 1024 10000000 20 Starting test with 20 threads... Thread 1073744816 adjusted timing: 5.113769 seconds for 10000000 requests of 1024 bytes. Thread 1073751136 adjusted timing: 4.973369 seconds for 10000000 requests of 1024 bytes. Thread 1073750176 adjusted timing: 5.295912 seconds for 10000000 requests of 1024 bytes. Thread 1073745296 adjusted timing: 5.502331 seconds for 10000000 requests of 1024 bytes. Thread 1073743856 adjusted timing: 5.614890 seconds for 10000000 requests of 1024 bytes. Thread 1073744336 adjusted timing: 5.608690 seconds for 10000000 requests of 1024 bytes. Thread 1073752096 adjusted timing: 5.555465 seconds for 10000000 requests of 1024 bytes. Thread 1073748736 adjusted timing: 5.650922 seconds for 10000000 requests of 1024 bytes. Thread 1073748256 adjusted timing: 6.608054 seconds for 10000000 requests of 1024 bytes. Thread 1073750656 adjusted timing: 7.144998 seconds for 10000000 requests of 1024 bytes. Thread 1073742896 adjusted timing: 7.390905 seconds for 10000000 requests of 1024 bytes. Thread 1073746256 adjusted timing: 7.364728 seconds for 10000000 requests of 1024 bytes. Thread 1073742416 adjusted timing: 7.556064 seconds for 10000000 requests of 1024 bytes. Thread 1073749216 adjusted timing: 7.357179 seconds for 10000000 requests of 1024 bytes. Thread 1073752576 adjusted timing: 7.349483 seconds for 10000000 requests of 1024 bytes. c Thread 1073747776 adjusted timing: 7.375179 seconds for 10000000 requests of 1024 bytes. Thread 1073751616 adjusted timing: 7.557854 seconds for 10000000 requests of 1024 bytes. Thread 1073743376 adjusted timing: 7.915978 seconds for 10000000 requests of 1024 bytes. Thread 1073749696 adjusted timing: 7.795219 seconds for 10000000 requests of 1024 bytes. Thread 1073745776 adjusted timing: 8.007392 seconds for 10000000 requests of 1024 bytes. So libthr is *much* faster than libpthread with both malloc implementations, but jemalloc is still 1.7 times faster for 1 thread and 80 times faster for 5 threads than phkmalloc. Kris P.S. Holy crap :) |
|
|
Re: New libc malloc patchKris Kennaway wrote:
>On Sun, Dec 11, 2005 at 08:29:07PM -0500, Kris Kennaway wrote: > > > >>I'll try to test this on a 4 CPU amd64 machine next. >> >> > > > >phkmalloc: > ># ./malloc-test 1024 10000000 1 >Starting test with 1 thread... > Thread 5298176 adjusted timing: 4.173052 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 2 >Starting test with 2 threads... > Thread 5299200 adjusted timing: 325.108643 seconds for 10000000 requests of 1024 bytes. > Thread 5298176 adjusted timing: 325.202485 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 3 >Starting test with 3 threads... > Thread 5414912 adjusted timing: 1133.238459 seconds for 10000000 requests of 1024 bytes. > Thread 5299200 adjusted timing: 1134.525255 seconds for 10000000 requests of 1024 bytes. > Thread 5298176 adjusted timing: 1134.539555 seconds for 10000000 requests of 1024 bytes. > > > libpthread performance really this bad on amd64 or is it broken? >jemalloc: > ># ./malloc-test 1024 10000000 1 >Starting test with 1 thread... > Thread 1073760528 adjusted timing: 3.777175 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 2 >Starting test with 2 threads... > Thread 1073760560 adjusted timing: 3.851702 seconds for 10000000 requests of 1024 bytes. > Thread 1073761584 adjusted timing: 3.887943 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 3 >Starting test with 3 threads... > Thread 1073760528 adjusted timing: 3.866206 seconds for 10000000 requests of 1024 bytes. > Thread 1073761552 adjusted timing: 13.382795 seconds for 10000000 requests of 1024 bytes. > Thread 1073762688 adjusted timing: 14.407229 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 4 >Starting test with 4 threads... > Thread 1073760528 adjusted timing: 3.782923 seconds for 10000000 requests of 1024 bytes. > Thread 1073763792 adjusted timing: 6.668655 seconds for 10000000 requests of 1024 bytes. > Thread 1073762688 adjusted timing: 14.346569 seconds for 10000000 requests of 1024 bytes. > Thread 1073761584 adjusted timing: 14.680211 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 5 >Starting test with 5 threads... > Thread 1073760560 adjusted timing: 4.748248 seconds for 10000000 requests of 1024 bytes. > Thread 1073761584 adjusted timing: 9.898153 seconds for 10000000 requests of 1024 bytes. > Thread 1073764896 adjusted timing: 13.019884 seconds for 10000000 requests of 1024 bytes. > Thread 1073762688 adjusted timing: 15.326908 seconds for 10000000 requests of 1024 bytes. > Thread 1073763792 adjusted timing: 15.442164 seconds for 10000000 requests of 1024 bytes. > >So it's 1.1 times faster for single-threaded, and 107 times faster >with 3 threads. > > > the first thread getting higher priority than the later threads, makes these numbers a bit strange. >With libthr instead of libpthread: > >phkmalloc: > ># ./malloc-test 1024 10000000 1 >Starting test with 1 thread... > Thread 5255680 adjusted timing: 2.357247 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 2 >Starting test with 2 threads... > Thread 5256192 adjusted timing: 10.964918 seconds for 10000000 requests of 1024 bytes. > Thread 5255680 adjusted timing: 11.001288 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 3 >Starting test with 3 threads... > Thread 5255680 adjusted timing: 17.467754 seconds for 10000000 requests of 1024 bytes. > Thread 5256704 adjusted timing: 17.724583 seconds for 10000000 requests of 1024 bytes. > Thread 5256192 adjusted timing: 17.913381 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 4 >Starting test with 4 threads... > Thread 5255680 adjusted timing: 42.715420 seconds for 10000000 requests of 1024 bytes. > Thread 5256192 adjusted timing: 43.481252 seconds for 10000000 requests of 1024 bytes. > Thread 5256704 adjusted timing: 43.871452 seconds for 10000000 requests of 1024 bytes. > Thread 5257216 adjusted timing: 43.887820 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 5 >Starting test with 5 threads... > Thread 5255680 adjusted timing: 139.316332 seconds for 10000000 requests of 1024 bytes. > Thread 5257216 adjusted timing: 140.117720 seconds for 10000000 requests of 1024 bytes. > Thread 5256192 adjusted timing: 140.134057 seconds for 10000000 requests of 1024 bytes. > Thread 5256704 adjusted timing: 140.855289 seconds for 10000000 requests of 1024 bytes. > Thread 5257728 adjusted timing: 140.865934 seconds for 10000000 requests of 1024 bytes. > > > >jemalloc: > ># ./malloc-test 1024 10000000 1 >Starting test with 1 thread... > Thread 1073742416 adjusted timing: 1.366353 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 2 >Starting test with 2 threads... > Thread 1073742416 adjusted timing: 1.429485 seconds for 10000000 requests of 1024 bytes. > Thread 1073742896 adjusted timing: 1.530733 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 3 >Starting test with 3 threads... > Thread 1073742416 adjusted timing: 1.419813 seconds for 10000000 requests of 1024 bytes. > Thread 1073743376 adjusted timing: 1.432790 seconds for 10000000 requests of 1024 bytes. > Thread 1073742896 adjusted timing: 1.490218 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 4 >Starting test with 4 threads... > Thread 1073743376 adjusted timing: 1.447554 seconds for 10000000 requests of 1024 bytes. > Thread 1073742416 adjusted timing: 1.503659 seconds for 10000000 requests of 1024 bytes. > Thread 1073743856 adjusted timing: 1.503937 seconds for 10000000 requests of 1024 bytes. > Thread 1073742896 adjusted timing: 1.504926 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 5 >Starting test with 5 threads... > Thread 1073743376 adjusted timing: 1.595239 seconds for 10000000 requests of 1024 bytes. > Thread 1073742896 adjusted timing: 1.689753 seconds for 10000000 requests of 1024 bytes. > Thread 1073742416 adjusted timing: 1.750115 seconds for 10000000 requests of 1024 bytes. > Thread 1073744336 adjusted timing: 1.744271 seconds for 10000000 requests of 1024 bytes. > Thread 1073743856 adjusted timing: 1.890269 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 6 >Starting test with 6 threads... > Thread 1073743856 adjusted timing: 1.847653 seconds for 10000000 requests of 1024 bytes. > Thread 1073742416 adjusted timing: 2.018481 seconds for 10000000 requests of 1024 bytes. > Thread 1073743376 adjusted timing: 2.059817 seconds for 10000000 requests of 1024 bytes. > Thread 1073742896 adjusted timing: 2.129204 seconds for 10000000 requests of 1024 bytes. > Thread 1073744336 adjusted timing: 2.223751 seconds for 10000000 requests of 1024 bytes. > Thread 1073744816 adjusted timing: 2.293809 seconds for 10000000 requests of 1024 bytes. ># ./malloc-test 1024 10000000 20 >Starting test with 20 threads... > Thread 1073744816 adjusted timing: 5.113769 seconds for 10000000 requests of 1024 bytes. > Thread 1073751136 adjusted timing: 4.973369 seconds for 10000000 requests of 1024 bytes. > Thread 1073750176 adjusted timing: 5.295912 seconds for 10000000 requests of 1024 bytes. > Thread 1073745296 adjusted timing: 5.502331 seconds for 10000000 requests of 1024 bytes. > Thread 1073743856 adjusted timing: 5.614890 seconds for 10000000 requests of 1024 bytes. > Thread 1073744336 adjusted timing: 5.608690 seconds for 10000000 requests of 1024 bytes. > Thread 1073752096 adjusted timing: 5.555465 seconds for 10000000 requests of 1024 bytes. > Thread 1073748736 adjusted timing: 5.650922 seconds for 10000000 requests of 1024 bytes. > Thread 1073748256 adjusted timing: 6.608054 seconds for 10000000 requests of 1024 bytes. > Thread 1073750656 adjusted timing: 7.144998 seconds for 10000000 requests of 1024 bytes. > Thread 1073742896 adjusted timing: 7.390905 seconds for 10000000 requests of 1024 bytes. > Thread 1073746256 adjusted timing: 7.364728 seconds for 10000000 requests of 1024 bytes. > Thread 1073742416 adjusted timing: 7.556064 seconds for 10000000 requests of 1024 bytes. > Thread 1073749216 adjusted timing: 7.357179 seconds for 10000000 requests of 1024 bytes. > Thread 1073752576 adjusted timing: 7.349483 seconds for 10000000 requests of 1024 bytes. >c Thread 1073747776 adjusted timing: 7.375179 seconds for 10000000 requests of 1024 bytes. > Thread 1073751616 adjusted timing: 7.557854 seconds for 10000000 requests of 1024 bytes. > Thread 1073743376 adjusted timing: 7.915978 seconds for 10000000 requests of 1024 bytes. > Thread 1073749696 adjusted timing: 7.795219 seconds for 10000000 requests of 1024 bytes. > Thread 1073745776 adjusted timing: 8.007392 seconds for 10000000 requests of 1024 bytes. > > > >So libthr is *much* faster than libpthread with both malloc >implementations, but jemalloc is still 1.7 times faster for 1 thread >and 80 times faster for 5 threads than phkmalloc. > > > This test simply tests the local arena performance making it the worst case for serial allocators as all threads contend for the same lock. At the same time this is the best case scenario for jemalloc as all memory resides in the local arena. This means no contention at all unless the threads get hashed into the same arena. Basicly you are comparing worst case of phkmalloc vs best case of jemalloc. =) Would be nice if someone could run some supersmack benchmarks. >Kris > >P.S. Holy crap :) > > _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchJohan Bucht wrote:
> >> > Those times seems way too high even for a serial allocator, is > libpthread performance really this bad on amd64 or is it broken? > Not sure which thread library has been tested. _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchOn Mon, Dec 12, 2005 at 01:56:09PM +0800, David Xu wrote:
> Johan Bucht wrote: > > > >> > >Those times seems way too high even for a serial allocator, is > >libpthread performance really this bad on amd64 or is it broken? > > > Not sure which thread library has been tested. The quoted text referred to libpthread, the default on amd64. Kris |
|
|
Re: New libc malloc patchOn Sun, 11 Dec 2005, Jason Evans wrote:
> On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote: > > > * Locking primitive > > The biggest issue and as David Xu pointed out is probably the locking > > primitives. The SPINLOCK use has a limit in the threading library and > > makes is hard to have a lot of mutexes. I ended up using a wrapper > > around the umtx_lock function to get recursive mutexes and it would > > probably be better to extend the umtx functions to handle recursion. > > This would probably also be appreciated by other malloc > > implementations. > > Might be interesting to implement some of the ideas from the Linux > > futex > > implementation to help umtx. > > I have been contemplating creating a separate spinlock API that > doesn't require the threads library to track the spinlocks across > fork. This would (if I understand correctly) remove the current > static spinlock limitations. What about using pthread_atfork()? > As for supporting recursive spinlocks, I doubt that the overhead > would be acceptable in general. If I could get rid of the need for > the one recursive lock in malloc.c, I certainly would. =) Why do we need a recursive mutex? Can you not restructure the code so that it is not needed? -- DE _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchOn Dec 11, 2005, at 9:58 PM, Daniel Eischen wrote:
> On Sun, 11 Dec 2005, Jason Evans wrote: >> I have been contemplating creating a separate spinlock API that >> doesn't require the threads library to track the spinlocks across >> fork. This would (if I understand correctly) remove the current >> static spinlock limitations. > > What about using pthread_atfork()? Aren't there potential ordering issues for that? It seems to me that the malloc pre-fork code would need to be run after any other pre- fork functions, in order to avoid potential deadlock, and that the malloc post-fork code would need to be run before any other post-fork functions, again to avoid potential deadlock. After looking at the spinlock code some more, it's no longer clear to me why the thread/thr_spinlock.c code uses a static array for spinlocks. It seems to me that it would work fine to allow the program to provide space for a spinlock and manually initialize it. This would remove the limitation on the number of spinlocks. >> As for supporting recursive spinlocks, I doubt that the overhead >> would be acceptable in general. If I could get rid of the need for >> the one recursive lock in malloc.c, I certainly would. =) > > Why do we need a recursive mutex? Can you not restructure the > code so that it is not needed? There is an internal arena that the malloc code uses for allocating internal data structures. In some cases, the internal arena has to recursively allocate. If there were no object caching, it might be possible to pre-allocate, such that recursion never happens, but given the object caching, it's difficult to reason about precisely what will occur internally for a single malloc/free operation. There are some other possibilities, but nothing I've thought of so far is simple or elegant. Fixing this would make all locking in the malloc code a bit cheaper, which is why it continues to bother me. Jason _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchOn Sun, 2005-Dec-11 16:04:09 -0800, Jason Evans wrote:
>early, during the load of the libbitmap module. My best guess is >that it is stuck trying to acquire the Xlib lock, but cannot be sure, >since I don't know how to get debug symbols for the loaded X module. I've run into a related problem in the past. Eric Anholt's advice was: >You could compile all the modules into the server by adding "#define >DoLoadableServer NO" to your host.conf when you compile XFree86-4-Server >(see scripts/configure for other defines that are added). Something similar probably works in X.org. I found that I had to remove elo2300 from XInputDrivers to correct a unresolved 'ELO2300'. -- Peter Jeremy _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchKris Kennaway wrote:
> On Mon, Dec 12, 2005 at 01:56:09PM +0800, David Xu wrote: > >>Johan Bucht wrote: >> >>>Those times seems way too high even for a serial allocator, is >>>libpthread performance really this bad on amd64 or is it broken? >>> >> >>Not sure which thread library has been tested. > > > The quoted text referred to libpthread, the default on amd64. > > Kris _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
|
|
Re: New libc malloc patchOn Mon, Dec 12, 2005 at 03:18:01PM +0800, David Xu wrote:
> Kris Kennaway wrote: > >On Mon, Dec 12, 2005 at 01:56:09PM +0800, David Xu wrote: > > > >>Johan Bucht wrote: > >> > >>>Those times seems way too high even for a serial allocator, is > >>>libpthread performance really this bad on amd64 or is it broken? > >>> > >> > >>Not sure which thread library has been tested. > > > > > >The quoted text referred to libpthread, the default on amd64. > > > >Kris > May try libthr too ? :-) thoroughly :) Kris |
|
|
Re: New libc malloc patchOn Mon, 2005-12-12 at 18:02 +1100, Peter Jeremy wrote:
> On Sun, 2005-Dec-11 16:04:09 -0800, Jason Evans wrote: > >early, during the load of the libbitmap module. My best guess is > >that it is stuck trying to acquire the Xlib lock, but cannot be sure, > >since I don't know how to get debug symbols for the loaded X module. > > I've run into a related problem in the past. Eric Anholt's advice was: > >You could compile all the modules into the server by adding "#define > >DoLoadableServer NO" to your host.conf when you compile XFree86-4-Server > >(see scripts/configure for other defines that are added). > > Something similar probably works in X.org. I found that I had to > remove elo2300 from XInputDrivers to correct a unresolved 'ELO2300'. changes bug reproducibility due to the lack of the loader process. If you can reproduce with xorg-server-snap, then WITH_DEBUG=1 during compile will get you a server with debug symbols. -- Eric Anholt eta@... http://people.freebsd.org/~anholt/ anholt@... |
|
|
Re: New libc malloc patchOn Monday 12 December 2005 03:16, Johan Bucht wrote:
> Isn't 8 byte alignment expected by some applications? > How do you know if a allocation is huge if you don't have a tag? > Something more to read up on i guess. =) Actually, I'm glad you pointed that out. My own applications use SSE2 primitives a lot and those guys need to be allocated on 16-byte boundaries. I currently use phkmalloc which has the nice property that small allocations are aligned to the next greater power-of-2 boundary which is (for me) always at least 16. Since you are targetting modern architectures with this allocator, could you please set the minimum alignment on all i386 and amd64 processors to 16. For what its worth, when gcc is using SSE instructions, it assumes this for all memory, stack or malloc. _______________________________________________ freebsd-current@... mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@..." |
| < Prev | 1 - 2 - 3 - 4 | Next > |
| Free embeddable forum powered by Nabble | Forum Help |