|
View:
New views
14 Messages
—
Rating Filter:
Alert me
|
|
|
[RFC] Pixman & compositing with overlapping source and destination pixel dataHi,
What kind of guarantees (or the lack of) pixman and XRender are supposed to provide when dealing with overlapping parts of images? The practical use case could be scrolling of data inside of a single big image. If rendering with overlapped source and destination areas is not supported, a temporary image has to be created to achieve expected result and this is an additional performance hit. With the ongoing pixman refactoring and fixup for various inconsistencies, it would be nice if the status of overlapped rendering could be clarified too. The whole thing is kind of like memcpy vs. memove The possible cases to handle are: 1. User provides the same destination and either source or mask images to some of the pixman functions with overlapping source and destination areas. 2. User provides same memory area, but wrapped into different pixman images to pixman rendering function, so that source and destination pixel data still overlaps Generic path in pixman fetches lines into a temporary buffer, so everything should be fine in horizontal direction. For vertical direction, reversing the order of handling lines (and using negative stride) could be also possible. So if overlapped rendering is to be supported, technically there should not be any big problems. Performance would be worse in the case when overlapped rendering has to be sent to the generic path, but it would provide incorrect (or unexpected) results in the fast path anyway. Overlapping-aware fast path functions can be also created. What do you think about the whole issue? -- Best regards, Siarhei Siamashka _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataOn Wed, 2009-06-03 at 12:42 +0300, Siarhei Siamashka wrote:
> Hi, > > What kind of guarantees (or the lack of) pixman and XRender are supposed to > provide when dealing with overlapping parts of images? > > The practical use case could be scrolling of data inside of a single big > image. If rendering with overlapped source and destination areas is not > supported, a temporary image has to be created to achieve expected result > and this is an additional performance hit. > What do you think about the whole issue? We're seeing exactly this kind of problem when scrolling in Firefox. We can speed up the primitives that Firefox uses to draw, but at the moment scrolling is limited by the interface from Xorg to Pixman used for XCopyArea. We know that Pixman is being used, because we see a performance improvement when we turn on a NEON version of pixman_blt(). When an overlapped XCopyArea is issued - as is typical for a short scroll in any direction - XCopyArea breaks up the request into individual scanline fetches and puts. This produces correct results, but because each scanline is done via two full Pixman call chains, it is hideously slow. By that I mean that it can be about a third of the speed of a full copy-out followed by copy-back. (In fact, I think XCopyArea does this regardless of whether the areas actually overlap, if the source and dest images are the same. Ugh.) If Pixman were able to guarantee that "reasonable" overlapped copy requests would behave like a "move", this highly inefficient special case could be removed from Xorg, and there would be more opportunities for optimisation within Pixman itself, where they are more likely to be accepted. "Reasonable" in this case could mean that the same image must be referenced for both ends, rather than the "accidental overlap" scenario. The row-stride argument of the blitters already accepts a signed argument, so in theory this should work just fine. -- ------ From: Jonathan Morton jonathan.morton@... _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataSiarhei Siamashka <siarhei.siamashka@...> writes:
> What kind of guarantees (or the lack of) pixman and XRender are supposed to > provide when dealing with overlapping parts of images? (Adding xorg-devel). See this thread: http://lists.freedesktop.org/archives/xorg/2008-October/039346.html The guarantee that I would suggest for Render and pixman is that if any pixel is both read and written in the same request, then the result of the whole request is undefined, except that it obeys the clipping rules. > The practical use case could be scrolling of data inside of a single big > image. If rendering with overlapped source and destination areas is not > supported, a temporary image has to be created to achieve expected result > and this is an additional performance hit. Yes, scrolling is one thing that the current pixman API doesn't really provide. 'pixman_blt()' only deals with cases where the source and destination don't overlap. I think the best solution is to move all of the X primitives (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would probably look something like this: void pixman_copy_area (pixman_image_t *src, pixman_image_t *dest, pixman_gc_t *gc, int src_x, int src_y, int width, int height, int dest_x, int dest_y); and that would be guaranteed to handle overlapping src and dest. A pixman_gc_t would be a new type of object, corresponding to an X GC. pixman_blt() would then become a deprecated wrapper that would just call pixman_copy_area(). Same for pixman_fill() and a new pixman_fill_rectangles(). > Generic path in pixman fetches lines into a temporary buffer, so everything > should be fine in horizontal direction. For vertical direction, reversing the > order of handling lines (and using negative stride) could be also possible. > So if overlapped rendering is to be supported, technically there should not be > any big problems. Performance would be worse in the case when overlapped > rendering has to be sent to the generic path, but it would provide incorrect > (or unexpected) results in the fast path anyway. Overlapping-aware fast path > functions can be also created. I am not necessarily opposed to making Render guarantee correct operation for overlapping sources and destinations if someone wants to fix the implementation. However, - many of the operators aren't all that useful when used with overlapping sources and destinations; - to be useful for scrolling windows, applications need to get GraphicsExposure events, and Render currently doesn't generate those; - for images with transformations, you will need to make a copy anyway; - you'll need to fix the server and drivers so that they either implement this correctly, or fall back to software. Soren _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataCairo can do a transformation when copying (like a rotation or scale). I
think it is best to say that no guarantee is made when the source and destination intersect. You could say it works if there is no transformation other than integer translation but it seems a little artificial to me. It seems most consistent to say that nothing is allowed. Also on modern graphics systems it is probably a lot easier to write and just as fast to just draw the entire image over again in it's new position, rather than trying to copy areas. Or if copying an area really does help, making the entire image in an off-screen surface and copying from that would work better. Soeren Sandmann wrote: > Siarhei Siamashka <siarhei.siamashka@...> writes: > >> What kind of guarantees (or the lack of) pixman and XRender are supposed to >> provide when dealing with overlapping parts of images? -- Bill Spitzak, Senior Software Engineer The Foundry, 1 Wardour Street, London, W1D 6PA, UK Tel: +44 (0)20 7434 0449 * Fax: +44 (0)20 7434 1550 * Web: www.thefoundry.co.uk The Foundry Visionmongers Ltd * Registered in England and Wales No: 4642027 _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataOn Thursday 04 June 2009, Soeren Sandmann wrote:
> Siarhei Siamashka <siarhei.siamashka@...> writes: > > What kind of guarantees (or the lack of) pixman and XRender are supposed > > to provide when dealing with overlapping parts of images? > > (Adding xorg-devel). See this thread: > > http://lists.freedesktop.org/archives/xorg/2008-October/039346.html > > The guarantee that I would suggest for Render and pixman is that if > any pixel is both read and written in the same request, then the > result of the whole request is undefined, except that it obeys the > clipping rules. > > > The practical use case could be scrolling of data inside of a single big > > image. If rendering with overlapped source and destination areas is not > > supported, a temporary image has to be created to achieve expected result > > and this is an additional performance hit. > > Yes, scrolling is one thing that the current pixman API doesn't really > provide. 'pixman_blt()' only deals with cases where the source and > destination don't overlap. > > I think the best solution is to move all of the X primitives > (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would > probably look something like this: > > void > pixman_copy_area (pixman_image_t *src, > pixman_image_t *dest, > pixman_gc_t *gc, > int src_x, > int src_y, > int width, > int height, > int dest_x, > int dest_y); > > and that would be guaranteed to handle overlapping src and dest. A > pixman_gc_t would be a new type of object, corresponding to an X GC. > > pixman_blt() would then become a deprecated wrapper that would just > call pixman_copy_area(). Same for pixman_fill() and a new > pixman_fill_rectangles(). I'm not sure about pixman_gc_t since most of the needed operations are just simple copies. What about starting with just introducing a variant of 'pixman_blt' which is overlapping aware? I created a work-in-progress branch with 'pixman_blt' function (generic C implementation for now) extended to support overlapped source/destination case. A simple test program is also included: http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt Making use of the already existing SIMD optimized pixel copy functions should provide fast scrolling in all the directions except for from left to right. This special case will require a SIMD optimized backwards copy. I wonder if it makes sense to drop delegates support for pixman_blt and make call chain shorter when introducing SIMD optimized copies? It seems to be a little bit overdesigned here. Running test program for the current pixman master (SSE2 optimized): $ test/overlapped-blt-test bpp=8, supported=0, normal_ok=0, overlapped_ok=0 bpp=16, supported=1, normal_ok=1, overlapped_ok=0 bpp=32, supported=1, normal_ok=1, overlapped_ok=0 Running test program for the pixman from this branch with generic C version of pixman_blt (8bpp now uses a fallback to generic C implementation): $ test/overlapped-blt-test bpp=8, supported=1, normal_ok=1, overlapped_ok=1 bpp=16, supported=1, normal_ok=1, overlapped_ok=0 bpp=32, supported=1, normal_ok=1, overlapped_ok=0 -- Best regards, Siarhei Siamashka _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataHi,
> On Thursday 04 June 2009, Soeren Sandmann wrote: > > Siarhei Siamashka <siarhei.siamashka@...> writes: > > > What kind of guarantees (or the lack of) pixman and XRender are supposed > > > to provide when dealing with overlapping parts of images? > > > > (Adding xorg-devel). See this thread: > > > > http://lists.freedesktop.org/archives/xorg/2008-October/039346.html > > > > The guarantee that I would suggest for Render and pixman is that if > > any pixel is both read and written in the same request, then the > > result of the whole request is undefined, except that it obeys the > > clipping rules. > > > > > The practical use case could be scrolling of data inside of a single big > > > image. If rendering with overlapped source and destination areas is not > > > supported, a temporary image has to be created to achieve expected result > > > and this is an additional performance hit. > > > > Yes, scrolling is one thing that the current pixman API doesn't really > > provide. 'pixman_blt()' only deals with cases where the source and > > destination don't overlap. > > > > I think the best solution is to move all of the X primitives > > (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would > > probably look something like this: > > > > void > > pixman_copy_area (pixman_image_t *src, > > pixman_image_t *dest, > > pixman_gc_t *gc, > > int src_x, > > int src_y, > > int width, > > int height, > > int dest_x, > > int dest_y); > > > > and that would be guaranteed to handle overlapping src and dest. A > > pixman_gc_t would be a new type of object, corresponding to an X GC. > > > > pixman_blt() would then become a deprecated wrapper that would just > > call pixman_copy_area(). Same for pixman_fill() and a new > > pixman_fill_rectangles(). > > I'm not sure about pixman_gc_t since most of the needed operations are just > simple copies. What about starting with just introducing a variant > of 'pixman_blt' which is overlapping aware? The pixman_blt() interface is misdesigned for two reasons: (1) the strides are given in number-of-uint32_ts, which gratuitously limits the types of images that can be processed, and (2) it can fail if it doesn't like the input for some reason. At the same time, having the core primitives available on the client side is useful in some cases, and the software implementation of them can more easily be optimized with SIMD instructions in pixman. Moving core rendering into pixman solves both issues at the same time. But that said, I am not opposed to extending pixman_blt() to support overlapping copies. That is certainly a simpler first step. > I created a work-in-progress branch with 'pixman_blt' function (generic C > implementation for now) extended to support overlapped source/destination > case. A simple test program is also included: > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt > > Making use of the already existing SIMD optimized pixel copy functions should > provide fast scrolling in all the directions except for from left to right. > This special case will require a SIMD optimized backwards copy. > > I wonder if it makes sense to drop delegates support for pixman_blt and make > call chain shorter when introducing SIMD optimized copies? It seems to be a > little bit overdesigned here. How would you support SSE2 and MMX in the same binary then? Also, I really don't see much potential for saving here. For a NEON implementation of blt, the callchain would be: pixman_blt() -> _pixman_implementation_blt() -> neon_blt() and getting rid of delegates wouldn't really affect that at all. You could get rid of the _pixman_implementation_blt() call by making it a macro, but as I mentioned before, gcc turns it into a tail call that reused the arguments on the stack, so the overhead really is minimal. Soren _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataOn Tuesday 20 October 2009, Soeren Sandmann wrote:
[...] > > I'm not sure about pixman_gc_t since most of the needed operations are > > just simple copies. What about starting with just introducing a variant > > of 'pixman_blt' which is overlapping aware? > > The pixman_blt() interface is misdesigned for two reasons: (1) the > strides are given in number-of-uint32_ts, which gratuitously limits > the types of images that can be processed, and (2) it can fail if it > doesn't like the input for some reason. > > At the same time, having the core primitives available on the client > side is useful in some cases, and the software implementation of them > can more easily be optimized with SIMD instructions in pixman. > > Moving core rendering into pixman solves both issues at the same time. I don't have any strong opinion about API updates. In any case, smooth upgrade path needs to be taken care of and the users should be prevented from using incompatible versions of client applications/libraries and pixmap. An introduction of a new function may be the best way, it can also solve some of the design issues. > But that said, I am not opposed to extending pixman_blt() to support > overlapping copies. That is certainly a simpler first step. Yes, the functionality itself can be introduced first (without breaking anything). Wrapping it into a better API can be done as the natural next step. > > I created a work-in-progress branch with 'pixman_blt' function (generic C > > implementation for now) extended to support overlapped source/destination > > case. A simple test program is also included: > > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt > > > > Making use of the already existing SIMD optimized pixel copy functions > > should provide fast scrolling in all the directions except for from left > > to right. This special case will require a SIMD optimized backwards copy. > > > > I wonder if it makes sense to drop delegates support for pixman_blt and > > make call chain shorter when introducing SIMD optimized copies? It seems > > to be a little bit overdesigned here. > > How would you support SSE2 and MMX in the same binary then? The most simple way is to do it in my opinion is the following. First introduce something like 'pixman_init' function. Right now CPU type detection is done on the first call to the function. It introduces some minor overhead by having an extra pointer check on each function call. Another problem is that we can't be completely sure that CPU capabilities detection check is always fully reentrant. For example, some platforms may try to set a signal handler and expect to catch SIGILL or something like this. This initialization function would just detect CPU capabilities and set some function pointers. The whole CPU-specific implementation of 'pixman_blt' may be just called via this pointer directly by a client. Or 'pixman_blt' can be just a small thunk which does a call via function pointer, passes exactly the same arguments to it and does nothing more. In this case there will be really no excuse for the compiler for not using tail call, see below. > Also, I really don't see much potential for saving here. For a NEON > implementation of blt, the callchain would be: > > pixman_blt() -> _pixman_implementation_blt() -> neon_blt() > > and getting rid of delegates wouldn't really affect that at all. You > could get rid of the _pixman_implementation_blt() call by making it a > macro, but as I mentioned before, gcc turns it into a tail call that > reused the arguments on the stack, so the overhead really is minimal. On what kind of platform and with which version of gcc are you getting proper tail call here? I don't see it being used and the overhead is rather hefty, which is also confirmed by benchmarking and profiling. Even if gcc can reuse some part of the arguments which are already on stack in some cases, different platforms may have different ABI and calling conventions. For example, for ARM and x86-64, the first few arguments are passed in registers, the rest is on stack. Relying on the compiler to always do the job properly identifying tail call possibilities in all cases may be not the very best idea. C: PIXMAN_EXPORT pixman_bool_t pixman_blt (uint32_t *src_bits, uint32_t *dst_bits, int src_stride, int dst_stride, int src_bpp, int dst_bpp, int src_x, int src_y, int dst_x, int dst_y, int width, int height) { if (!imp) imp = _pixman_choose_implementation (); return _pixman_implementation_blt (imp, src_bits, dst_bits, src_stride, dst_stride, src_bpp, dst_bpp, src_x, src_y, dst_x, dst_y, width, height); } x86, gcc 4.3.2: 00000420 <pixman_blt>: 420: 55 push %ebp 421: 89 e5 mov %esp,%ebp 423: 83 ec 38 sub $0x38,%esp 426: 8b 15 00 00 00 00 mov 0x0,%edx 42c: 85 d2 test %edx,%edx 42e: 74 68 je 498 <pixman_blt+0x78> 430: 8b 45 34 mov 0x34(%ebp),%eax 433: 89 44 24 30 mov %eax,0x30(%esp) 437: 8b 45 30 mov 0x30(%ebp),%eax 43a: 89 44 24 2c mov %eax,0x2c(%esp) 43e: 8b 45 2c mov 0x2c(%ebp),%eax 441: 89 44 24 28 mov %eax,0x28(%esp) 445: 8b 45 28 mov 0x28(%ebp),%eax 448: 89 44 24 24 mov %eax,0x24(%esp) 44c: 8b 45 24 mov 0x24(%ebp),%eax 44f: 89 44 24 20 mov %eax,0x20(%esp) 453: 8b 45 20 mov 0x20(%ebp),%eax 456: 89 44 24 1c mov %eax,0x1c(%esp) 45a: 8b 45 1c mov 0x1c(%ebp),%eax 45d: 89 44 24 18 mov %eax,0x18(%esp) 461: 8b 45 18 mov 0x18(%ebp),%eax 464: 89 44 24 14 mov %eax,0x14(%esp) 468: 8b 45 14 mov 0x14(%ebp),%eax 46b: 89 44 24 10 mov %eax,0x10(%esp) 46f: 8b 45 10 mov 0x10(%ebp),%eax 472: 89 44 24 0c mov %eax,0xc(%esp) 476: 8b 45 0c mov 0xc(%ebp),%eax 479: 89 44 24 08 mov %eax,0x8(%esp) 47d: 8b 45 08 mov 0x8(%ebp),%eax 480: 89 44 24 04 mov %eax,0x4(%esp) 484: a1 00 00 00 00 mov 0x0,%eax 489: 89 04 24 mov %eax,(%esp) 48c: e8 fc ff ff ff call 48d <pixman_blt+0x6d> 491: c9 leave 492: c3 ret 493: 90 nop 494: 8d 74 26 00 lea 0x0(%esi),%esi 498: e8 fc ff ff ff call 499 <pixman_blt+0x79> 49d: a3 00 00 00 00 mov %eax,0x0 4a2: 8d b6 00 00 00 00 lea 0x0(%esi),%esi 4a8: eb 86 jmp 430 <pixman_blt+0x10> 4aa: 8d b6 00 00 00 00 lea 0x0(%esi),%esi ARM, gcc 4.3.4: 000003a4 <pixman_blt>: 3a4: e92d41f0 push {r4, r5, r6, r7, r8, lr} 3a8: e59f4088 ldr r4, [pc, #136] ; 438 <pixman_blt+0x94> 3ac: e1a08001 mov r8, r1 3b0: e24dd028 sub sp, sp, #40 ; 0x28 3b4: e5941000 ldr r1, [r4] 3b8: e1a06000 mov r6, r0 3bc: e3510000 cmp r1, #0 ; 0x0 3c0: e1a07002 mov r7, r2 3c4: e1a05003 mov r5, r3 3c8: 0a000017 beq 42c <pixman_blt+0x88> 3cc: e59dc040 ldr ip, [sp, #64] 3d0: e5940000 ldr r0, [r4] 3d4: e59de044 ldr lr, [sp, #68] 3d8: e58dc004 str ip, [sp, #4] 3dc: e59dc048 ldr ip, [sp, #72] 3e0: e58de008 str lr, [sp, #8] 3e4: e59de04c ldr lr, [sp, #76] 3e8: e58dc00c str ip, [sp, #12] 3ec: e59dc050 ldr ip, [sp, #80] 3f0: e58de010 str lr, [sp, #16] 3f4: e59d405c ldr r4, [sp, #92] 3f8: e58dc014 str ip, [sp, #20] 3fc: e59de054 ldr lr, [sp, #84] 400: e59dc058 ldr ip, [sp, #88] 404: e1a01006 mov r1, r6 408: e1a02008 mov r2, r8 40c: e1a03007 mov r3, r7 410: e58d5000 str r5, [sp] 414: e58de018 str lr, [sp, #24] 418: e58dc01c str ip, [sp, #28] 41c: e58d4020 str r4, [sp, #32] 420: ebfffffe bl 0 <_pixman_implementation_blt> 424: e28dd028 add sp, sp, #40 ; 0x28 428: e8bd81f0 pop {r4, r5, r6, r7, r8, pc} 42c: ebfffffe bl 0 <_pixman_choose_implementation> 430: e5840000 str r0, [r4] 434: eaffffe4 b 3cc <pixman_blt+0x28> 438: 00000000 .word 0x00000000 -- Best regards, Siarhei Siamashka _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataSiarhei Siamashka <siarhei.siamashka@...> writes:
> First introduce something like 'pixman_init' function. Right now CPU type > detection is done on the first call to the function. It introduces some > minor overhead by having an extra pointer check on each function call. > Another problem is that we can't be completely sure that CPU capabilities > detection check is always fully reentrant. For example, some platforms may > try to set a signal handler and expect to catch SIGILL or something like > this. > > This initialization function would just detect CPU capabilities and set some > function pointers. The whole CPU-specific implementation of 'pixman_blt' > may be just called via this pointer directly by a client. Or 'pixman_blt' can > be just a small thunk which does a call via function pointer, passes exactly > the same arguments to it and does nothing more. In this case there will be > really no excuse for the compiler for not using tail call, see > below. Adding a pixman_init() that applications would be required to call first, would not be a compatible change. If we are designing new API, then I really think it should be done in such a way that it can be extended to handle the core rendering primitives. It does likely make sense to make the pixman_implementation_t type public at some point (renamed to pixman_t probably) and then pass it directly to the various entry points. This would be necessary if we add hardware acceleration to pixman. > > Also, I really don't see much potential for saving here. For a NEON > > implementation of blt, the callchain would be: > > > > pixman_blt() -> _pixman_implementation_blt() -> neon_blt() > > > > and getting rid of delegates wouldn't really affect that at all. You > > could get rid of the _pixman_implementation_blt() call by making it a > > macro, but as I mentioned before, gcc turns it into a tail call that > > reused the arguments on the stack, so the overhead really is minimal. > > On what kind of platform and with which version of gcc are you getting > proper tail call here? I meant that the _pixman_implemenation_blt() -> neon_blt() would be a tail call. GCC v 4.3.2 on x86-32 produces: _pixman_implementation_blt: pushl %ebp movl %esp, %ebp movl 8(%ebp), %edx popl %ebp movl 12(%edx), %ecx jmp *%ecx .size _pixman_implementation_blt, .-_pixman_implementation_blt .p2align 4,,15 > I don't see it being used and the overhead is rather hefty, which is > also confirmed by benchmarking and profiling. Well, with a microbenchmark you can make anything stand out. Ultimately, this function is called from XCopyArea(), and compared to the marshalling of the client call and the long call chain inside the X server, these 35 instructions or so, really are not very significant. I think Jonathan said that pixman_blt() was getting called once per scanline, but I'm pretty sure that's not the case. (Or if it is, that would be the first thing to fix before worrying about eliminating this call). Soren _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataOn Wednesday 21 October 2009, Soeren Sandmann wrote:
> Siarhei Siamashka <siarhei.siamashka@...> writes: > > First introduce something like 'pixman_init' function. Right now CPU type > > detection is done on the first call to the function. It introduces some > > minor overhead by having an extra pointer check on each function call. > > Another problem is that we can't be completely sure that CPU capabilities > > detection check is always fully reentrant. For example, some platforms > > may try to set a signal handler and expect to catch SIGILL or something > > like this. > > > > This initialization function would just detect CPU capabilities and set > > some function pointers. The whole CPU-specific implementation of > > 'pixman_blt' may be just called via this pointer directly by a client. Or > > 'pixman_blt' can be just a small thunk which does a call via function > > pointer, passes exactly the same arguments to it and does nothing more. > > In this case there will be really no excuse for the compiler for not > > using tail call, see > > below. > > Adding a pixman_init() that applications would be required to call > first, would not be a compatible change. If we are designing new API, > then I really think it should be done in such a way that it can be > extended to handle the core rendering primitives. > It does likely make sense to make the pixman_implementation_t type > public at some point (renamed to pixman_t probably) and then pass it > directly to the various entry points. This would be necessary if we > add hardware acceleration to pixman. > > > > Also, I really don't see much potential for saving here. For a NEON > > > implementation of blt, the callchain would be: > > > > > > pixman_blt() -> _pixman_implementation_blt() -> neon_blt() > > > > > > and getting rid of delegates wouldn't really affect that at all. You > > > could get rid of the _pixman_implementation_blt() call by making it a > > > macro, but as I mentioned before, gcc turns it into a tail call that > > > reused the arguments on the stack, so the overhead really is minimal. http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2 It should be more or less final and it adds full support for pixman_blt source/destination areas overlapping to pixman. Also delegates are removed for pixman_blt, they really look like an overkill for this simple function. > > On what kind of platform and with which version of gcc are you getting > > proper tail call here? > > I meant that the > > _pixman_implemenation_blt() -> neon_blt() > > would be a tail call. GCC v 4.3.2 on x86-32 produces: > > _pixman_implementation_blt: > pushl %ebp > movl %esp, %ebp > movl 8(%ebp), %edx > popl %ebp > movl 12(%edx), %ecx > jmp *%ecx > .size _pixman_implementation_blt, > .-_pixman_implementation_blt > .p2align 4,,15 > > I don't see it being used and the overhead is rather hefty, which is > > also confirmed by benchmarking and profiling. > > Well, with a microbenchmark you can make anything stand out. > Ultimately, this function is called from XCopyArea(), and compared to > the marshalling of the client call and the long call chain inside the > X server, these 35 instructions or so, really are not very > significant. The code around XCopyArea also needs some cleanups and optimizations. Application benchmarks show that generally only a fraction of time is spent in the leaf pixel processing functions. The rest is spread across various layers. It's quite hard to start optimizing and simplifying all this stuff (and see the real effect) because lots of small cumulative performance losses can be found in a lot of places. Not all the images are large enough to ignore call overhead, there are also small icons, UI elements, fonts... Long call chains are also bad because processors usually have some limit on the depth of return address prediction. It's just 8 for ARM Cortex-A8, 12 for the original AMD Athlon, 24 for AMD Phenom. Intel most likely also has some limit here, but I did not find it easily. So when going up and down through some insanely long call chains frequently, a lot of function returns may suffer from mispredict penalty. And such chains may be very long because they may be originating from the user application and come through lots of layers. Whenever it's easy to remove some of the redundant nested calls, it's better to do this. Also callgraphs will have less boxes and will become easier to decipher :) -- Best regards, Siarhei Siamashka _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataOn 19-10-09 23:13, Siarhei Siamashka wrote:
> On Thursday 04 June 2009, Soeren Sandmann wrote: >> Siarhei Siamashka<siarhei.siamashka@...> writes: >>> What kind of guarantees (or the lack of) pixman and XRender are supposed >>> to provide when dealing with overlapping parts of images? >> >> (Adding xorg-devel). See this thread: >> >> http://lists.freedesktop.org/archives/xorg/2008-October/039346.html >> >> The guarantee that I would suggest for Render and pixman is that if >> any pixel is both read and written in the same request, then the >> result of the whole request is undefined, except that it obeys the >> clipping rules. >> >>> The practical use case could be scrolling of data inside of a single big >>> image. If rendering with overlapped source and destination areas is not >>> supported, a temporary image has to be created to achieve expected result >>> and this is an additional performance hit. >> >> Yes, scrolling is one thing that the current pixman API doesn't really >> provide. 'pixman_blt()' only deals with cases where the source and >> destination don't overlap. >> >> I think the best solution is to move all of the X primitives >> (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would >> probably look something like this: >> >> void >> pixman_copy_area (pixman_image_t *src, >> pixman_image_t *dest, >> pixman_gc_t *gc, >> int src_x, >> int src_y, >> int width, >> int height, >> int dest_x, >> int dest_y); >> >> and that would be guaranteed to handle overlapping src and dest. A >> pixman_gc_t would be a new type of object, corresponding to an X GC. >> >> pixman_blt() would then become a deprecated wrapper that would just >> call pixman_copy_area(). Same for pixman_fill() and a new >> pixman_fill_rectangles(). > > I'm not sure about pixman_gc_t since most of the needed operations are just > simple copies. What about starting with just introducing a variant > of 'pixman_blt' which is overlapping aware? > > I created a work-in-progress branch with 'pixman_blt' function (generic C > implementation for now) extended to support overlapped source/destination > case. A simple test program is also included: > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt Would using said branch give me 'magically' a performance boost (e.g. not make firefox unusably slow as it is now on an 600MHz cortex a8) or would I need to patch other libs (e.g. xrender) as well? regards, Koen _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataOn Friday 23 October 2009, Koen Kooi wrote:
> > I'm not sure about pixman_gc_t since most of the needed operations are just > > simple copies. What about starting with just introducing a variant > > of 'pixman_blt' which is overlapping aware? > > > > I created a work-in-progress branch with 'pixman_blt' function (generic C > > implementation for now) extended to support overlapped source/destination > > case. A simple test program is also included: > > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt First, this branch is outdated. There is a new branch with the final code :) http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2 > Would using said branch give me 'magically' a performance boost (e.g. > not make firefox unusably slow as it is now on an 600MHz cortex a8) or > would I need to patch other libs (e.g. xrender) as well? Not really, it's just a small extension of pixman functionality. Currently the handling of overlapped blt operation (for software rendering) is done in xorg-server. As it is the responsibility of pixman to provide CPU-specific SIMD optimizations (NEON for ARM Cortex-A8), it would be quite natural to move this work to pixman. So the next steps are to add NEON optimizations to pixman_plt and make sure that xserver takes advantage of these optimizations for the overlapped blit too. As for improving scrolling performance (and assuming a standard fbdev driver), the most important thing is to improve framebuffer memory performance. Right now framebuffer memory is mapped as noncached writecombine on OMAP3. Enabling write-through cache for it (with a simple kernel patch) improves scrolling and moving windows performance by 4x-5x factor (unless shadow framebuffer is used, which is also not good for performance). This works fine if nothing but CPU can modify framebuffer memory. But if GPU or DSP can also access framebuffer memory or compositing manager is used, everything gets more complicated. Cache invalidate operations will have to be inserted in appropriate places in order to ensure memory coherency and uniform view of its content from all the units. If default write-back cache is used instead of write-through, cache flush operations are needed too. Unpatched firefox is also quite slow for another reason - it tries to always work with 32bpp data internally, no matter what color depth is used for desktop. -- Best regards, Siarhei Siamashka _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataOn 26-10-09 01:57, Siarhei Siamashka wrote:
> On Friday 23 October 2009, Koen Kooi wrote: >>> I'm not sure about pixman_gc_t since most of the needed operations are just >>> simple copies. What about starting with just introducing a variant >>> of 'pixman_blt' which is overlapping aware? >>> >>> I created a work-in-progress branch with 'pixman_blt' function (generic C >>> implementation for now) extended to support overlapped source/destination >>> case. A simple test program is also included: >>> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt > > First, this branch is outdated. There is a new branch with the final code :) > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2 > >> Would using said branch give me 'magically' a performance boost (e.g. >> not make firefox unusably slow as it is now on an 600MHz cortex a8) or >> would I need to patch other libs (e.g. xrender) as well? > > Not really, it's just a small extension of pixman functionality. Currently > the handling of overlapped blt operation (for software rendering) is done > in xorg-server. As it is the responsibility of pixman to provide CPU-specific > SIMD optimizations (NEON for ARM Cortex-A8), it would be quite natural to > move this work to pixman. So the next steps are to add NEON optimizations > to pixman_plt and make sure that xserver takes advantage of these > optimizations for the overlapped blit too. So: 1) merge your branch into pixman master 2) move overlapped blit handling from xserver-xorg to pixman 3) add SIMD optimizations to pixman Would give us better scrolling, right? > As for improving scrolling performance (and assuming a standard fbdev driver), > the most important thing is to improve framebuffer memory performance. Right > now framebuffer memory is mapped as noncached writecombine on OMAP3. Enabling > write-through cache for it (with a simple kernel patch) improves scrolling > and moving windows performance by 4x-5x factor (unless shadow framebuffer is > used, which is also not good for performance). This works fine if nothing > but CPU can modify framebuffer memory. But if GPU or DSP can also access > framebuffer memory or compositing manager is used, everything gets more > complicated. Cache invalidate operations will have to be inserted in > appropriate places in order to ensure memory coherency and uniform view > of its content from all the units. If default write-back cache is used > instead of write-through, cache flush operations are needed too. I have no idea how the sgx or dsp handle the framebuffer, but I'm using both. > Unpatched firefox is also quite slow for another reason - it tries to > always work with 32bpp data internally, no matter what color depth is > used for desktop. I'm already using your patch for that :) regards, Koen _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataSiarhei Siamashka <siarhei.siamashka@...> writes:
> write-through cache for it (with a simple kernel patch) improves scrolling > and moving windows performance by 4x-5x factor (unless shadow framebuffer is > used, which is also not good for performance). What is the issue with the shadow framebuffer? It does add some extra memory traffic from the final copy, but I would have thought that everything else becoming much faster, would more than make up for it. Soren _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
|
|
Re: [RFC] Pixman & compositing with overlapping source and destination pixel dataOn Mon, 2009-10-26 at 17:52 +0100, Soeren Sandmann wrote:
> > write-through cache for it (with a simple kernel patch) improves scrolling > > and moving windows performance by 4x-5x factor (unless shadow framebuffer is > > used, which is also not good for performance). > > What is the issue with the shadow framebuffer? It does add some extra > memory traffic from the final copy, but I would have thought that > everything else becoming much faster, would more than make up for it. The shadow framebuffer has one additional source of overhead: the damage-region tracking it does to accumulate multiple updates into a single post-copy. If you do a lot of small writes to the shadow, the damage region gets complex and slows everything down. We have a trivial pathological reproduction case, involving spraying random small rectangles all over the screen. It also seems that enabling shadow also forces on backing stores, which is another unnecessary memory overhead and probably involves more copying. It is of course possible to imagine a "sloppy damage" tracker, to reduce overhead by limiting damage-region complexity. ISTR writing one for a VNC server a long time ago, simply because I didn't want or need precise region tracking. If we don't want to deal with uncached framebuffers in Pixman, and nobody seems entirely comfortable with cached physical framebuffers, maybe it's worth my digging my old code out and seeing if it can be adapted. -- ------ From: Jonathan Morton jonathan.morton@... _______________________________________________ cairo mailing list cairo@... http://lists.cairographics.org/mailman/listinfo/cairo |
| Free embeddable forum powered by Nabble | Forum Help |