[RFC] Pixman & compositing with overlapping source and destination pixel data

View: New views
14 Messages — Rating Filter:   Alert me  

[RFC] Pixman & compositing with overlapping source and destination pixel data

by Siarhei Siamashka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi,

What kind of guarantees (or the lack of) pixman and XRender are supposed to
provide when dealing with overlapping parts of images?

The practical use case could be scrolling of data inside of a single big
image. If rendering with overlapped source and destination areas is not
supported, a temporary image has to be created to achieve expected result
and this is an additional performance hit.

With the ongoing pixman refactoring and fixup for various inconsistencies, it
would be nice if the status of overlapped rendering could be clarified too.
The whole thing is kind of like memcpy vs. memove

The possible cases to handle are:
1. User provides the same destination and either source or mask images to some
of the pixman functions with overlapping source and destination areas.
2. User provides same memory area, but wrapped into different pixman images
to pixman rendering function, so that source and destination pixel data still
overlaps

Generic path in pixman fetches lines into a temporary buffer, so everything
should be fine in horizontal direction. For vertical direction, reversing the
order of handling lines (and using negative stride) could be also possible.
So if overlapped rendering is to be supported, technically there should not be
any big problems. Performance would be worse in the case when overlapped
rendering has to be sent to the generic path, but it would provide incorrect
(or unexpected) results in the fast path anyway. Overlapping-aware fast path
functions can be also created.

What do you think about the whole issue?

--
Best regards,
Siarhei Siamashka
_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Jonathan Morton :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed, 2009-06-03 at 12:42 +0300, Siarhei Siamashka wrote:
> Hi,
>
> What kind of guarantees (or the lack of) pixman and XRender are supposed to
> provide when dealing with overlapping parts of images?
>
> The practical use case could be scrolling of data inside of a single big
> image. If rendering with overlapped source and destination areas is not
> supported, a temporary image has to be created to achieve expected result
> and this is an additional performance hit.

> What do you think about the whole issue?

We're seeing exactly this kind of problem when scrolling in Firefox.  We
can speed up the primitives that Firefox uses to draw, but at the moment
scrolling is limited by the interface from Xorg to Pixman used for
XCopyArea.  We know that Pixman is being used, because we see a
performance improvement when we turn on a NEON version of pixman_blt().

When an overlapped XCopyArea is issued - as is typical for a short
scroll in any direction - XCopyArea breaks up the request into
individual scanline fetches and puts.  This produces correct results,
but because each scanline is done via two full Pixman call chains, it is
hideously slow.  By that I mean that it can be about a third of the
speed of a full copy-out followed by copy-back.

(In fact, I think XCopyArea does this regardless of whether the areas
actually overlap, if the source and dest images are the same.  Ugh.)

If Pixman were able to guarantee that "reasonable" overlapped copy
requests would behave like a "move", this highly inefficient special
case could be removed from Xorg, and there would be more opportunities
for optimisation within Pixman itself, where they are more likely to be
accepted.  "Reasonable" in this case could mean that the same image must
be referenced for both ends, rather than the "accidental overlap"
scenario.

The row-stride argument of the blitters already accepts a signed
argument, so in theory this should work just fine.

--
------
From: Jonathan Morton
      jonathan.morton@...


_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Soeren Sandmann-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Siarhei Siamashka <siarhei.siamashka@...> writes:

> What kind of guarantees (or the lack of) pixman and XRender are supposed to
> provide when dealing with overlapping parts of images?

(Adding xorg-devel). See this thread:

    http://lists.freedesktop.org/archives/xorg/2008-October/039346.html

The guarantee that I would suggest for Render and pixman is that if
any pixel is both read and written in the same request, then the
result of the whole request is undefined, except that it obeys the
clipping rules.

> The practical use case could be scrolling of data inside of a single big
> image. If rendering with overlapped source and destination areas is not
> supported, a temporary image has to be created to achieve expected result
> and this is an additional performance hit.

Yes, scrolling is one thing that the current pixman API doesn't really
provide. 'pixman_blt()' only deals with cases where the source and
destination don't overlap.

I think the best solution is to move all of the X primitives
(CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would
probably look something like this:

        void
        pixman_copy_area (pixman_image_t *src,
                          pixman_image_t *dest,
                          pixman_gc_t *gc,
                          int src_x,
                          int src_y,
                          int width,
                          int height,
                          int dest_x,
                          int dest_y);

and that would be guaranteed to handle overlapping src and dest. A
pixman_gc_t would be a new type of object, corresponding to an X GC.

pixman_blt() would then become a deprecated wrapper that would just
call pixman_copy_area(). Same for pixman_fill() and a new
pixman_fill_rectangles().

> Generic path in pixman fetches lines into a temporary buffer, so everything
> should be fine in horizontal direction. For vertical direction, reversing the
> order of handling lines (and using negative stride) could be also possible.
> So if overlapped rendering is to be supported, technically there should not be
> any big problems. Performance would be worse in the case when overlapped
> rendering has to be sent to the generic path, but it would provide incorrect
> (or unexpected) results in the fast path anyway. Overlapping-aware fast path
> functions can be also created.

I am not necessarily opposed to making Render guarantee correct
operation for overlapping sources and destinations if someone wants to
fix the implementation. However,

- many of the operators aren't all that useful when used with
  overlapping sources and destinations;

- to be useful for scrolling windows, applications need to get
  GraphicsExposure events, and Render currently doesn't generate
  those;

- for images with transformations, you will need to make a copy
  anyway;

- you'll need to fix the server and drivers so that they either
  implement this correctly, or fall back to software.


Soren
_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Bill Spitzak-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Cairo can do a transformation when copying (like a rotation or scale). I
think it is best to say that no guarantee is made when the source and
destination intersect. You could say it works if there is no
transformation other than integer translation but it seems a little
artificial to me. It seems most consistent to say that nothing is allowed.

Also on modern graphics systems it is probably a lot easier to write and
just as fast to just draw the entire image over again in it's new
position, rather than trying to copy areas. Or if copying an area really
does help, making the entire image in an off-screen surface and copying
from that would work better.

Soeren Sandmann wrote:
> Siarhei Siamashka <siarhei.siamashka@...> writes:
>
>> What kind of guarantees (or the lack of) pixman and XRender are supposed to
>> provide when dealing with overlapping parts of images?


--
Bill Spitzak, Senior Software Engineer
The Foundry, 1 Wardour Street, London, W1D 6PA, UK
Tel: +44 (0)20 7434 0449 * Fax: +44 (0)20 7434 1550 * Web:
www.thefoundry.co.uk
The Foundry Visionmongers Ltd * Registered in England and Wales No: 4642027

_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Siarhei Siamashka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thursday 04 June 2009, Soeren Sandmann wrote:

> Siarhei Siamashka <siarhei.siamashka@...> writes:
> > What kind of guarantees (or the lack of) pixman and XRender are supposed
> > to provide when dealing with overlapping parts of images?
>
> (Adding xorg-devel). See this thread:
>
>     http://lists.freedesktop.org/archives/xorg/2008-October/039346.html
>
> The guarantee that I would suggest for Render and pixman is that if
> any pixel is both read and written in the same request, then the
> result of the whole request is undefined, except that it obeys the
> clipping rules.
>
> > The practical use case could be scrolling of data inside of a single big
> > image. If rendering with overlapped source and destination areas is not
> > supported, a temporary image has to be created to achieve expected result
> > and this is an additional performance hit.
>
> Yes, scrolling is one thing that the current pixman API doesn't really
> provide. 'pixman_blt()' only deals with cases where the source and
> destination don't overlap.
>
> I think the best solution is to move all of the X primitives
> (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would
> probably look something like this:
>
>         void
>         pixman_copy_area (pixman_image_t *src,
>                           pixman_image_t *dest,
>                           pixman_gc_t *gc,
>                           int src_x,
>                           int src_y,
>                           int width,
>                           int height,
>                           int dest_x,
>                           int dest_y);
>
> and that would be guaranteed to handle overlapping src and dest. A
> pixman_gc_t would be a new type of object, corresponding to an X GC.
>
> pixman_blt() would then become a deprecated wrapper that would just
> call pixman_copy_area(). Same for pixman_fill() and a new
> pixman_fill_rectangles().

I'm not sure about pixman_gc_t since most of the needed operations are just
simple copies. What about starting with just introducing a variant
of 'pixman_blt' which is overlapping aware?

I created a work-in-progress branch with 'pixman_blt' function (generic C
implementation for now) extended to support overlapped source/destination
case. A simple test program is also included:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt

Making use of the already existing SIMD optimized pixel copy functions should
provide fast scrolling in all the directions except for from left to right.
This special case will require a SIMD optimized backwards copy.

I wonder if it makes sense to drop delegates support for pixman_blt and make
call chain shorter when introducing SIMD optimized copies? It seems to be a
little bit overdesigned here.

Running test program for the current pixman master (SSE2 optimized):

$ test/overlapped-blt-test
bpp=8, supported=0, normal_ok=0, overlapped_ok=0
bpp=16, supported=1, normal_ok=1, overlapped_ok=0
bpp=32, supported=1, normal_ok=1, overlapped_ok=0

Running test program for the pixman from this branch with generic
C version of pixman_blt (8bpp now uses a fallback to generic C
implementation):

$ test/overlapped-blt-test
bpp=8, supported=1, normal_ok=1, overlapped_ok=1
bpp=16, supported=1, normal_ok=1, overlapped_ok=0
bpp=32, supported=1, normal_ok=1, overlapped_ok=0


--
Best regards,
Siarhei Siamashka
_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Soeren Sandmann-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi,

> On Thursday 04 June 2009, Soeren Sandmann wrote:
> > Siarhei Siamashka <siarhei.siamashka@...> writes:
> > > What kind of guarantees (or the lack of) pixman and XRender are supposed
> > > to provide when dealing with overlapping parts of images?
> >
> > (Adding xorg-devel). See this thread:
> >
> >     http://lists.freedesktop.org/archives/xorg/2008-October/039346.html
> >
> > The guarantee that I would suggest for Render and pixman is that if
> > any pixel is both read and written in the same request, then the
> > result of the whole request is undefined, except that it obeys the
> > clipping rules.
> >
> > > The practical use case could be scrolling of data inside of a single big
> > > image. If rendering with overlapped source and destination areas is not
> > > supported, a temporary image has to be created to achieve expected result
> > > and this is an additional performance hit.
> >
> > Yes, scrolling is one thing that the current pixman API doesn't really
> > provide. 'pixman_blt()' only deals with cases where the source and
> > destination don't overlap.
> >
> > I think the best solution is to move all of the X primitives
> > (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would
> > probably look something like this:
> >
> >         void
> >         pixman_copy_area (pixman_image_t *src,
> >                           pixman_image_t *dest,
> >                           pixman_gc_t *gc,
> >                           int src_x,
> >                           int src_y,
> >                           int width,
> >                           int height,
> >                           int dest_x,
> >                           int dest_y);
> >
> > and that would be guaranteed to handle overlapping src and dest. A
> > pixman_gc_t would be a new type of object, corresponding to an X GC.
> >
> > pixman_blt() would then become a deprecated wrapper that would just
> > call pixman_copy_area(). Same for pixman_fill() and a new
> > pixman_fill_rectangles().
>
> I'm not sure about pixman_gc_t since most of the needed operations are just
> simple copies. What about starting with just introducing a variant
> of 'pixman_blt' which is overlapping aware?

The pixman_blt() interface is misdesigned for two reasons: (1) the
strides are given in number-of-uint32_ts, which gratuitously limits
the types of images that can be processed, and (2) it can fail if it
doesn't like the input for some reason.

At the same time, having the core primitives available on the client
side is useful in some cases, and the software implementation of them
can more easily be optimized with SIMD instructions in pixman.

Moving core rendering into pixman solves both issues at the same time.

But that said, I am not opposed to extending pixman_blt() to support
overlapping copies. That is certainly a simpler first step.

> I created a work-in-progress branch with 'pixman_blt' function (generic C
> implementation for now) extended to support overlapped source/destination
> case. A simple test program is also included:
> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt
>
> Making use of the already existing SIMD optimized pixel copy functions should
> provide fast scrolling in all the directions except for from left to right.
> This special case will require a SIMD optimized backwards copy.
>
> I wonder if it makes sense to drop delegates support for pixman_blt and make
> call chain shorter when introducing SIMD optimized copies? It seems to be a
> little bit overdesigned here.

How would you support SSE2 and MMX in the same binary then?

Also, I really don't see much potential for saving here. For a NEON
implementation of blt, the callchain would be:

   pixman_blt() ->  _pixman_implementation_blt() -> neon_blt()

and getting rid of delegates wouldn't really affect that at all. You
could get rid of the _pixman_implementation_blt() call by making it a
macro, but as I mentioned before, gcc turns it into a tail call that
reused the arguments on the stack, so the overhead really is minimal.


Soren
_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Siarhei Siamashka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tuesday 20 October 2009, Soeren Sandmann wrote:
[...]

> > I'm not sure about pixman_gc_t since most of the needed operations are
> > just simple copies. What about starting with just introducing a variant
> > of 'pixman_blt' which is overlapping aware?
>
> The pixman_blt() interface is misdesigned for two reasons: (1) the
> strides are given in number-of-uint32_ts, which gratuitously limits
> the types of images that can be processed, and (2) it can fail if it
> doesn't like the input for some reason.
>
> At the same time, having the core primitives available on the client
> side is useful in some cases, and the software implementation of them
> can more easily be optimized with SIMD instructions in pixman.
>
> Moving core rendering into pixman solves both issues at the same time.

I don't have any strong opinion about API updates. In any case, smooth
upgrade path needs to be taken care of and the users should be prevented
from using incompatible versions of client applications/libraries and
pixmap. An introduction of a new function may be the best way, it can also
solve some of the design issues.

> But that said, I am not opposed to extending pixman_blt() to support
> overlapping copies. That is certainly a simpler first step.

Yes, the functionality itself can be introduced first (without breaking
anything). Wrapping it into a better API can be done as the natural next
step.

> > I created a work-in-progress branch with 'pixman_blt' function (generic C
> > implementation for now) extended to support overlapped source/destination
> > case. A simple test program is also included:
> > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt
> >
> > Making use of the already existing SIMD optimized pixel copy functions
> > should provide fast scrolling in all the directions except for from left
> > to right. This special case will require a SIMD optimized backwards copy.
> >
> > I wonder if it makes sense to drop delegates support for pixman_blt and
> > make call chain shorter when introducing SIMD optimized copies? It seems
> > to be a little bit overdesigned here.
>
> How would you support SSE2 and MMX in the same binary then?

The most simple way is to do it in my opinion is the following.

First introduce something like 'pixman_init' function. Right now CPU type
detection is done on the first call to the function. It introduces some
minor overhead by having an extra pointer check on each function call.
Another problem is that we can't be completely sure that CPU capabilities
detection check is always fully reentrant. For example, some platforms may
try to set a signal handler and expect to catch SIGILL or something like
this.

This initialization function would just detect CPU capabilities and set some
function pointers. The whole CPU-specific implementation of 'pixman_blt'
may be just called via this pointer directly by a client. Or 'pixman_blt' can
be just a small thunk which does a call via function pointer, passes exactly
the same arguments to it and does nothing more. In this case there will be
really no excuse for the compiler for not using tail call, see below.

> Also, I really don't see much potential for saving here. For a NEON
> implementation of blt, the callchain would be:
>
>    pixman_blt() ->  _pixman_implementation_blt() -> neon_blt()
>
> and getting rid of delegates wouldn't really affect that at all. You
> could get rid of the _pixman_implementation_blt() call by making it a
> macro, but as I mentioned before, gcc turns it into a tail call that
> reused the arguments on the stack, so the overhead really is minimal.

On what kind of platform and with which version of gcc are you getting
proper tail call here? I don't see it being used and the overhead is
rather hefty, which is also confirmed by benchmarking and profiling.

Even if gcc can reuse some part of the arguments which are already on
stack in some cases, different platforms may have different ABI and calling
conventions. For example, for ARM and x86-64, the first few arguments
are passed in registers, the rest is on stack. Relying on the compiler to
always do the job properly identifying tail call possibilities in all cases
may be not the very best idea.

C:

PIXMAN_EXPORT pixman_bool_t
pixman_blt (uint32_t *src_bits,
            uint32_t *dst_bits,
            int       src_stride,
            int       dst_stride,
            int       src_bpp,
            int       dst_bpp,
            int       src_x,
            int       src_y,
            int       dst_x,
            int       dst_y,
            int       width,
            int       height)
{
    if (!imp)
        imp = _pixman_choose_implementation ();

    return _pixman_implementation_blt (imp, src_bits, dst_bits, src_stride,
                                       dst_stride,
                                       src_bpp, dst_bpp,
                                       src_x, src_y,
                                       dst_x, dst_y,
                                       width, height);
}

x86, gcc 4.3.2:

00000420 <pixman_blt>:
 420:   55                      push   %ebp
 421:   89 e5                   mov    %esp,%ebp
 423:   83 ec 38                sub    $0x38,%esp
 426:   8b 15 00 00 00 00       mov    0x0,%edx
 42c:   85 d2                   test   %edx,%edx
 42e:   74 68                   je     498 <pixman_blt+0x78>
 430:   8b 45 34                mov    0x34(%ebp),%eax
 433:   89 44 24 30             mov    %eax,0x30(%esp)
 437:   8b 45 30                mov    0x30(%ebp),%eax
 43a:   89 44 24 2c             mov    %eax,0x2c(%esp)
 43e:   8b 45 2c                mov    0x2c(%ebp),%eax
 441:   89 44 24 28             mov    %eax,0x28(%esp)
 445:   8b 45 28                mov    0x28(%ebp),%eax
 448:   89 44 24 24             mov    %eax,0x24(%esp)
 44c:   8b 45 24                mov    0x24(%ebp),%eax
 44f:   89 44 24 20             mov    %eax,0x20(%esp)
 453:   8b 45 20                mov    0x20(%ebp),%eax
 456:   89 44 24 1c             mov    %eax,0x1c(%esp)
 45a:   8b 45 1c                mov    0x1c(%ebp),%eax
 45d:   89 44 24 18             mov    %eax,0x18(%esp)
 461:   8b 45 18                mov    0x18(%ebp),%eax
 464:   89 44 24 14             mov    %eax,0x14(%esp)
 468:   8b 45 14                mov    0x14(%ebp),%eax
 46b:   89 44 24 10             mov    %eax,0x10(%esp)
 46f:   8b 45 10                mov    0x10(%ebp),%eax
 472:   89 44 24 0c             mov    %eax,0xc(%esp)
 476:   8b 45 0c                mov    0xc(%ebp),%eax
 479:   89 44 24 08             mov    %eax,0x8(%esp)
 47d:   8b 45 08                mov    0x8(%ebp),%eax
 480:   89 44 24 04             mov    %eax,0x4(%esp)
 484:   a1 00 00 00 00          mov    0x0,%eax
 489:   89 04 24                mov    %eax,(%esp)
 48c:   e8 fc ff ff ff          call   48d <pixman_blt+0x6d>
 491:   c9                      leave
 492:   c3                      ret
 493:   90                      nop
 494:   8d 74 26 00             lea    0x0(%esi),%esi
 498:   e8 fc ff ff ff          call   499 <pixman_blt+0x79>
 49d:   a3 00 00 00 00          mov    %eax,0x0
 4a2:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
 4a8:   eb 86                   jmp    430 <pixman_blt+0x10>
 4aa:   8d b6 00 00 00 00       lea    0x0(%esi),%esi


ARM, gcc 4.3.4:

000003a4 <pixman_blt>:
 3a4:   e92d41f0        push    {r4, r5, r6, r7, r8, lr}
 3a8:   e59f4088        ldr     r4, [pc, #136]  ; 438 <pixman_blt+0x94>
 3ac:   e1a08001        mov     r8, r1
 3b0:   e24dd028        sub     sp, sp, #40     ; 0x28
 3b4:   e5941000        ldr     r1, [r4]
 3b8:   e1a06000        mov     r6, r0
 3bc:   e3510000        cmp     r1, #0  ; 0x0
 3c0:   e1a07002        mov     r7, r2
 3c4:   e1a05003        mov     r5, r3
 3c8:   0a000017        beq     42c <pixman_blt+0x88>
 3cc:   e59dc040        ldr     ip, [sp, #64]
 3d0:   e5940000        ldr     r0, [r4]
 3d4:   e59de044        ldr     lr, [sp, #68]
 3d8:   e58dc004        str     ip, [sp, #4]
 3dc:   e59dc048        ldr     ip, [sp, #72]
 3e0:   e58de008        str     lr, [sp, #8]
 3e4:   e59de04c        ldr     lr, [sp, #76]
 3e8:   e58dc00c        str     ip, [sp, #12]
 3ec:   e59dc050        ldr     ip, [sp, #80]
 3f0:   e58de010        str     lr, [sp, #16]
 3f4:   e59d405c        ldr     r4, [sp, #92]
 3f8:   e58dc014        str     ip, [sp, #20]
 3fc:   e59de054        ldr     lr, [sp, #84]
 400:   e59dc058        ldr     ip, [sp, #88]
 404:   e1a01006        mov     r1, r6
 408:   e1a02008        mov     r2, r8
 40c:   e1a03007        mov     r3, r7
 410:   e58d5000        str     r5, [sp]
 414:   e58de018        str     lr, [sp, #24]
 418:   e58dc01c        str     ip, [sp, #28]
 41c:   e58d4020        str     r4, [sp, #32]
 420:   ebfffffe        bl      0 <_pixman_implementation_blt>
 424:   e28dd028        add     sp, sp, #40     ; 0x28
 428:   e8bd81f0        pop     {r4, r5, r6, r7, r8, pc}
 42c:   ebfffffe        bl      0 <_pixman_choose_implementation>
 430:   e5840000        str     r0, [r4]
 434:   eaffffe4        b       3cc <pixman_blt+0x28>
 438:   00000000        .word   0x00000000

--
Best regards,
Siarhei Siamashka
_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Soeren Sandmann-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Siarhei Siamashka <siarhei.siamashka@...> writes:

> First introduce something like 'pixman_init' function. Right now CPU type
> detection is done on the first call to the function. It introduces some
> minor overhead by having an extra pointer check on each function call.
> Another problem is that we can't be completely sure that CPU capabilities
> detection check is always fully reentrant. For example, some platforms may
> try to set a signal handler and expect to catch SIGILL or something like
> this.
>
> This initialization function would just detect CPU capabilities and set some
> function pointers. The whole CPU-specific implementation of 'pixman_blt'
> may be just called via this pointer directly by a client. Or 'pixman_blt' can
> be just a small thunk which does a call via function pointer, passes exactly
> the same arguments to it and does nothing more. In this case there will be
> really no excuse for the compiler for not using tail call, see
> below.

Adding a pixman_init() that applications would be required to call
first, would not be a compatible change. If we are designing new API,
then I really think it should be done in such a way that it can be
extended to handle the core rendering primitives.

It does likely make sense to make the pixman_implementation_t type
public at some point (renamed to pixman_t probably) and then pass it
directly to the various entry points. This would be necessary if we
add hardware acceleration to pixman.

> > Also, I really don't see much potential for saving here. For a NEON
> > implementation of blt, the callchain would be:
> >
> >    pixman_blt() ->  _pixman_implementation_blt() -> neon_blt()
> >
> > and getting rid of delegates wouldn't really affect that at all. You
> > could get rid of the _pixman_implementation_blt() call by making it a
> > macro, but as I mentioned before, gcc turns it into a tail call that
> > reused the arguments on the stack, so the overhead really is minimal.
>
> On what kind of platform and with which version of gcc are you getting
> proper tail call here?

I meant that the

        _pixman_implemenation_blt() -> neon_blt()

would be a tail call. GCC v 4.3.2 on x86-32 produces:

        _pixman_implementation_blt:
                pushl   %ebp
                movl    %esp, %ebp
                movl    8(%ebp), %edx
                popl    %ebp
                movl    12(%edx), %ecx
                jmp     *%ecx
                .size   _pixman_implementation_blt,
                .-_pixman_implementation_blt
                .p2align 4,,15

> I don't see it being used and the overhead is rather hefty, which is
> also confirmed by benchmarking and profiling.

Well, with a microbenchmark you can make anything stand out.
Ultimately, this function is called from XCopyArea(), and compared to
the marshalling of the client call and the long call chain inside the
X server, these 35 instructions or so, really are not very
significant.

I think Jonathan said that pixman_blt() was getting called once per
scanline, but I'm pretty sure that's not the case. (Or if it is, that
would be the first thing to fix before worrying about eliminating this
call).


Soren
_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Siarhei Siamashka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wednesday 21 October 2009, Soeren Sandmann wrote:

> Siarhei Siamashka <siarhei.siamashka@...> writes:
> > First introduce something like 'pixman_init' function. Right now CPU type
> > detection is done on the first call to the function. It introduces some
> > minor overhead by having an extra pointer check on each function call.
> > Another problem is that we can't be completely sure that CPU capabilities
> > detection check is always fully reentrant. For example, some platforms
> > may try to set a signal handler and expect to catch SIGILL or something
> > like this.
> >
> > This initialization function would just detect CPU capabilities and set
> > some function pointers. The whole CPU-specific implementation of
> > 'pixman_blt' may be just called via this pointer directly by a client. Or
> > 'pixman_blt' can be just a small thunk which does a call via function
> > pointer, passes exactly the same arguments to it and does nothing more.
> > In this case there will be really no excuse for the compiler for not
> > using tail call, see
> > below.
>
> Adding a pixman_init() that applications would be required to call
> first, would not be a compatible change. If we are designing new API,
> then I really think it should be done in such a way that it can be
> extended to handle the core rendering primitives.
OK, then let's not touch pixman API for now :)

> It does likely make sense to make the pixman_implementation_t type
> public at some point (renamed to pixman_t probably) and then pass it
> directly to the various entry points. This would be necessary if we
> add hardware acceleration to pixman.
>
> > > Also, I really don't see much potential for saving here. For a NEON
> > > implementation of blt, the callchain would be:
> > >
> > >    pixman_blt() ->  _pixman_implementation_blt() -> neon_blt()
> > >
> > > and getting rid of delegates wouldn't really affect that at all. You
> > > could get rid of the _pixman_implementation_blt() call by making it a
> > > macro, but as I mentioned before, gcc turns it into a tail call that
> > > reused the arguments on the stack, so the overhead really is minimal.
Could you have a look and review the patches from the following branch?
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2

It should be more or less final and it adds full support for pixman_blt
source/destination areas overlapping to pixman. Also delegates are removed
for pixman_blt, they really look like an overkill for this simple function.

> > On what kind of platform and with which version of gcc are you getting
> > proper tail call here?
>
> I meant that the
>
>         _pixman_implemenation_blt() -> neon_blt()
>
> would be a tail call. GCC v 4.3.2 on x86-32 produces:
>
>         _pixman_implementation_blt:
>                 pushl   %ebp
>                 movl    %esp, %ebp
>                 movl    8(%ebp), %edx
>                 popl    %ebp
>                 movl    12(%edx), %ecx
>                 jmp     *%ecx
>                 .size   _pixman_implementation_blt,
>                 .-_pixman_implementation_blt
>                 .p2align 4,,15
OK, thanks, I see.

> > I don't see it being used and the overhead is rather hefty, which is
> > also confirmed by benchmarking and profiling.
>
> Well, with a microbenchmark you can make anything stand out.
> Ultimately, this function is called from XCopyArea(), and compared to
> the marshalling of the client call and the long call chain inside the
> X server, these 35 instructions or so, really are not very
> significant.

The code around XCopyArea also needs some cleanups and optimizations.

Application benchmarks show that generally only a fraction of time is spent
in the leaf pixel processing functions. The rest is spread across various
layers. It's quite hard to start optimizing and simplifying all this stuff
(and see the real effect) because lots of small cumulative performance losses
can be found in a lot of places. Not all the images are large enough to ignore
call overhead, there are also small icons, UI elements, fonts...

Long call chains are also bad because processors usually have some limit on
the depth of return address prediction. It's just 8 for ARM Cortex-A8, 12 for
the original AMD Athlon, 24 for AMD Phenom. Intel most likely also has some
limit here, but I did not find it easily. So when going up and down through
some insanely long call chains frequently, a lot of function returns may
suffer from mispredict penalty. And such chains may be very long because
they may be originating from the user application and come through lots of
layers.

Whenever it's easy to remove some of the redundant nested calls, it's better
to do this. Also callgraphs will have less boxes and will become easier to
decipher :)

--
Best regards,
Siarhei Siamashka


_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

signature.asc (196 bytes) Download Attachment

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Koen Kooi-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On 19-10-09 23:13, Siarhei Siamashka wrote:

> On Thursday 04 June 2009, Soeren Sandmann wrote:
>> Siarhei Siamashka<siarhei.siamashka@...>  writes:
>>> What kind of guarantees (or the lack of) pixman and XRender are supposed
>>> to provide when dealing with overlapping parts of images?
>>
>> (Adding xorg-devel). See this thread:
>>
>>      http://lists.freedesktop.org/archives/xorg/2008-October/039346.html
>>
>> The guarantee that I would suggest for Render and pixman is that if
>> any pixel is both read and written in the same request, then the
>> result of the whole request is undefined, except that it obeys the
>> clipping rules.
>>
>>> The practical use case could be scrolling of data inside of a single big
>>> image. If rendering with overlapped source and destination areas is not
>>> supported, a temporary image has to be created to achieve expected result
>>> and this is an additional performance hit.
>>
>> Yes, scrolling is one thing that the current pixman API doesn't really
>> provide. 'pixman_blt()' only deals with cases where the source and
>> destination don't overlap.
>>
>> I think the best solution is to move all of the X primitives
>> (CopyArea, DrawLine, DrawArc, etc.) into pixman. For CopyArea it would
>> probably look something like this:
>>
>>          void
>>          pixman_copy_area (pixman_image_t *src,
>>                            pixman_image_t *dest,
>>                            pixman_gc_t *gc,
>>                            int src_x,
>>                            int src_y,
>>                            int width,
>>                            int height,
>>                            int dest_x,
>>                            int dest_y);
>>
>> and that would be guaranteed to handle overlapping src and dest. A
>> pixman_gc_t would be a new type of object, corresponding to an X GC.
>>
>> pixman_blt() would then become a deprecated wrapper that would just
>> call pixman_copy_area(). Same for pixman_fill() and a new
>> pixman_fill_rectangles().
>
> I'm not sure about pixman_gc_t since most of the needed operations are just
> simple copies. What about starting with just introducing a variant
> of 'pixman_blt' which is overlapping aware?
>
> I created a work-in-progress branch with 'pixman_blt' function (generic C
> implementation for now) extended to support overlapped source/destination
> case. A simple test program is also included:
> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt

Would using said branch give me 'magically' a performance boost (e.g.
not make firefox unusably slow as it is now on an 600MHz cortex a8) or
would I need to patch other libs (e.g. xrender) as well?

regards,

Koen

_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Siarhei Siamashka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Friday 23 October 2009, Koen Kooi wrote:
> > I'm not sure about pixman_gc_t since most of the needed operations are just
> > simple copies. What about starting with just introducing a variant
> > of 'pixman_blt' which is overlapping aware?
> >
> > I created a work-in-progress branch with 'pixman_blt' function (generic C
> > implementation for now) extended to support overlapped source/destination
> > case. A simple test program is also included:
> > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt
 
First, this branch is outdated. There is a new branch with the final code :)
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2

> Would using said branch give me 'magically' a performance boost (e.g.
> not make firefox unusably slow as it is now on an 600MHz cortex a8) or
> would I need to patch other libs (e.g. xrender) as well?

Not really, it's just a small extension of pixman functionality. Currently
the handling of overlapped blt operation (for software rendering) is done
in xorg-server. As it is the responsibility of pixman to provide CPU-specific
SIMD optimizations (NEON for ARM Cortex-A8), it would be quite natural to
move this work to pixman. So the next steps are to add NEON optimizations
to pixman_plt and make sure that xserver takes advantage of these
optimizations for the overlapped blit too.

As for improving scrolling performance (and assuming a standard fbdev driver),
the most important thing is to improve framebuffer memory performance. Right
now framebuffer memory is mapped as noncached writecombine on OMAP3. Enabling
write-through cache for it (with a simple kernel patch) improves scrolling
and moving windows performance by 4x-5x factor (unless shadow framebuffer is
used, which is also not good for performance). This works fine if nothing
but CPU can modify framebuffer memory. But if GPU or DSP can also access
framebuffer memory or compositing manager is used, everything gets more
complicated. Cache invalidate operations will have to be inserted in
appropriate places in order to ensure memory coherency and uniform view
of its content from all the units. If default write-back cache is used
instead of write-through, cache flush operations are needed too.

Unpatched firefox is also quite slow for another reason - it tries to
always work with 32bpp data internally, no matter what color depth is
used for desktop.

--
Best regards,
Siarhei Siamashka
_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Koen Kooi-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On 26-10-09 01:57, Siarhei Siamashka wrote:

> On Friday 23 October 2009, Koen Kooi wrote:
>>> I'm not sure about pixman_gc_t since most of the needed operations are just
>>> simple copies. What about starting with just introducing a variant
>>> of 'pixman_blt' which is overlapping aware?
>>>
>>> I created a work-in-progress branch with 'pixman_blt' function (generic C
>>> implementation for now) extended to support overlapped source/destination
>>> case. A simple test program is also included:
>>> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt
>
> First, this branch is outdated. There is a new branch with the final code :)
> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt-v2
>
>> Would using said branch give me 'magically' a performance boost (e.g.
>> not make firefox unusably slow as it is now on an 600MHz cortex a8) or
>> would I need to patch other libs (e.g. xrender) as well?
>
> Not really, it's just a small extension of pixman functionality. Currently
> the handling of overlapped blt operation (for software rendering) is done
> in xorg-server. As it is the responsibility of pixman to provide CPU-specific
> SIMD optimizations (NEON for ARM Cortex-A8), it would be quite natural to
> move this work to pixman. So the next steps are to add NEON optimizations
> to pixman_plt and make sure that xserver takes advantage of these
> optimizations for the overlapped blit too.

So:

1) merge your branch into pixman master
2) move overlapped blit handling from xserver-xorg to pixman
3) add SIMD optimizations to pixman

Would give us better scrolling, right?

> As for improving scrolling performance (and assuming a standard fbdev driver),
> the most important thing is to improve framebuffer memory performance. Right
> now framebuffer memory is mapped as noncached writecombine on OMAP3. Enabling
> write-through cache for it (with a simple kernel patch) improves scrolling
> and moving windows performance by 4x-5x factor (unless shadow framebuffer is
> used, which is also not good for performance). This works fine if nothing
> but CPU can modify framebuffer memory. But if GPU or DSP can also access
> framebuffer memory or compositing manager is used, everything gets more
> complicated. Cache invalidate operations will have to be inserted in
> appropriate places in order to ensure memory coherency and uniform view
> of its content from all the units. If default write-back cache is used
> instead of write-through, cache flush operations are needed too.

I have no idea how the sgx or dsp handle the framebuffer, but I'm using
both.

> Unpatched firefox is also quite slow for another reason - it tries to
> always work with 32bpp data internally, no matter what color depth is
> used for desktop.

I'm already using your patch for that :)

regards,

Koen

_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Soeren Sandmann-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Siarhei Siamashka <siarhei.siamashka@...> writes:

> write-through cache for it (with a simple kernel patch) improves scrolling
> and moving windows performance by 4x-5x factor (unless shadow framebuffer is
> used, which is also not good for performance).

What is the issue with the shadow framebuffer? It does add some extra
memory traffic from the final copy, but I would have thought that
everything else becoming much faster, would more than make up for it.


Soren
_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: [RFC] Pixman & compositing with overlapping source and destination pixel data

by Jonathan Morton :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Mon, 2009-10-26 at 17:52 +0100, Soeren Sandmann wrote:

> > write-through cache for it (with a simple kernel patch) improves scrolling
> > and moving windows performance by 4x-5x factor (unless shadow framebuffer is
> > used, which is also not good for performance).
>
> What is the issue with the shadow framebuffer? It does add some extra
> memory traffic from the final copy, but I would have thought that
> everything else becoming much faster, would more than make up for it.

The shadow framebuffer has one additional source of overhead: the
damage-region tracking it does to accumulate multiple updates into a
single post-copy.  If you do a lot of small writes to the shadow, the
damage region gets complex and slows everything down.  We have a trivial
pathological reproduction case, involving spraying random small
rectangles all over the screen.

It also seems that enabling shadow also forces on backing stores, which
is another unnecessary memory overhead and probably involves more
copying.

It is of course possible to imagine a "sloppy damage" tracker, to reduce
overhead by limiting damage-region complexity.  ISTR writing one for a
VNC server a long time ago, simply because I didn't want or need precise
region tracking.

If we don't want to deal with uncached framebuffers in Pixman, and
nobody seems entirely comfortable with cached physical framebuffers,
maybe it's worth my digging my old code out and seeing if it can be
adapted.

--
------
From: Jonathan Morton
      jonathan.morton@...


_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo