Pixman glyph performance, and beyond!

by Chris Wilson

So I'm reviewing how cairo handles compositing, looking at how we may
drive cairo-gl more efficiently. As part of that process, I've had the
opportunity to remove some overhead from within cairo-image. However,
glyph composition still suffers from substantial overhead since every
glyph is composited separately.

firefox-talos-gfx on a slow Celeron 600MHz:
# Overhead  Symbol
# ........  ......
    23.76%  [.] _pixman_run_fast_path
    23.34%  [.] sse2_composite_add_n_8888_8888_ca
    11.82%  [.] sse2_composite_over_n_8888_8888_ca
     6.31%  [.] pixman_image_composite
     4.69%  [.] walk_region_internal
     4.44%  [.] pixman_blt_sse2
     3.18%  [.] _pixman_image_validate
     2.30%  [.] sse2_composite_over_n_8_8888
     2.23%  [.] pixman_compute_composite_region32
     2.19%  [.] pixman_fill_sse2
     1.91%  [.] sse2_composite

(And to put it in perspective:
    46.35%  /usr/local/lib/libpixman-1.so.0.17.1                                                  
    28.25%  /home/ickle/src/cairo/src/.libs/libcairo.so.2.10905.0                                
    14.24%  [kernel])

Søren has looked at this problem in the past and begun work on
fast-path and faster-fast-path branches, looking to cache prior
fast-path resolutions. These are not yet as effective as one would hope.

How insane would it be to push the get_fast_path() to the user and to be
able to pass in the implementation + composite function instead of
performing the search every time? This would also be useful for spans.
And considering how most cairo operations are first performed to a mask,
cairo could very effectively cache the fast path for its most frequent
operations.
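
A rough sketch of what this might look like from the caller's side. None
of these names are pixman API: composite_fn_t, lookup_composite() and the
toy a8 routines are hypothetical stand-ins, and the point is only that
the search runs once per batch instead of once per glyph.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: a resolved compositing routine for one operator. */
typedef void (*composite_fn_t)(uint8_t *dst, const uint8_t *src, size_t n);

typedef enum { OP_ADD, OP_SRC } op_t;

static int lookups; /* counts how often the search runs */

/* Saturating ADD of 8-bit coverage values. */
static void add_a8(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)dst[i] + src[i];
        dst[i] = sum > 255 ? 255 : (uint8_t)sum;
    }
}

/* Plain copy of 8-bit coverage values. */
static void src_a8(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Stand-in for the fast-path search performed on every composite call. */
static composite_fn_t lookup_composite(op_t op)
{
    lookups++;
    return op == OP_ADD ? add_a8 : src_a8;
}

/* Composite many glyph rows, resolving the routine once for the batch. */
static void composite_glyph_rows(op_t op, uint8_t *dst,
                                 const uint8_t *const *rows,
                                 size_t n_rows, size_t width)
{
    composite_fn_t fn = lookup_composite(op); /* one search, many uses */
    assert(fn != NULL);
    for (size_t i = 0; i < n_rows; i++)
        fn(dst, rows[i], width);
}
```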

I'm particularly interested in suggestions and experiences from the
ARM/NEON guys as they seem to be suffering acutely from similar overheads
in pixman - and so I presume are also looking at this issue.
-ickle
--
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: Pixman glyph performance, and beyond!

by Soeren Sandmann

Chris Wilson <chris@...> writes:

> So I'm reviewing how cairo handles compositing, looking at how we may
> drive cairo-gl more efficiently. As part of that process, I've had the
> opportunity to remove some overhead from within cairo-image. However,
> glyph composition still suffers from substantial overhead since every
> glyph is composited separately.

The X server suffers from essentially the same problem, except worse
because each glyph is actually a new pixman_image_t, so there is
horrible overhead from malloc() etc. for each and every glyph.

> firefox-talos-gfx on a slow Celeron 600MHz:
> # Overhead  Symbol
> # ........  ......
>     23.76%  [.] _pixman_run_fast_path
>     23.34%  [.] sse2_composite_add_n_8888_8888_ca
>     11.82%  [.] sse2_composite_over_n_8888_8888_ca
>      6.31%  [.] pixman_image_composite
>      4.69%  [.] walk_region_internal
>      4.44%  [.] pixman_blt_sse2
>      3.18%  [.] _pixman_image_validate
>      2.30%  [.] sse2_composite_over_n_8_8888
>      2.23%  [.] pixman_compute_composite_region32
>      2.19%  [.] pixman_fill_sse2
>      1.91%  [.] sse2_composite

Are these percentages of the 46.35% below, so it is 23.76% of 46.35% =
roughly 11.01% for _pixman_run_fast_path?

Either way, it's clearly not good.

> (And to put it in perspective:
>     46.35%  /usr/local/lib/libpixman-1.so.0.17.1                                                  
>     28.25%  /home/ickle/src/cairo/src/.libs/libcairo.so.2.10905.0                                
>     14.24%  [kernel])
>
> Søren has looked at this problem in the past and begun work on
> fast-path and faster-fast-path branches, looking to cache prior
> fast-path resolutions.

The latest incarnation of that work is the 'flags' branch here:

    http://cgit.freedesktop.org/~sandmann/pixman/log/?h=flags

which contains several optimizations in this area.

Here is a summary of what's in it:

- It moves the computation of the various image properties out of the
  get_fast_path() loop and replaces them with bit masks that are much
  faster to check.

- It turns general_composite() and fast_composite_scale_nearest() into
  fast paths, so that all compositing goes through that path.

- It eliminates all the composite methods from pixman_implementation_t.

- It adds a fast path cache.

- It speeds up the operator 'strength reduction' that Antoine added a
  long time ago, by storing the table more compactly and doing the
  mapping in O(1) time.

I need to clean it up, break it into smaller bits, and send them to
the list for review.
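
To illustrate the bit mask and cache items in the summary above, here is
a minimal sketch. It is not the code in the 'flags' branch: the flag
names, the table layout and the one-entry cache are all assumptions made
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical image property flags, computed once per image. */
enum {
    FLAG_NO_TRANSFORM = 1u << 0,
    FLAG_NO_REPEAT    = 1u << 1,
    FLAG_SOLID        = 1u << 2
};

typedef struct {
    int      op;        /* operator this entry handles */
    uint32_t src_flags; /* flags the source must have */
    uint32_t dst_flags; /* flags the destination must have */
    int      id;        /* stands in for the composite function pointer */
} fast_path_t;

static const fast_path_t fast_paths[] = {
    { 0, FLAG_SOLID | FLAG_NO_TRANSFORM,     FLAG_NO_TRANSFORM, 1 },
    { 0, FLAG_NO_TRANSFORM | FLAG_NO_REPEAT, FLAG_NO_TRANSFORM, 2 },
    { 1, FLAG_SOLID,                         FLAG_NO_TRANSFORM, 3 }
};
#define N_FAST_PATHS (sizeof fast_paths / sizeof fast_paths[0])

/* One-entry cache remembering the last successful resolution. */
static struct {
    int valid, op;
    uint32_t src, dst;
    const fast_path_t *hit;
} cache;

static int table_scans; /* counts full searches, for illustration */

static const fast_path_t *get_fast_path(int op, uint32_t src, uint32_t dst)
{
    if (cache.valid && cache.op == op && cache.src == src && cache.dst == dst) {
        assert(cache.hit != NULL);
        return cache.hit;
    }

    table_scans++;
    for (size_t i = 0; i < N_FAST_PATHS; i++) {
        const fast_path_t *fp = &fast_paths[i];
        /* An entry matches when every flag it requires is present. */
        if (fp->op == op &&
            (src & fp->src_flags) == fp->src_flags &&
            (dst & fp->dst_flags) == fp->dst_flags) {
            cache.valid = 1;
            cache.op = op;
            cache.src = src;
            cache.dst = dst;
            cache.hit = fp;
            return fp;
        }
    }
    return NULL; /* caller falls back to the general path */
}
```

Each table entry costs two AND-and-compare operations to test, and a
repeated request with the same operator and flags skips the table
entirely.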

> These are not yet as effective as one would hope.

It might be worthwhile rerunning the benchmark against that branch,
though I suspect there is still some overhead. Almost anything will
show up when the images are as small as glyphs are.
 
> How insane would it be to push the get_fast_path() to the user and to be
> able to pass in the implementation + composite function instead of
> performing the search every time? This would also be useful for spans.
> And considering how most cairo operations are first performed to a mask,
> cairo could very effectively cache the fast path for its most frequent
> operations.

I really think the fast paths need to be kept an implementation
detail, because exposing them would constrain what information about
the images you could rely on to compute the fast path.

For example, right now pixman does not rely on the alignment of the
image data when it selects the fast path. This means someone could
look up a fast path, then go on to use it with several
differently-aligned images, which would mean pixman couldn't later on
add alignment optimizations.

However, I do agree that glyph compositing needs to become much faster
in both X and cairo, but I think that a better way would be to move
the Render glyph management code into pixman and expose a new

        pixman_glyph_set_t

along with something like a pixman_composite_glyphs() similar to how
Render works. This would allow both cairo and X to become
substantially faster, while sharing glyph caching code.
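
A sketch of the shape such an interface might take, under heavy
assumptions: glyph_set_t, glyph_set_add() and composite_glyphs_add() are
invented for this example and are not pixman functions, the glyphs are
a8 only, and the only operator is ADD into an a8 mask.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_GLYPHS      64
#define GLYPH_MAX_BYTES 256

/* Hypothetical glyph set: a8 coverage bitmaps owned by the library. */
typedef struct {
    int     width, height;
    uint8_t bits[GLYPH_MAX_BYTES];
} glyph_t;

typedef struct {
    glyph_t glyphs[MAX_GLYPHS];
    int     n_glyphs;
} glyph_set_t;

typedef struct { int index, x, y; } glyph_pos_t;

/* Copy a bitmap into the set; returns its index, or -1 if it does not fit. */
static int glyph_set_add(glyph_set_t *set, int w, int h, const uint8_t *bits)
{
    if (set->n_glyphs == MAX_GLYPHS || w <= 0 || h <= 0 ||
        h > GLYPH_MAX_BYTES / w)
        return -1;
    glyph_t *g = &set->glyphs[set->n_glyphs];
    g->width = w;
    g->height = h;
    memcpy(g->bits, bits, (size_t)w * (size_t)h);
    return set->n_glyphs++;
}

/* ADD a whole run of glyphs into an a8 mask in one call, clipping each
 * glyph to the mask. Operator and format are fixed for the entire run,
 * so any per-call setup happens once rather than once per glyph. */
static void composite_glyphs_add(uint8_t *mask, int mw, int mh, int stride,
                                 const glyph_set_t *set,
                                 const glyph_pos_t *pos, int n)
{
    for (int i = 0; i < n; i++) {
        assert(pos[i].index >= 0 && pos[i].index < set->n_glyphs);
        const glyph_t *g = &set->glyphs[pos[i].index];
        for (int gy = 0; gy < g->height; gy++) {
            int y = pos[i].y + gy;
            if (y < 0 || y >= mh)
                continue;
            for (int gx = 0; gx < g->width; gx++) {
                int x = pos[i].x + gx;
                if (x < 0 || x >= mw)
                    continue;
                unsigned sum = (unsigned)mask[y * stride + x] +
                               g->bits[gy * g->width + gx];
                mask[y * stride + x] = sum > 255 ? 255 : (uint8_t)sum;
            }
        }
    }
}
```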

For spans, I still think that a polygon image type in pixman is the
way to go, since again this would benefit both X and cairo. There
could certainly be a call to convert it into spans if that is useful
to other cairo backends, so that we wouldn't need to have two
rasterizers.


Thanks,
Soren

Re: Pixman glyph performance, and beyond!

by Chris Wilson

Excerpts from Soeren Sandmann's message of Fri Oct 23 02:37:39 +0100 2009:
> The latest incarnation of that work is the 'flags' branch here:
>
>     http://cgit.freedesktop.org/~sandmann/pixman/log/?h=flags
>
> which contains several optimizations in this area.

[snip]

> It might be worthwhile rerunning the benchmark against that branch,
> though I suspect there is still some overhead. Almost anything will
> show up when the images are as small as glyphs are.

Very effective, Søren, it eliminated the get_fast_path() overhead entirely:

  32.84%  [.] sse2_composite_add_n_8888_8888_ca
  17.13%  [.] sse2_composite_over_n_8888_8888_ca
  15.98%  [.] pixman_image_composite
   5.78%  [.] pixman_blt_sse2
   5.40%  [.] _pixman_image_validate
   3.98%  [.] pixman_compute_composite_region32
   2.12%  [.] pixman_fill_sse2

It looks like it's been absorbed into pixman_image_composite(), but the
runtime improved by over 10% -- indicative that the lookup overhead was
eliminated. Though there is still around 25% to be recovered.

> I really think the fast paths need to be kept an implementation
> detail, because exposing them would constrain what information about
> the images you could rely on to compute the fast path.
>
> For example, right now pixman does not rely on the alignment of the
> image data when it selects the fast path. This means someone could
> look up a fast path, then go on to use it with several
> differently-aligned images, which would mean pixman couldn't later on
> add alignment optimizations.

I think you've effectively demonstrated that the overhead from selecting
the fast path should be negligible. So we should move on to the question
of how to push large batches of work to pixman efficiently.

> However, I do agree that glyph compositing needs to become much faster
> in both X and cairo, but I think that a better way would be to move
> the Render glyph management code into pixman and expose a new
>
>         pixman_glyph_set_t
>
> along with something like a pixman_composite_glyphs() similar to how
> Render works. This would allow both cairo and X to become
> substantially faster, while sharing glyph caching code.
>
> For spans, I still think that a polygon image type in pixman is the
> way to go, since again this would benefit both X and cairo. There
> could certainly be a call to convert it into spans if that is useful
> to other cairo backends, so that we wouldn't need to have two
> rasterizers.

I'm actually not so convinced that this is the direction that pixman should
be going in. From my perspective cairo requires specific path -> backend
geometry converters, and a polygon rasteriser with a span line interface
has quickly become the default method for pushing masks around. Whereas
traps have been relegated to mostly handling boxes, aside from when the
most efficient wire request we have available is CompositeTraps. (Has
anyone else noticed that the RLE mask for curved geometry is often an
order of magnitude smaller than the equivalent set of trapezoids,
almost as small as the original path?) Similarly, I'd rather not add the
overhead of an independent layer of glyph management. With that bias,
I'd prefer that pixman retained its focus on pixel manipulation routines
and we improve the interfaces for performing large sets of similar
operations.

One issue that we will encounter very soon is the pain caused by forcing
the user to emit cairo_show_glyphs() early for each change in font. This
can be fixed up in the backends that batch requests and use a
consolidated glyph atlas (i.e. there is no level state change and so the
geometry is just accumulated onto the previous operation). [There is
still substantial overhead from cairo doing the analysis on the extra
operations.] Similarly we can move away from an immediate mode, direct
access, pixman - and treat pixman more like a GPU, if it is performant.
-ickle
--
Chris Wilson, Intel Open Source Technology Centre

Re: Pixman glyph performance, and beyond!

by Soeren Sandmann

Hi Chris,

> > The latest incarnation of that work is the 'flags' branch here:
> >
> >     http://cgit.freedesktop.org/~sandmann/pixman/log/?h=flags
> >
> > which contains several optimizations in this area.
>
> [snip]
>
> > It might be worthwhile rerunning the benchmark against that branch,
> > though I suspect there is still some overhead. Almost anything will
> > show up when the images are as small as glyphs are.
>
> Very effective, Søren, it eliminated the get_fast_path() overhead entirely:
>
>   32.84%  [.] sse2_composite_add_n_8888_8888_ca
>   17.13%  [.] sse2_composite_over_n_8888_8888_ca
>   15.98%  [.] pixman_image_composite
>    5.78%  [.] pixman_blt_sse2
>    5.40%  [.] _pixman_image_validate
>    3.98%  [.] pixman_compute_composite_region32
>    2.12%  [.] pixman_fill_sse2
>
> It looks like it's been absorbed into pixman_image_composite(), but the
> runtime improved by over 10% -- indicative that the lookup overhead was
> eliminated. Though there is still around 25% to be recovered.

Some of that 25% might be recoverable from _pixman_image_validate()
by precomputing the flags, but other than that, I don't think there is
all that much more to be gained without a better interface for glyph
rendering.

> > However, I do agree that glyph compositing needs to become much faster
> > in both X and cairo, but I think that a better way would be to move
> > the Render glyph management code into pixman and expose a new
> >
> >         pixman_glyph_set_t
> >
> > along with something like a pixman_composite_glyphs() similar to how
> > Render works. This would allow both cairo and X to become
> > substantially faster, while sharing glyph caching code.
> >
> > For spans, I still think that a polygon image type in pixman is the
> > way to go, since again this would benefit both X and cairo. There
> > could certainly be a call to convert it into spans if that is useful
> > to other cairo backends, so that we wouldn't need to have two
> > rasterizers.
>
> I'm actually not so convinced that this is the direction that pixman should
> be going in. From my perspective cairo requires specific path -> backend
> geometry converters, and a polygon rasteriser with a span line interface
> has quickly become the default method for pushing masks around. Whereas
> traps have been relegated to mostly handling boxes, aside from when the
> most efficient wire request we have available is CompositeTraps. (Has
> anyone else noticed that the RLE mask for curved geometry is often an
> order of magnitude smaller than the equivalent set of trapezoids,
> almost as small as the original path?)

One of the major reasons for adding a polygon image to pixman is that
it would make a PictPolygon render picture possible, thereby finally
eliminating the use of trapezoids, while allowing the X server to do
one-pass compositing of geometry, at least in software.

I am not proposing to tessellate the polygons in pixman, but to
composite them directly scanline by scanline. Using Tor's clever trick
[1] of representing scanlines as accumulation buffers, the operation

        (source IN polygon) OVER dest

can be implemented by

        /* Generate polygon scanline accumulation buffer */
        for each subpixel scanline
                for each active edge
                        add delta to one pixel in accumulation buffer
                        bresenham step

        /* Integrate the deltas and composite where coverage is non-zero */
        uint8_t m = 0;
        for each pixel i:
                m += polygon[i];
                if (m)
                {
                        s = load_source();
                        d = load_dest();

                        d = composite (s, m, d);
                }

which is very SIMD-able and allows us to not allocate and zero-fill a
potentially large temporary mask. I think this would be faster than
the arrays of spans. This would benefit at least the image backend
too, but in any case, there is a clear need to do better than
trapezoids for uploading geometry to X.
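
For concreteness, a simplified but runnable version of the loop above.
The simplifications are assumptions of this sketch, not part of the
proposal: coverage is antialiased vertically only (15 subpixel rows of
weight 17 each), edges are given as ready-made [x0, x1) crossings rather
than stepped with Bresenham, the source is a solid value, and pixels are
8-bit grey.

```c
#include <assert.h>
#include <stdint.h>

#define SUBROWS    15               /* vertical samples per pixel row */
#define SUB_WEIGHT (255 / SUBROWS)  /* 17, so full coverage sums to 255 */

/* One pair of polygon crossings on a subpixel scanline: inside [x0, x1). */
typedef struct { int x0, x1; } subspan_t;

/* Add the coverage deltas for one subpixel scanline. acc must have
 * width + 1 entries so a span ending at the right edge can store its
 * negative delta. The deltas rely on uint8_t wrap-around; the running
 * sum computed later stays within 0..255. */
static void accumulate_subrow(uint8_t *acc, int width, subspan_t s)
{
    assert(0 <= s.x0 && s.x0 <= s.x1 && s.x1 <= width);
    acc[s.x0] = (uint8_t)(acc[s.x0] + SUB_WEIGHT);
    acc[s.x1] = (uint8_t)(acc[s.x1] - SUB_WEIGHT);
}

/* (source IN polygon) OVER dest for one row: integrate the deltas and
 * blend only where the coverage is non-zero. */
static void composite_row(uint8_t *dest, const uint8_t *acc, int width,
                          uint8_t source)
{
    uint8_t m = 0;
    for (int i = 0; i < width; i++) {
        m = (uint8_t)(m + acc[i]);
        if (m)
            dest[i] = (uint8_t)((source * m + dest[i] * (255 - m) + 127) / 255);
    }
}
```

The accumulation buffer is one byte per pixel of a single row, so it can
be reused for every scanline instead of allocating a full-size mask.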

It's less convenient for shaders because of the horizontal prefix sum,
which is why there could be a

        pixman_poly_image_make_spans (...)

interface to generate spans for the cairo GPU backends, or a
callback-based one like the one in current cairo. DDX drivers could
use this as well, as could pixman GPU backends.
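
One way such a conversion could work, sketched for a single row: do the
prefix sum on the CPU and run-length encode the result. The span_t
layout and the accum_to_spans() name are made up for this example; the
real pixman_poly_image_make_spans() signature is left open above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical span: len pixels starting at x, all with equal coverage. */
typedef struct { int x, len; uint8_t coverage; } span_t;

/* Integrate a row of coverage deltas and emit runs of equal, non-zero
 * coverage. Returns the number of spans written (at most max_spans). */
static int accum_to_spans(const uint8_t *acc, int width,
                          span_t *spans, int max_spans)
{
    int n = 0;
    uint8_t m = 0;

    assert(width >= 0 && max_spans >= 0);
    for (int i = 0; i < width; i++) {
        m = (uint8_t)(m + acc[i]);      /* horizontal prefix sum */
        if (m == 0)
            continue;
        if (n > 0 && spans[n - 1].coverage == m &&
            spans[n - 1].x + spans[n - 1].len == i) {
            spans[n - 1].len++;         /* extend the current run */
        } else {
            if (n == max_spans)
                break;                  /* out of room, stop early */
            spans[n].x = i;
            spans[n].len = 1;
            spans[n].coverage = m;
            n++;
        }
    }
    return n;
}
```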

> Similarly, I'd rather not add the overhead of an independent layer
> of glyph management.

Simply moving the Render glyph code into pixman would be an immediate
improvement for non-hardware-accelerated X servers since the remaining
overhead in the profile above would essentially disappear.

A little longer term, it would allow storing the glyphs as efficiently
as possible for the CPU in question (since pixman already has details
about the CPU). For example, the SSE2 backend could benefit greatly if
the glyphs were stored in a 16 x n image, possibly sorted by
approximate use frequency. This would improve both cache performance
and allow aligned loads to be used.
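
A small sketch of the 16 x n layout, assuming a8 glyphs no wider than 16
pixels and a fixed capacity. glyph_strip_t and glyph_strip_add() are
hypothetical names, and sorting by use frequency is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STRIP_WIDTH 16   /* bytes per row, the width of one SSE2 register */
#define STRIP_ROWS  256  /* arbitrary capacity for this sketch */

/* All glyphs share one tall image whose rows are 16-byte aligned. */
typedef struct {
    _Alignas(16) uint8_t rows[STRIP_ROWS][STRIP_WIDTH];
    int n_rows;
} glyph_strip_t;

/* Append a glyph, zero-padding each row to 16 bytes. Returns the index
 * of the glyph's first row, or -1 if it is too wide or the strip is full. */
static int glyph_strip_add(glyph_strip_t *s, int w, int h, const uint8_t *bits)
{
    if (w <= 0 || w > STRIP_WIDTH || h <= 0 || h > STRIP_ROWS - s->n_rows)
        return -1;
    int first = s->n_rows;
    for (int y = 0; y < h; y++) {
        uint8_t *row = s->rows[first + y];
        /* Every row starts on a 16-byte boundary, so a SIMD backend
         * could fetch it with a single aligned load. */
        assert(((uintptr_t)row % STRIP_WIDTH) == 0);
        memset(row, 0, STRIP_WIDTH);
        memcpy(row, bits + (size_t)y * (size_t)w, (size_t)w);
    }
    s->n_rows += h;
    return first;
}
```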

I'm not sure I understand what 'overhead of an independent layer of
glyph management' you see for cairo. For at least the image backend, I
don't see anything that a pixman_glyph_set_t could not provide, and it
seems like a win to share this code with the X server. For other
backends, it may not be as useful, but it won't be harmful either.

> With that bias, I'd prefer that pixman retained its focus on pixel
> manipulation routines and we improve the interfaces for performing
> large sets of similar operations.

> One issue that we will encounter very soon is the pain caused by forcing
> the user to emit cairo_show_glyphs() early for each change in font. This
> can be fixed up in the backends that batch requests and use a
> consolidated glyph atlas (i.e. there is no level state change and so the
> geometry is just accumulated onto the previous operation). [There is
> still substantial overhead from cairo doing the analysis on the extra
> operations.] Similarly we can move away from an immediate mode, direct
> access, pixman - and treat pixman more like a GPU, if it is performant.

Even with glyphs and polygons in pixman, I agree that there is
potentially a lot of benefit to be had from submitting more work to
pixman at a time.

I don't have a clear proposal for what such an interface would look
like, but here are some things that are worth thinking about:

      - How well does the interface help if pixman is multithreaded?

      - How well does it help if pixman gains GPU backends?

      - Can we eliminate temporary images? I.e., if someone does

                cairo_push_group()
                <some simple stuff>
                cairo_pop_group_to_source()
                cairo_paint()

        can we do the whole operation in one pass without allocating
        a temporary mask? One way to do this would be to add a new

                pixman_image_composited_t

        image type that would contain pointers to two other images
        and composite them on the fly when the toplevel image is
        asked to fetch a scanline. This leads to the idea of an
        'expression tree' of images as the way to submit a lot of
        work.

      - Can it be JIT compiled? There are two quite different
        approaches to JIT compilation:

              - Generate code similar to the current fast paths and
                cache it. This is simpler to get going at first, but
                also fundamentally requires temporary images.

              - Generate one-shot code for a lot of operations at
                once. With the expression tree idea, this might make a
                lot of sense. The compiler could look at the tree and
                generate the code that would produce the least amount
                of memory traffic.

                Shader code could be generated from this as well.
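
To make the composited-image idea from the list above concrete, here is
a toy expression tree. Everything in it is an assumption for the sake of
illustration: image_t is not pixman_image_t, only a8 data and the OVER
operator are handled, and rows are limited to MAX_WIDTH pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WIDTH 64

typedef enum { IMAGE_BITS, IMAGE_SOLID, IMAGE_COMPOSITED } image_type_t;

/* Hypothetical image node; a8 only to keep the sketch short. */
typedef struct image {
    image_type_t        type;
    const uint8_t      *bits;      /* IMAGE_BITS: pixel data */
    int                 stride;    /* IMAGE_BITS: bytes per row */
    uint8_t             value;     /* IMAGE_SOLID: constant alpha */
    const struct image *src, *dst; /* IMAGE_COMPOSITED: src OVER dst */
} image_t;

/* Fetch one scanline, evaluating composited nodes on the fly so that no
 * intermediate image is allocated, only one row of scratch per level. */
static void fetch_scanline(const image_t *img, int y, int width, uint8_t *out)
{
    assert(width >= 0 && width <= MAX_WIDTH);
    switch (img->type) {
    case IMAGE_BITS:
        for (int x = 0; x < width; x++)
            out[x] = img->bits[y * img->stride + x];
        break;
    case IMAGE_SOLID:
        for (int x = 0; x < width; x++)
            out[x] = img->value;
        break;
    case IMAGE_COMPOSITED: {
        uint8_t s[MAX_WIDTH];
        fetch_scanline(img->dst, y, width, out);
        fetch_scanline(img->src, y, width, s);
        for (int x = 0; x < width; x++) /* OVER on alpha: s + d * (1 - s) */
            out[x] = (uint8_t)(s[x] + (out[x] * (255 - s[x]) + 127) / 255);
        break;
    }
    }
}
```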

Thanks,
Soren



[1] http://lists.cairographics.org/archives/cairo/2007-August/011136.html

Re: Pixman glyph performance, and beyond!

by Chris Wilson

Excerpts from Aaron Plattner's message of Fri Oct 23 21:05:07 +0100 2009:
> This is maybe only sort of related, but can we push even more of that work
> to the server?  Modern GPUs can evaluate Bezier curves very efficiently,
> and can rasterize them quickly without having to tessellate at all.  An
> approach that stores the control points in video memory and evaluates them
> using something akin to a display list call should be WAY faster than any
> amount of RLE or fancy span encoding or even sending polygons.  The server
> could always decompose it into spans if the implementation can't do the
> paths itself.

Indeed, this is the goal we are working towards. :-)

Pixman is interesting as it is both the lowest common denominator and
reference implementation, and also serves as a vital baseline against
which we can measure driver performance. If we cannot render faster than
the CPU, or if we consume more power using the GPU, then our drivers suck.

At the moment, I'm looking at an incremental approach to move
tessellation to the server, and the first steps along that path would be
to use more compact mask representations and extend the current Render
protocol to match current and desired usage.

We're playing with high quality rendering of paths using the GPU via
cairo-gl. Carl has proposed to talk at LCA on this subject, so the goal
is to have something useful in place for him to demonstrate ;-). The
emphasis here, IMO, has to be to maintain cairo's high image quality -
jittered or multisampled stencil based approaches are vastly inferior
(as is the output of some other *slower* 2D libraries, I'm not bitter
;-), though there is a strong push to use those for animations where
the moving image is more forgiving to jaggies and sparkles.

My ulterior motive is then to push cairo into the X server (with cairo
talking over the wire to a cairo extension, which then uses GL or in
some of my wilder plans DRM), so that we can share the tessellation
code between client and server. Or Gallium, or other grand vision du
jour. But first, cairo-gl needs to achieve cairo-drm levels of
performance on the client (as in be faster than the CPU!), before it
becomes an intriguing prospect of pushing to the server. So in the
meantime, I'm trying to work out why the EXA/UXA drivers are so
slow, and see what we can do to improve the current architecture.
-ickle
--
Chris Wilson, Intel Open Source Technology Centre

Re: Pixman glyph performance, and beyond!

by Robert O'Callahan

On Sat, Oct 24, 2009 at 10:24 AM, Chris Wilson <chris@...> wrote:
> The emphasis here, IMO, has to be to maintain cairo's high image quality -
> jittered or multisampled stencil based approaches are vastly inferior
> (as is the output of some other *slower* 2D libraries, I'm not bitter
> ;-), though there is a strong push to use those for animations where
> the moving image is more forgiving to jaggies and sparkles.

If there are unavoidable quality/performance tradeoffs, then as a consumer
of cairo (Firefox) I would really like to be able to choose our tradeoff
point, because most of the time for us performance is more important than
quality. Or better said, slow animation or rendering is a much more obvious
quality problem than slightly degraded antialiasing.

Rob
--
"He was pierced for our transgressions, he was crushed for our iniquities; the punishment that brought us peace was upon him, and by his wounds we are healed. We all, like sheep, have gone astray, each of us has turned to his own way; and the LORD has laid on him the iniquity of us all." [Isaiah 53:5-6]


Re: Pixman glyph performance, and beyond!

by Mike Shaver

On Fri, Oct 23, 2009 at 5:37 PM, Robert O'Callahan <robert@...> wrote:
> If there are unavoidable quality/performance tradeoffs, then as a consumer
> of cairo (Firefox) I would really like to be able to choose our tradeoff
> point, because most of the time for us performance is more important than
> quality. Or better said, slow animation or rendering is a much more obvious
> quality problem than slightly degraded antialiasing.

And, indeed, during animations and so forth the quality difference may
not be perceptible, so we might well want to slip into "not quite as
awesome" mode during transitions and animations and scrolling or
whatever, and then do the last frame in "great!" mode.  If we are
doing a bounded animation and can use the GPU, we could even prepare
that visually stunning end frame while the animation was being
processed...

Mike