pixman: New ARM NEON optimizations


by Siarhei Siamashka

Hello,

This branch has new ARM NEON optimizations:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-neon-update

It uses GNU assembler and its macro preprocessor to generate fast path
functions from a common template, so that only a minor part of the inner
loop code needs to be implemented for each function, saving a lot of work.
For example, the core function for performing x8r8g8b8->r5g6b5 conversion
uses as little as this much code:
http://cgit.freedesktop.org/~siamashka/pixman/commit/?h=arm-neon-update&id=b17297cf15122e5b38c082c9fe6f1ff708b7efa4
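
For readers who do not want to open the commit, here is a minimal scalar sketch in C of what an x8r8g8b8->r5g6b5 conversion computes per pixel. It is not the NEON code from the branch; the function names are hypothetical and chosen only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: pack one x8r8g8b8 pixel into r5g6b5 by keeping the
 * top 5, 6 and 5 bits of the red, green and blue channels. The unused
 * top byte (x8) is ignored. */
static uint16_t convert_8888_to_0565(uint32_t s)
{
    uint32_t r = (s >> 16) & 0xff;
    uint32_t g = (s >> 8) & 0xff;
    uint32_t b = s & 0xff;

    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Convert one scanline of w pixels (hypothetical helper name). */
static void scanline_8888_to_0565(uint16_t *dst, const uint32_t *src, size_t w)
{
    for (size_t i = 0; i < w; i++)
        dst[i] = convert_8888_to_0565(src[i]);
}
```

A NEON version does the same shifts and merges on 8 pixels at a time, which is why only a handful of instructions are needed in the inner loop.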

The template supports any combination of source, destination and mask
images with 8bpp, 16bpp and 32bpp color formats (24bpp support is a bit more
tricky, but can be done too). The code has comments which should help to
understand how to implement new fast path functions. If the comments are not
sufficient or not completely clear, I will try to update them.

The reasons to use GNU assembler are:
1. Full control over register allocation (there are not too many registers,
considering that up to 3 images are supported, each with its stride, pointer
and prefetch state). I ran out of registers when using inline assembly and
compiling with a frame pointer.
2. It allows the use of a more or less advanced macro preprocessor, which makes
everything easier. A somewhat more flexible option would be JIT code
generation (this is actually something to consider later).

Technically, there should be no problem catching up with SSE2, especially if
perfect instruction scheduling is skipped at the first stage. Right now only
the existing NEON fast path functions are reimplemented, plus just a few
more.

The thing to solve now is how to handle systems other than Linux. There is
a potential problem with ABI compatibility: the functions must be fully
compatible with the calling conventions, etc. For now I'm only sure that
they are compatible with Linux EABI. Most likely the other systems should be
fine too, or will be fine with a few tweaks.

Benchmarks show that this new code generally has the same or substantially
better performance than the current pixman NEON fast path functions. Log
from cairo-perf is attached.

--
Best regards,
Siarhei Siamashka

Speedups
========
image-rgba-???  paint-with-alpha_similar_rgba_over-512-0     11.14 (11.23 0.50%) ->   5.55 (5.74 1.46%):  2.01x speedup
image-rgb-???  one-rectangle_image_rgba_over-512-0     10.80 (10.83 0.49%) ->   5.55 (5.62 0.78%):  1.95x speedup
image-rgb-???  paint-with-alpha_image_rgba_over-512-0     11.14 (11.26 2.48%) ->   5.74 (5.92 1.01%):  1.94x speedup
image-rgba-???   paint_image_rgba_over-512-0     10.80 (10.86 0.21%) ->   5.58 (5.68 0.81%):  1.93x speedup
image-rgba-???  paint_similar_rgba_over-512-0     10.74 (10.77 0.51%) ->   5.62 (5.83 1.43%):  1.91x speedup
image-rgb-???  paint_similar_rgba_over-512-0     10.77 (10.83 0.42%) ->   5.71 (5.80 0.71%):  1.89x speedup
image-rgb-???   paint_image_rgba_over-512-0     10.77 (10.83 0.50%) ->   5.74 (5.74 0.61%):  1.88x speedup
image-rgba-???  one-rectangle_image_rgba_over-512-0     10.80 (10.86 0.35%) ->   5.80 (5.89 0.81%):  1.86x speedup
image-rgba-???  paint-with-alpha_image_rgba_over-256-0      2.59 (2.62 1.88%) ->   1.43 (1.47 0.96%):  1.81x speedup
image-rgba-???   paint_image_rgba_over-256-0      2.65 (2.69 2.32%) ->   1.53 (1.59 3.08%):  1.74x speedup
image-rgba-???  paint_similar_rgba_over-256-0      2.35 (2.41 1.79%) ->   1.37 (1.37 0.03%):  1.71x speedup
image-rgba-???  paint-with-alpha_similar_rgba_over-256-0      2.38 (2.44 2.43%) ->   1.40 (1.40 0.04%):  1.70x speedup
image-rgb-???   paint_image_rgba_over-256-0      2.26 (2.35 2.66%) ->   1.37 (1.43 1.49%):  1.65x speedup
image-rgb-???  paint-with-alpha_similar_rgba_over-256-0      2.44 (2.47 2.26%) ->   1.53 (1.56 2.57%):  1.60x speedup
image-rgb-???  paint-with-alpha_image_rgba_over-256-0      2.41 (2.50 2.43%) ->   1.53 (1.56 0.95%):  1.58x speedup
image-rgb-???  paint_similar_rgba_over-256-0      2.26 (2.35 2.03%) ->   1.43 (1.44 0.03%):  1.57x speedup
image-rgba-???  one-rounded-rectangle_solid_rgba_over-512-0     11.44 (11.54 0.40%) ->   8.09 (8.18 0.47%):  1.42x speedup
image-rgba-???  one-rounded-rectangle_solid_rgb_over-512-0     11.41 (11.57 0.56%) ->   8.18 (8.18 0.01%):  1.40x speedup
image-rgb-???  one-rectangle_similar_rgb_source-512-0      3.69 (3.72 1.06%) ->   3.27 (3.45 2.30%):  1.13x speedup
image-rgb-???  paint_similar_rgba_source-256-0      0.89 (0.89 1.40%) ->   0.79 (0.79 4.80%):  1.12x speedup
image-rgba-???  rectangles_similar_rgba_over-512-0     90.09 (90.76 0.70%) ->  81.18 (83.56 1.15%):  1.11x speedup
image-rgb-???  rectangles_similar_rgba_over-512-0     89.91 (91.43 0.69%) ->  81.60 (82.89 0.81%):  1.10x speedup
image-rgb-???  rectangles_similar_rgba_source-512-0     69.76 (70.98 0.99%) ->  64.64 (65.22 0.40%):  1.08x speedup
image-rgb-???  rectangles_image_rgba_source-512-0     69.55 (70.89 0.95%) ->  65.06 (65.55 0.46%):  1.07x speedup
image-rgba-???     fill_solid_rgb_over-256-0      3.66 (3.75 1.64%) ->   3.45 (3.54 1.56%):  1.06x speedup
image-rgba-???    fill_solid_rgba_over-256-0      3.66 (3.69 1.16%) ->   3.48 (3.48 1.12%):  1.05x speedup

Slowdowns
========
image-rgb-???  paint_solid_rgba_source-256-0      0.28 (0.28 0.00%) ->   0.30 (0.30 4.63%):  1.11x slowdown

_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: pixman: New ARM NEON optimizations

by Siarhei Siamashka

On Wednesday 14 October 2009, Siarhei Siamashka wrote:
> Hello,
>
> This branch has new ARM NEON optimizations:
> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-neon-update

[...]

Pushed the first commit to master. It only contains removal of unused and
commented out code. It was safe to apply, and this kind of housekeeping is
needed anyway before introducing something new here.

Regarding the other changes, I'm somewhat tempted to push them too :) After
all, this code provides up to 2x performance improvement over the existing
NEON optimizations. It is also needed for introducing ARM NEON
optimizations to an overlap-aware pixman_blt function.

There is also a pixman fast path microbenchmark program in this branch:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=test-n-bench

It can be used by running:
$ test/lowlevel-blt-bench _

--
Best regards,
Siarhei Siamashka

Re: pixman: New ARM NEON optimizations

by Koen Kooi-2

On 19-10-09 23:35, Siarhei Siamashka wrote:

> On Wednesday 14 October 2009, Siarhei Siamashka wrote:
>> Hello,
>>
>> This branch has new ARM NEON optimizations:
>> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-neon-update
>
> [...]
>
> Pushed the first commit to master. It only contains removal of unused and
> commented out code. It was safe to apply, and this kind of housekeeping is
> needed anyway before introducing something new here.
>
> Regarding the other changes. I'm somewhat tempted to push them too :)

I'm running master + your remaining patches and I can't spot any bugs on
omap3. I haven't tried any benchmarking whatsoever.

regards,

Koen


Re: pixman: New ARM NEON optimizations

by Soeren Sandmann-2

Hi,

> This branch has new ARM NEON optimizations:
> http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-neon-update

In general, I like this a lot, though I have some questions and
comments about it. See below.

> The reasons to use GNU assembler are:

> 1. Full control over registers allocation (there are not too many of
> them, considering that up to 3 images are supported with their
> strides, pointers, prefetch stuff). I encountered problems running
> out of registers with inline assembly and compiling with frame
> pointer.

> 2. This allows the use of more or less advanced macro preprocessor
> and makes everything easier. A bit more flexible option would be to
> use JIT code generation here (this is actually something to consider
> later).

Yes, JIT compiling is very much worth considering, and in fact, your
general framework pretty much *is* a JIT compiler, except that it runs
at compile time and is slightly difficult to read because it is
written in the GNU as meta-language.

> Technically, there should be no problem catching up with SSE2,
> especially if instructions scheduling perfection could be skipped at
> the first stage. Right now only the existing NEON fast path
> functions are reimplemented plus just a few more.

Right, instruction scheduling is the main disadvantage to using the
assembler as opposed to intrinsics. Cortex A9 is out-of-order I
believe, so it will have different scheduling requirements than the
A8, which in turn likely has different scheduling requirements than
earlier in-order CPUs. Though, a reasonable approach might be to
assume that the A9 will not be very sensitive to scheduling at all
(due to it being out-of-order), and then simply optimize for the A8.

> Now the thing to solve is how to handle the systems other than
> linux. There is a potential problem with ABI compatibility - the
> functions must be fully compatible with the calling conventions,
> etc. For now I'm only sure that they are compatible with Linux
> EABI. Most likely the other systems should be fine too, or will be
> fine with a few tweaks.

I think it's perfectly fine even if it is Linux specific at first;
people interested in other operating systems can feel free to send
patches.

Comments on the code:

I have mostly looked at the general framework, and paid less attention
to the various uses of it, except as a reference to understand how the
framework works. And of course, since I'm unfamiliar with both GNU as
and ARM, I may have misread things.

Overall, I like this a lot. As you say, it seems to be general enough
to support pretty much all the pixman operations, as long as there are
no transformations involved or 24 bpp formats (which are always a
pain).

It would be interesting to do something similar for the SSE2 backend.

General

If possible, I think it would be useful to break down the
composite_composite_function macro into smaller bits, that mirror the
structure of the generated code, and then add a comment to the top of
each sub-routine. For example, a sub-macro for the initialization of
the registers plus setup, one for the left unaligned part of the
scanline, one for the middle part, one for the right unaligned part,
one for moving to the next line, and one for doing the small rectangle
part.

I think this would make it easier to grasp what is going on.
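
To make the proposed breakdown concrete, here is a rough C sketch of how a scanline could be divided into the left unaligned part, the aligned middle part and the right leftover part. This is only an illustration under stated assumptions: the 16-byte alignment target, the struct and the function name are hypothetical and not taken from the branch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of how one scanline is divided. */
struct scanline_split {
    size_t head;  /* leading pixels before dst reaches 16-byte alignment */
    size_t body;  /* pixels handled in full blocks by the inner loop */
    size_t tail;  /* leftover pixels after the last full block */
};

/* dst_addr is assumed to be aligned to the pixel size; bytes_per_pixel
 * is 1, 2 or 4; block is the number of pixels per inner-loop iteration. */
static struct scanline_split split_scanline(uintptr_t dst_addr, size_t width,
                                            size_t bytes_per_pixel,
                                            size_t block)
{
    struct scanline_split s;
    size_t misalign = (size_t)(dst_addr & 15);
    size_t head = misalign ? (16 - misalign) / bytes_per_pixel : 0;

    if (head > width)
        head = width;  /* narrow scanline: everything is "head" */

    s.head = head;
    s.body = (width - head) / block * block;
    s.tail = width - head - s.body;
    return s;
}
```

Each of the three counts would then correspond to one of the suggested sub-macros, with the "small rectangle" case being the one where body ends up as zero.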

Also, in various places, more specific comments would be useful, in
addition to the (generally good) highlevel comments that are already
there.

For example,

* I don't fully understand what the abits argument to the load/store
  functions is supposed to be doing. Does it have to do with masking
  in 0xff as appropriate? Part of this may be that I don't know what

     [ &mem_operand&, :&abits& ]!

  means

* A comment that .regs_shortage really means that H and W are spilled
  to the stack.

and similar.

Though in general, the code is well commented.

Deinterleaving:

What is the benefit of deinterleaving?

Why is there no deinterleaving for the inner loop, and only for the
head and tail? Are the callers supposed to do this themselves if they
want it?

Cache preload:

* As far as I can tell, this macro is preloading with PF_X relative to
  PF_SRC, but PF_X is always an offset into a scanline because you
  subtract ORIG_W whenever it is exceeded, and PF_SRC is set to SRC, but
  never updated anywhere that I can see. So it preloads different
  parts of the first line over and over?

* It seems like you could save a bunch of registers by simply always
  prefetching some fixed number of pixels ahead of where you are going
  to read. Or alternatively, just dump PF_X and keep the number of
  pixels to prefetch ahead in that register. But presumably there is
  some reason not to do this.

* If prefetch_distance is 0, shouldn't this macro not generate
  anything at all?

* I see that you have (H - 1) stored in the upper 28 bits of PF_CTL,
  but are those bits actually being used for anything other than
  preventing the ldrs? Ie., it will still attempt to use the plds
  below the image, right?

If I'm misreading this completely, it would be good to have some more
detailed comments here.

A couple of things that it may or may not be worth thinking about:

* Maybe add support for skipping unnecessary destination reads? Ie.,
  in the case of OVER, there is no reason to read the destination if
  the mask is 0, or if the combined (SRC IN MASK) is 1.

  In my experience, this is a win for many images that
  happen in practice. Consider these common cases:

         - Text: most pixels are either fully opaque or fully
           transparent, and in either case, no destination read is
           necessary.

         - Rounded antialiased rectangle. The corners are transparent
           and the body is opaque or transparent. Essentially none of
           the pixels actually need the destination.

* Does ARM have the equivalent of movnt? It may or may not be
  interesting to use them for the operations that are essentially
  solid fills.

* I don't fully understand why you need the tail/head_tail/head split,
  but if it is to save a branch instruction, maybe you could use the
  standard compiler trick of turning a while loop into this:

           jump test
        body:
           <body>
        test:
           <test code>
           conditional_jump body.


Thanks,
Soren

Re: pixman: New ARM NEON optimizations

by Jonathan Morton

On Mon, 2009-10-26 at 17:14 +0100, Soeren Sandmann wrote:
> Right, instruction scheduling is the main disadvantage to using the
> assembler as opposed to intrinsics. Cortex A9 is out-of-order I
> believe, so it will have different scheduling requirements than the
> A8, which in turn likely has different scheduling requirements than
> earlier in-order CPUs.

> Though, a reasonable approach might be to
> assume that the A9 will not be very sensitive to scheduling at all
> (due to it being out-of-order), and then simply optimize for the A8.

Because A9 is OoO, it should be roughly as insensitive to scheduling as
a modern desktop processor is.  That's a huge step forward in the ARM
world, as all previous ARMs (that I know about) have been single- or
dual-issue in-order machines.

For comparison, the Pentium and Pentium-MMX were dual-issue in-order
machines, the 486 having been single-issue but pipelined (able to keep
multiple pipeline stages in use at once).  The P6 core introduced OoO, along
with everything else.

I can't immediately identify a mainstream PowerPC that *isn't* OoO,
except possibly the head-end core in the Cell.

Instead, the remaining v7-A CPUs without OoO are worth focusing on for
scheduling: the A8, the Snapdragon (which is very broadly similar to the
A8), and the brand-new A5 (which is single-issue).

It's not difficult to write code that works well on both A8 and
Snapdragon.  A good rule of thumb is to issue a "processing" and a
"moving" instruction together, where "moving" could be a load/store, a
register move, a vector permute, a branch or a simple-ALU instruction,
while a "processing" instruction would be any ALU op or any non-permute
vector op.  This is common to many small RISC designs, OoO or not.

The A5 is probably quite scheduling-compatible with code written for A8
or Snapdragon, and will simply run it more slowly because it can never
dual-issue.  It can apparently fold predicted branches, though.  The one
thing I would be careful of is that you can no longer do preloads for
free in the middle of a heavy processing loop - instead you have to find
an unavoidable bubble to fill to avoid slowing down the cached case.

The A9 does seem to have another trick up its sleeve: an
auto-prefetcher.  If that's anywhere near as good as the one in the
PowerPC 970, that's going to essentially eliminate any need for manual
preloading.

> * I don't fully understand what the abits argument to the load/store
>   functions is supposed to be doing. Does it have to do with masking
>   in 0xff as appropriate? Part of this may be that I don't know what
>
>      [ &mem_operand&, :&abits& ]!
>
>   means

This is a hint to the CPU that the address will be aligned to a
guaranteed degree.  This can save a cycle in the LSU on the A8.

> * Does ARM have the equivalent of movnt? It may or may not be
>   interesting to use them for the operations that are essentially
>   solid fills.

Not that I know of.  But if you write to uncached memory, all the v7-A
cores will do write-combining quite well (unlike v6 and earlier).

If you write to cached memory, the cache controller *might* figure out
that you've filled the whole cacheline and avoid the line-fill read, but
that is likely to be more core-specific.  If you make several passes
over the same pixmap, the cached behaviour makes sense to keep rather
than forcing a bypass of it.

The difference between cached and uncached memory is rather important
actually.  You have to bias much harder towards large aligned read
instructions on uncached memory, since each one effectively counts as a
full cache miss.  With cached memory, every cache miss loads the whole
cacheline, making nearby reads available very quickly afterwards.

> * Maybe add support for skipping unnecessary destination reads? Ie.,
>   in the case of OVER, there is no reason to read the destination if
>   the mask is 0, or if the combined (SRC IN MASK) is 1.

That seems to be a valid optimisation for the rounded-rectangle case at
least, probably less so for text.  Text has more edges, so most SIMD
vectors (which are about the width of a typical glyph) would end up with
at least one translucent pixel to deal with.

It is probably possible to preprocess the source and mask vectors to
make a binary go/not decision on a per-vector basis fairly cheaply.  For
situations where memory bandwidth is the constraint, that's a good idea.
It could also allow eliminating the multiplies for that vector.
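
As a rough illustration of such a per-vector decision, here is a scalar C sketch that classifies a block of 8 alpha values with one OR-reduction and one AND-reduction. The enum and function names are hypothetical; a real NEON version would do the reductions in vector registers, and whether the resulting branch pays off is exactly the open question discussed here.

```c
#include <stdint.h>

enum block_kind { BLOCK_TRANSPARENT, BLOCK_OPAQUE, BLOCK_MIXED };

/* Classify 8 combined alpha values (for example SRC IN MASK):
 * all zero  -> destination is unchanged, nothing to read or write;
 * all 0xff  -> destination is overwritten, no read or multiply needed;
 * otherwise -> full blending is required for this block. */
static enum block_kind classify_alpha_block(const uint8_t alpha[8])
{
    uint8_t all_or = 0x00;
    uint8_t all_and = 0xff;

    for (int i = 0; i < 8; i++) {
        all_or |= alpha[i];
        all_and &= alpha[i];
    }

    if (all_or == 0x00)
        return BLOCK_TRANSPARENT;
    if (all_and == 0xff)
        return BLOCK_OPAQUE;
    return BLOCK_MIXED;
}
```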

But it's also worth pointing out that sequential reads are much faster
than scattered reads or read/write/read cycles, and that branches are
themselves potentially expensive - and on Snapdragon, conditional
execution is unpredicted, making it difficult to use efficiently (unlike
conditional branches).

Except for rather large round-rectangle-type cases, a simplified inner
loop and a whole-scanline preload might be a better idea.  Of course,
real benchmarks would be needed to tell the difference reliably.

--
------
From: Jonathan Morton
      jonathan.morton@...



Re: pixman: New ARM NEON optimizations

by Koen Kooi-2

On 26-10-09 17:14, Soeren Sandmann wrote:

>> Technically, there should be no problem catching up with SSE2,
>> especially if instructions scheduling perfection could be skipped at
>> the first stage. Right now only the existing NEON fast path
>> functions are reimplemented plus just a few more.
>
> Right, instruction scheduling is the main disadvantage to using the
> assembler as opposed to intrinsics. Cortex A9 is out-of-order I
> believe, so it will have different scheduling requirements than the
> A8, which in turn likely has different scheduling requirements than
> earlier in-order CPUs. Though, a reasonable approach might be to
> assume that the A9 will not be very sensitive to scheduling at all
> (due to it being out-of-order), and then simply optimize for the A8.

A9 has a big downside though: the NEON block was cut down to make room
for a full VFP block instead of the VFPlite block in the A8 cores, so
using NEON will be slower compared to A8. Hopefully the out-of-order
execution will make up for it, but pure NEON code will take a hit.

regards,

Koen


Re: pixman: New ARM NEON optimizations

by Siarhei Siamashka

On Monday 26 October 2009, Soeren Sandmann wrote:

> Hi,
>
> > This branch has new ARM NEON optimizations:
> > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-neon-update
>
> In general, I like this a lot, though I have some questions and
> comments about it. See below.
>
> > The reasons to use GNU assembler are:
> >
> > 1. Full control over registers allocation (there are not too many of
> > them, considering that up to 3 images are supported with their
> > strides, pointers, prefetch stuff). I encountered problems running
> > out of registers with inline assembly and compiling with frame
> > pointer.
> >
> > 2. This allows the use of more or less advanced macro preprocessor
> > and makes everything easier. A bit more flexible option would be to
> > use JIT code generation here (this is actually something to consider
> > later).
>
> Yes, JIT compiling is very much worth considering, and in fact, your
> general framework pretty much *is* a JIT compiler, except that it runs
> at compile time and is slightly difficult to read because it is
> written in the GNU as meta-language.
>
> > Technically, there should be no problem catching up with SSE2,
> > especially if instructions scheduling perfection could be skipped at
> > the first stage. Right now only the existing NEON fast path
> > functions are reimplemented plus just a few more.
>
> Right, instruction scheduling is the main disadvantage to using the
> assembler as opposed to intrinsics.

Well, in practice it's the other way around :) Compilers (at least gcc)
usually do a really bad job of scheduling instructions; probably they are
spoiled by the dominance of out-of-order processors nowadays. They also
sometimes do a bad job of register allocation, especially when an algorithm
uses many variables.

> Cortex A9 is out-of-order I believe, so it will have different scheduling
> requirements than the A8, which in turn likely has different scheduling
> requirements than earlier in-order CPUs. Though, a reasonable approach might
> be to assume that the A9 will not be very sensitive to scheduling at all
> (due to it being out-of-order), and then simply optimize for the A8.

More or less all the processors share similar properties. Instructions are
pipelined and have some latency before providing a result; dual issue is
possible if they don't depend on each other and the CPU has enough execution
units available to run them. Jonathan provided a very detailed explanation.

> > Now the thing to solve is how to handle the systems other than
> > linux. There is a potential problem with ABI compatibility - the
> > functions must be fully compatible with the calling conventions,
> > etc. For now I'm only sure that they are compatible with Linux
> > EABI. Most likely the other systems should be fine too, or will be
> > fine with a few tweaks.
>
> I think it's perfectly fine even if it is Linux specific at first;
> people interested in other operating systems can feel free to send
> patches.

I was more worried about what to do with the old NEON code. How do we
know whether it is used (or will be used) by anyone on any non-Linux
systems?

I would probably even go as far as removing old NEON optimizations completely.
They are available in 0.16.x versions of pixman and can be taken back into
action if needed. Feedback from the users of Windows Mobile, Symbian and
maybe some other systems running on ARM would be welcome.

> Comments on the code:
>
> I have mostly looked at the general framework, and paid less attention
> to the various uses of it, except as a reference to understand how the
> framework works. And of course, since I'm unfamiliar with both GNU as
> and ARM, I may have misread things.
>
> Overall, I like this a lot. As you say, it seems to be general enough
> to support pretty much all the pixman operations, as long as there are
> no transformations involved or 24 bpp formats (which are always a
> pain).

Yes, and even 24bpp can be supported, though it will add a lot more
conditionally compiled parts and clutter the code a bit.

24bpp may be quite useful for accelerating some of the GDK stuff. I am
actually thinking about also reusing the NEON graphics optimizations in other
libraries like GDK, SDL and maybe something else in order to improve the
overall performance of software on Linux in general.

> It would be interesting to do something similar for the SSE2 backend.

I actually thought that a VMX variant may be the easiest and most
straightforward conversion.

> General
>
> If possible, I think it would be useful to break down the
> composite_composite_function macro into smaller bits, that mirror the
> structure of the generated code, and then add a comment to the top of
> each sub-routine. For example, a sub-macro for the initialization of
> the registers plus setup, one for the left unaligned part of the
> scanline, one for the middle part, one for the right unaligned part,
> one for moving to the next line, and one for doing the small rectangle
> part.
>
> I think this would make it easier to grasp what is going on.

I was considering adding more comments there, but was not sure whether it
makes much sense before all the features are implemented (like 24bpp support).

> Also, in various places, more specific comments would be useful, in
> addition to the (generally good) highlevel comments that are already
> there.
>
> For example,
>
> * I don't fully understand what the abits argument to the load/store
>   functions is supposed to be doing. Does it have to do with masking
>   in 0xff as appropriate? Part of this may be that I don't know what
>
>      [ &mem_operand&, :&abits& ]!
>
>   means

That's an alignment specifier. Load/store instructions with a strictly
specified alignment (128-bit or more) are a bit faster. Of course, the
memory address then needs to be properly aligned, otherwise we get an
exception.

>
> * A comment that .regs_shortage really means that H and W are spilled
>   to the stack.

Yes, for the most complex cases like having source, mask and destination,
there are not enough spare registers for all the local variables, so some
of the data has to go to stack.

> and similar.
>
> Though in general, the code is well commented.
>
> Deinterleaving:
>
> What is the benefit of deinterleaving?

With deinterleaving we can load the following data
A1R1G1B1 A2R2G2B2 A3R3G3B3 A4R4G4B4 A5R5G5B5 A6R6G6B6 A7R7G7B7 A8R8G8B8

... into four 64-bit NEON registers like this:
A1A2A3A4A5A6A7A8 R1R2R3R4R5R6R7R8 G1G2G3G4G5G6G7G8 B1B2B3B4B5B6B7B8

It can be done either by a dedicated VLD4 instruction (takes 4 cycles) or
by a VLD1 instruction followed by 4 VUZP instructions (takes 3 + 4 cycles).

And then it is easy to do bulk SIMD multiplication of A1A2A3A4A5A6A7A8 by
R1R2R3R4R5R6R7R8 or similar operations. Doing eight 8-bit multiplications
per cycle makes alpha blending really fast.
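
The same idea can be written as a scalar C sketch. It works on pixel values rather than on the byte order in memory, and the rounding division by 255 is just one common way to normalize the product, not necessarily what the NEON code in the branch does. The function names are hypothetical.

```c
#include <stdint.h>

/* Split 8 packed a8r8g8b8 pixels into four planes, which is the layout
 * that a deinterleaving load produces in four 64-bit registers. */
static void deinterleave8(const uint32_t px[8], uint8_t a[8], uint8_t r[8],
                          uint8_t g[8], uint8_t b[8])
{
    for (int i = 0; i < 8; i++) {
        a[i] = (uint8_t)(px[i] >> 24);
        r[i] = (uint8_t)(px[i] >> 16);
        g[i] = (uint8_t)(px[i] >> 8);
        b[i] = (uint8_t)px[i];
    }
}

/* Multiply two planes lane by lane and divide by 255 with rounding.
 * With planar data this is one uniform operation over all 8 lanes. */
static void mul_plane(const uint8_t x[8], const uint8_t y[8], uint8_t out[8])
{
    for (int i = 0; i < 8; i++) {
        uint32_t t = (uint32_t)x[i] * y[i] + 0x80;
        out[i] = (uint8_t)((t + (t >> 8)) >> 8);
    }
}
```

With the packed layout, the alpha of each pixel would first have to be broadcast next to its own color channels before multiplying; the planar layout avoids that shuffling.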

> Why is there no deinterleaving for the inner loop, and only for the
> head and tail? Are the callers supposed to do this themselves if they
> want it?

Deinterleaving is optional, and the developer who is implementing a new fast
path function may decide not to use it. It just does not make sense for
SRC or ADD operations, but helps OVER a lot.

Inner loop has to do some memory operations, so it has to use the right
instruction (either VLD1 or VLD4) depending on the preferred data
representation.

> Cache preload:
>
> * As far as I can tell, this macro is preloading with PF_X relative to
>   PF_SRC, but PF_X is always an offset into a scanline because you
>   subtract ORIG_W whenever it is exceeded, and PF_SRC is set to SRC, but
>   never updated anywhere that I can see. So it preloads different
>   parts of the first line over and over?

PF_SRC is updated by the LDRGEB instruction (it uses the pre-increment
addressing mode with writeback), so it gets moved to the next scanline there.

> * It seems like you could save a bunch of registers by simply always
>   prefetching some fixed number of pixels ahead of where you are going
>   to read. Or alternatively, just dump PF_X and keep the number of
>   pixels to prefetch ahead in that register. But presumably there is
>   some reason not to do this.

Yes, it is just generally faster, I posted some benchmarks earlier:
http://lists.cairographics.org/archives/cairo/2009-July/017813.html

A longer explanation is the following. Because at least OMAP3 has
rather slow memory, the prefetch distance needs to be quite large.
Depending on the performance of the data processing part, prefetch may need
to be up to ~300 bytes ahead to work efficiently. 300 bytes ahead is
something like 150 pixels for 16bpp, which is rather a lot.

When using some fixed prefetch distance, handling of relatively small
images becomes inefficient when the stride is much larger than the
width. So a 300 bytes ahead prefetch would be totally useless and even
harmful for processing images narrower than 150 pixels. Though
admittedly, most images have stride more or less equal to width (unless
somebody makes heavy use of "subimages"). On the other hand, when
prefetching is needed for the destination buffer (for the OVER operation),
fine-grained prefetch helps a lot because it is quite common for stride and
width to differ for the destination buffer.

> * If prefetch_distance is 0, shouldn't this macro not generate
>   anything at all?

Well, this behavior is undefined at the moment :)

Having an option to either disable prefetch completely or use a simple "fixed
distance ahead" prefetcher may be a useful addition.

A simple prefetcher would probably be useful for A5 cores, based on the
information from Jonathan.

> * I see that you have (H - 1) stored in the upper 28 bits of PF_CTL,
>   but are those bits actually being used for anything other than
>   preventing the ldrs? Ie., it will still attempt to use the plds
>   below the image, right?

It is used as the counter which limits the number of jumps to the next
scanline. Once the counter gets negative, the scanline gets latched
and prefetch may run over the last scanline multiple times but never
goes below it.
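
A simplified C model of this bookkeeping may help. It is not a translation of the macro: the struct, its fields and the function name are hypothetical, and the scanline counter here stops at zero instead of going negative, which gives the same visible behavior.

```c
/* Simplified model of the prefetch position. */
struct prefetch_state {
    int pf_x;        /* pixel offset of the prefetch position in its line */
    int pf_line;     /* index of the scanline currently being prefetched */
    int lines_left;  /* starts at height - 1; no more jumps once it is 0 */
};

/* Advance the prefetch position by step pixels (step <= width assumed).
 * When it runs past the end of the scanline it wraps around and, while
 * scanlines remain, moves down one line. After the last jump it keeps
 * wrapping within the last scanline and never goes below the image. */
static void prefetch_advance(struct prefetch_state *s, int width, int step)
{
    s->pf_x += step;
    if (s->pf_x >= width) {
        s->pf_x -= width;
        if (s->lines_left > 0) {
            s->lines_left--;
            s->pf_line++;
        }
    }
}
```

For an image of height 2, the state starts as { 0, 0, 1 }: the first wrap moves to line 1, and every later wrap stays on line 1.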

> If I'm misreading this completely, it would be good to have some more
> detailed comments here.
>
> A couple of things that it may or may not be worth thinking about:
>
> * Maybe add support for skipping unnecessary destination reads? Ie.,
>   in the case of OVER, there is no reason to read the destination if
>   the mask is 0, or if the combined (SRC IN MASK) is 1.
>
>   In my experience, this is a win for many images that
>   happen in practice. Consider these common cases:
>
>          - Text: most pixels are either fully opaque or fully
>            transparent, and in either case, no destination read is
>            necessary.
>
>          - Rounded antialiased rectangle. The corners are transparent
>            and the body is opaque or transparent. Essentially none of
>            the pixels actually need the destination.

NEON code processes 8 pixels at once. Checking individual pixels is not going
to work well. Checking any NEON computed result (like "SRC IN MASK") and
branching based on it is also not going to work well just because transferring
data from NEON to ARM is very slow (~20 cycles).

> * Does ARM have the equivalent of movnt? It may or may not be
>   interesting to use them for the operations that are essentially
>   solid fills.

There is no such instruction as far as I know.

> * I don't fully understand why you need the tail/head_tail/head split,
>   but if it is to save a branch instruction, maybe you could use the
>   standard compiler trick of turning a while loop into this:
>
>            jump test
>         body:
>            <body>
>         test:
>            <test code>
>            conditional_jump body.

No, it's just a trick to improve instruction scheduling. Let's suppose
that we have to use 4 instructions per pixel (L - load from the source buffer,
A and B - some arithmetic, S - store to the destination buffer). In this
case, a naive loop for processing 4 pixels would look like this:

(L1 A1 B1 S1) (L2 A2 B2 S2) (L3 A3 B3 S3) (L4 A4 B4 S4)

Let's also suppose that the four instructions for each pixel form a
dependency chain, which prevents dual issue.

Now let's split the pixel processing code into "head" (L A) and "tail" (B S)
parts. The pipelined loop processing this data would look like:

head1 (tail1 head2) (tail2 head3) (tail3 head4) tail4

or if we expand code to instructions and reorder them a bit:

L1 A1 (L2 B1 A2 S1) (L3 B2 A3 S2) (L4 B3 A4 S3) B4 S4

This is much better for performance because now we have pairs of instructions
which have some very nice properties:
1. they are now fully independent from each other
2. they use different execution units (load-store and arithmetic), which is
critical for NEON as it is the only dual issue possibility

So we get some small setup overhead, but the main loop can run up to twice
as fast because of dual issue possibilities.

But even without considering dual issue, the instructions may have (and do
have) some latencies. With the pipelined variant of the code, the distance
between L2 and S2, for example, is much larger and spans two loop
iterations. If we suppose that each instruction has a latency of 2 and
causes a single cycle stall after it in the original code, then the
pipelined code can execute without any stalls and the performance is
doubled again.

In practice, with real NEON fast path functions on Cortex-A8, pipelining
provides up to 30-40% speedup for working with the data fully cached in L1
cache.
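
The head/tail arrangement described above can be sketched in plain C. This
is an illustration of the loop structure only: the function names are
hypothetical, the arithmetic is a placeholder, and a C compiler is free to
reorder the result, so it says nothing about the actual NEON scheduling.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-pixel work split in two halves (the arithmetic itself is a placeholder). */
static uint32_t head(const uint32_t *src, size_t i)
{
    return src[i] + 1u;                 /* L: load, A: first arithmetic step */
}

static void tail(uint32_t *dst, size_t i, uint32_t t)
{
    dst[i] = t * 3u;                    /* B: second arithmetic step, S: store */
}

/* Naive loop: (L A B S) for each pixel in turn. */
static void process_naive(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tail(dst, i, head(src, i));
}

/* Pipelined loop: head1 (tail1 head2) ... (tail[n-1] head[n]) tail[n].
   dst and src are assumed not to overlap. */
static void process_pipelined(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (n == 0)
        return;
    uint32_t in_flight = head(src, 0);  /* prologue: head of the first pixel */
    for (size_t i = 1; i < n; i++) {
        uint32_t next = head(src, i);   /* head of pixel i ... */
        tail(dst, i - 1, in_flight);    /* ... paired with tail of pixel i - 1 */
        in_flight = next;
    }
    tail(dst, n - 1, in_flight);        /* epilogue: tail of the last pixel */
}
```

Both functions produce the same output. The pipelined one pays for a
prologue and an epilogue, and in exchange each loop iteration mixes
independent work from two different pixels.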

--
Best regards,
Siarhei Siamashka


_______________________________________________
cairo mailing list
cairo@...
http://lists.cairographics.org/mailman/listinfo/cairo

Re: pixman: New ARM NEON optimizations

by Koen Kooi-2

On 27-10-09 03:18, Siarhei Siamashka wrote:
> On Monday 26 October 2009, Soeren Sandmann wrote:

>>> Now the thing to solve is how to handle the systems other than
>>> linux. There is a potential problem with ABI compatibility - the
>>> functions must be fully compatible with the calling conventions,
>>> etc. For now I'm only sure that they are compatible with Linux
>>> EABI. Most likely the other systems should be fine too, or will be
>>> fine with a few tweaks.
>>
>> I think it's perfectly fine even if it is Linux specific at first;
>> people interested in other operating systems can feel free to send
>> patches.
>
> I was more worried about what to do with the old NEON code. How do we even
> know whether it is even used (or will be used) by anyone on any non-linux
> systems?
>
> I would probably even go as far as removing old NEON optimizations completely.
> They are available in 0.16.x versions of pixman and can be taken back into
> action if needed. Feedback from the users of Windows Mobile, Symbian and
> maybe some other systems running on ARM would be welcome.

WinMo and WinCE don't support NEON currently. You can hack it in, but
then you lose VFP support. Supposedly fixed in the next WinMo release
(R4?).

regards,

Koen

Re: pixman: New ARM NEON optimizations

by Soeren Sandmann-2

Hi,

> > I think it's perfectly fine even if it is Linux specific at first;
> > people interested in other operating systems can feel free to send
> > patches.
>
> I was more worried about what to do with the old NEON code. How do we even
> know whether it is even used (or will be used) by anyone on any non-linux
> systems?
>
> I would probably even go as far as removing old NEON optimizations completely.
> They are available in 0.16.x versions of pixman and can be taken back into
> action if needed. Feedback from the users of Windows Mobile, Symbian and
> maybe some other systems running on ARM would be welcome.

I agree. It makes sense to remove them at least in the 0.17 series,
and if people complain, we can pull them back in.

> Yes, and even 24bpp can be supported, though it will add a lot more
> of conditionally compiled parts and clutter the code a bit.
>
> 24bpp may be quite useful for accelerating some of the GDK stuff. I actually
> think about the idea of also reusing NEON graphics optimizations in different
> libraries like GDK, SDL and maybe something else in order to improve
> overall performance of the software in linux in general.

At some point it makes sense to make the pixman API available for
general consumption so that people can use it in cases like
those. This may require an API break first though, since parts of the
API are currently somewhat embarrassing (pixman_set_static_pointers(),
and pixman_blt()/fill() come to mind, but the 16 bit regions should
also eventually go away).

Some of those cases should probably just use cairo.

> > If possible, I think it would be useful to break down the
> > composite_composite_function macro into smaller bits, that mirror the
> > structure of the generated code, and then add a comment to the top of
> > each sub-routine. For example, a sub-macro for the initialization of
> > the registers plus setup, one for the left unligned part of the
> > scanline, one for the middle part, one for the right unaligned part,
> > one for moving to the next line, and one for doing the small rectangle
> > part.
> >
> > I think this would make it easier to grasp what is going on.
>
> I was considering to add more comments there, but was not sure whether it
> makes much sense before all the features are implemented (like 24bpp
> support).

Breaking it down into submacros that mirror the structure of the
generated code would be useful before pushing this, I think.

> > Also, in various places, more specific comments would be useful, in
> > addition to the (generally good) highlevel comments that are already
> > there.
> >
> > For example,
> >
> > * I don't fully understand what the abits argument to the load/store
> >   functions is supposed to be doing. Does it have to do with masking
> >   in 0xff as appropriate? Part of this may be that I don't know what
> >
> >      [ &mem_operand&, :&abits& ]!
>
> That's an alignment specifier. Load/store instructions with strictly
> specified alignment (128-bit or more) are a bit faster. And surely,
> memory address needs to be properly aligned, otherwise we get an
> exception.

Ok, that makes sense. I had gotten it into my head that "a" stood for alpha,
so I couldn't make heads or tails of it.

> > * If prefetch_distance is 0, shouldn't this macro not generate
> >   anything at all?
>
> Well, this behavior is undefined at the moment :)

But in pixman_composite_src_n_8_asm_neon(), you do set it to 0?



Soren
Re: pixman: New ARM NEON optimizations

by Jacob Bramley

> > Right, instruction scheduling is the main disadvantage to using the
> > assembler as opposed to intrinsics. Cortex A9 is out-of-order I
> > believe, so it will have different scheduling requirements than the
> > A8, which in turn likely has different scheduling requirements than
> > earlier in-order CPUs. Though, a reasonable approach might be to
> > assume that the A9 will not be very sensitive to scheduling at all
> > (due to it being out-of-order), and then simply optimize for the A8.
>
> A9 has a big downside though: the NEON block was cut-down to make place
> for a full VFP block instead of the VFPlite block in the A8 cores, so
> using NEON will be slower compared to A8. Hopefully the out-of-order
> will make up for it, but pure NEON code will take a hit.

Where did you hear this? A8 does indeed have VFPlite (as opposed to A9's full VFP
implementation), but A9 certainly doesn't have a cut-down NEON.

Something to be aware of, however, is that A9 can't issue a VFP instruction until the previous
NEON instruction has cleared out of the pipeline (and vice-versa), so interleaving NEON and VFP
really hurts on A9 but didn't really make a huge difference on A8.




Re: pixman: New ARM NEON optimizations

by Koen Kooi-2

On 27-10-09 15:13, Jacob Bramley wrote:

>>> Right, instruction scheduling is the main disadvantage to using the
>>> assembler as opposed to intrinsics. Cortex A9 is out-of-order I
>>> believe, so it will have different scheduling requirements than the
>>> A8, which in turn likely has different scheduling requirements than
>>> earlier in-order CPUs. Though, a reasonable approach might be to
>>> assume that the A9 will not be very sensitive to scheduling at all
>>> (due to it being out-of-order), and then simply optimize for the A8.
>>
>> A9 has a big downside though: the NEON block was cut-down to make place
>> for a full VFP block instead of the VFPlite block in the A8 cores, so
>> using NEON will be slower compared to A8. Hopefully the out-of-order
>> will make up for it, but pure NEON code will take a hit.
>
> Where did you hear this? A8 does indeed have VFPlite (as opposed to A9's full VFP
> implementation), but A9 certainly doesn't have a cut-down NEON.

I keep hearing it in various places on the internet, but apparently I'm
wrong. Laurent went on record and said "the A9 NEON unit is the exact
same as the A8 one".

> Something to be aware of, however, is that A9 can't issue a VFP instruction until the previous
> NEON instruction has cleared out of the pipeline (and vice-versa), so interleaving NEON and VFP
> really hurts on A9 but didn't really make a huge difference on A8.

I guess that's where the confusion stems from. Let's hope google indexes
this thread so the truth will surface :)

regards,

Koen


Re: pixman: New ARM NEON optimizations

by Jonathan Morton

On Tue, 2009-10-27 at 15:23 +0100, Koen Kooi wrote:
> > Something to be aware of, however, is that A9 can't issue a VFP instruction until the previous
> > NEON instruction has cleared out of the pipeline (and vice-versa), so interleaving NEON and VFP
> > really hurts on A9 but didn't really make a huge difference on A8.
>
> I guess that's where the confusing stems from. Let's hope google indexes
> this thread so the truth will surface :)

It's also a potentially significant difference from PowerPC's Altivec
(or VMX).  The PPC G4 and G5 have scalar FPUs that are totally separate
from the vector unit, allowing potentially 5 or 6 single-precision FLOPS
per cycle if you really push hard.  This is helped by the triple and
quadruple issue front-ends, which allow some memory and control traffic
in addition to full FP throughput.

On ARMv7-A, the "scalar" FPU is aliased into the NEON registers (and
possibly other hardware), which is probably why the A9 requires the
pipeline flush when switching between the two.  It's still worthwhile to
run integer control logic "behind" the vector computation though, and
double-issue is good enough for that.

Of course, we're not doing heavy FP in Pixman...

Another trick about the A9 I found by scraping the public docs: the A9
has a wide and flexible dispatch logic (each execution unit can pull an
instruction from the queue), but can only decode 2 instructions per
cycle (which then join the dispatch queue).

So for most purposes you should treat it as being dual-issue like an A8,
but with far fewer rules about which instructions can go together, and
less worry about precise latencies - obviously you can still be held up
by saturating an execution unit.  I can't tell whether it is able to
fold a branch in addition to the two full decodes, but if you optimise
for A8 that doesn't matter.

--
------
From: Jonathan Morton
      jonathan.morton@...


Re: pixman: New ARM NEON optimizations

by Siarhei Siamashka

On Tuesday 27 October 2009, Soeren Sandmann wrote:
> > I would probably even go as far as removing old NEON optimizations
> > completely. They are available in 0.16.x versions of pixman and can be
> > taken back into action if needed. Feedback from the users of Windows
> > Mobile, Symbian and maybe some other systems running on ARM would be
> > welcome.
>
> I agree. It makes sense to remove them at least in the 0.17 series,
> and if people complain, we can pull them back in.

OK

> > Yes, and even 24bpp can be supported, though it will add a lot more
> > of conditionally compiled parts and clutter the code a bit.
> >
> > 24bpp may be quite useful for accelerating some of the GDK stuff. I
> > actually think about the idea of also reusing NEON graphics optimizations
> > in different libraries like GDK, SDL and maybe something else in order to
> > improve overall performance of the software in linux in general.
>
> At some point it makes sense to make the pixman API available for
> general consumption so that people can use it in cases like
> those. This may require an API break first though, since parts of the
> API are currently somewhat embarrassing (pixman_set_static_pointers(),
> and pixman_blt()/fill() come to mind, but the 16 bit regions should
> also eventually go away).

It might not be an API break but an extension. A function which would get
information about source, mask and destination image formats and some
other hints (like solid mask or source) and just return a pointer to
a simple directly callable function would do this job. It can return
NULL if an optimized implementation is not available, or even generate
function code using JIT if it is able to do this.

In any case, simple functions which use only basic data types as arguments
(nothing like pixman_image_t) and have no checks for clipping or whatever
else can be used from the other libraries efficiently.
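
A minimal sketch of what such an extension could look like is shown here.
Every name in it (pixel_format, composite_func, lookup_composite_func,
src_8888_0565) is hypothetical and none of it is existing pixman API; only
a plain copy from x8r8g8b8 to r5g6b5 is provided, and the operator and the
"other hints" are left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; none of this is existing pixman API. */
typedef enum { FMT_NONE, FMT_A8, FMT_R5G6B5, FMT_X8R8G8B8, FMT_A8R8G8B8 } pixel_format;

/* Only basic data types as arguments. Strides are in bytes; the mask
   arguments are ignored by functions that take no mask. No clipping checks. */
typedef void (*composite_func)(int width, int height,
                               void *dst, int dst_stride,
                               const void *src, int src_stride,
                               const void *mask, int mask_stride);

/* Plain C stand-in for an optimized x8r8g8b8 -> r5g6b5 copy. */
static void src_8888_0565(int width, int height,
                          void *dst, int dst_stride,
                          const void *src, int src_stride,
                          const void *mask, int mask_stride)
{
    (void)mask;
    (void)mask_stride;
    for (int y = 0; y < height; y++) {
        const uint32_t *s = (const uint32_t *)((const uint8_t *)src + y * src_stride);
        uint16_t *d = (uint16_t *)((uint8_t *)dst + y * dst_stride);
        for (int x = 0; x < width; x++) {
            uint32_t p = s[x];
            /* Keep the top 5/6/5 bits of the R, G and B channels. */
            d[x] = (uint16_t)(((p >> 8) & 0xF800) | ((p >> 5) & 0x07E0) | ((p >> 3) & 0x001F));
        }
    }
}

struct fast_path {
    pixel_format src, mask, dst;
    composite_func func;
};

static const struct fast_path fast_paths[] = {
    { FMT_X8R8G8B8, FMT_NONE, FMT_R5G6B5, src_8888_0565 },
};

/* Returns a directly callable function, or NULL if none is available. */
static composite_func lookup_composite_func(pixel_format src, pixel_format mask,
                                            pixel_format dst)
{
    for (size_t i = 0; i < sizeof fast_paths / sizeof fast_paths[0]; i++) {
        const struct fast_path *fp = &fast_paths[i];
        if (fp->src == src && fp->mask == mask && fp->dst == dst)
            return fp->func;
    }
    return NULL;
}
```

A caller would do the lookup once, keep the returned pointer, and then call
it for every rectangle it has already validated itself.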

> Some of those cases should probably just use cairo.

Yes, but not in all cases. If some library does not support image clipping and
expects the client application to provide valid data, doing this validation
just wastes CPU cycles.

> > > If possible, I think it would be useful to break down the
> > > composite_composite_function macro into smaller bits, that mirror the
> > > structure of the generated code, and then add a comment to the top of
> > > each sub-routine. For example, a sub-macro for the initialization of
> > > the registers plus setup, one for the left unligned part of the
> > > scanline, one for the middle part, one for the right unaligned part,
> > > one for moving to the next line, and one for doing the small rectangle
> > > part.
> > >
> > > I think this would make it easier to grasp what is going on.
> >
> > I was considering to add more comments there, but was not sure whether it
> > makes much sense before all the features are implemented (like 24bpp
> > support).
>
> Breaking it down into submacros that mirror the structure of the
> generated code would be useful before pushing this, I think.

Done. At least, I split out some parts of the code which are more or less
logically independent.

> > > * If prefetch_distance is 0, shouldn't this macro not generate
> > >   anything at all?
> >
> > Well, this behavior is undefined at the moment :)
>
> But in pixman_composite_src_n_8_asm_neon(), you do set it to 0?

This is also addressed now.


Here is an update for pixman NEON optimizations based on the comments:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=arm-neon-update-v2

Changes also include:
* support for 24bpp pixel formats
* choice between 'none', 'simple' and 'advanced' prefetch methods
* old NEON code dropped, which also eliminates the float ABI selection
controversy
* more comments

The automated tests also pass, and the updated pixman works fine on a real
device without visible problems.

Things to do next:
* finetune performance for some of the functions
* add GDK pixbuf related fast path functions
* try to reduce dispatch overhead in pixman-arm-neon.c (maybe some, probably
autogenerated, if/else/switch/case code in arm_neon_composite that decides
which NEON fast path function to use, instead of calling
_pixman_run_fast_path, can shave off a bit of overhead).

--
Best regards,
Siarhei Siamashka


Re: pixman: New ARM NEON optimizations

by Chris Wilson-11

Excerpts from Siarhei Siamashka's message of Wed Nov 04 16:36:21 +0000 2009:
> In any case, simple functions which use only basic data types as arguments
> (nothing like pixman_image_t) and has no checks for clipping or whatever
> else can be used from the other libraries efficiently.

+1. ;-)

I could also use a pixman-core, though the overhead after applying
Søren's flags branch is not large enough to justify the effort, yet.

[snip]
> Things to do next:
> * finetune performance for some of the functions
> * add GDK pixbuf related fast path functions
> * try to reduce dispatch overhead in pixman-arm-neon.c (maybe adding some
> probably autogenerated code with if/else/switch/case mess in
> arm_neon_composite to decide which NEON fast path function can be used instead
> of calling _pixman_run_fast_path can reduce a bit of overhead).

Before going too far along the performance, make sure you are also
checking with Søren's work in reducing overheads.

Also I'm very interested in seeing what the profiles look like for
pixman and cairo on ARM. Just knowledge of the behaviour of your target
applications would be useful when thinking about how to tune cairo.
-ickle
--
Chris Wilson, Intel Open Source Technology Centre
Re: pixman: New ARM NEON optimizations

by Soeren Sandmann-2

Chris Wilson <chris@...> writes:

> Excerpts from Siarhei Siamashka's message of Wed Nov 04 16:36:21 +0000 2009:
> > In any case, simple functions which use only basic data types as arguments
> > (nothing like pixman_image_t) and has no checks for clipping or whatever
> > else can be used from the other libraries efficiently.
>
> +1. ;-)
>
> I could also use a pixman-core, though the overhead after applying
> Søren's flags branch is not large enough to justify the effort, yet.

The case where pixman has a lot of overhead is glyphs, which is why
pixman needs explicit support for that.

If you have other use cases than glyphs, I'd be interested in hearing
about them.



Soren
Re: pixman: New ARM NEON optimizations

by Chris Wilson-11

Excerpts from Soeren Sandmann's message of Thu Nov 05 11:04:30 +0000 2009:
> Chris Wilson <chris@...> writes:
> > I could also use a pixman-core, though the overhead after applying
> > Søren's flags branch is not large enough to justify the effort, yet.
>
> The case where pixman has a lot of overhead is glyphs, which is why
> pixman needs explicit support for that.

I'd argue the opposite. pixman does not need explicit support for
glyphs, just support for passing a RECTLIST to pixman_image_composite().
This then covers spans as well but will still likely incur overhead from
per RECT region computation. A polygon image can be added as a later
refinement. The only question then is who has the most information to
decide what the best image format to use for caches and other
intermediates. It's a mix of quality (application) and
hardware/behaviour characteristics.

But at the moment, about the only thing I truly want to add to the API
is a LERP operator.
-ickle
--
Chris Wilson, Intel Open Source Technology Centre



0001-Add-LERP-operator.patch (12K) Download Attachment

Re: pixman: New ARM NEON optimizations

by Siarhei Siamashka

On Thursday 05 November 2009, Chris Wilson wrote:
> Excerpts from Siarhei Siamashka's message of Wed Nov 04 16:36:21 +0000 2009:
[...]

> > * try to reduce dispatch overhead in pixman-arm-neon.c (maybe adding some
> > probably autogenerated code with if/else/switch/case mess in
> > arm_neon_composite to decide which NEON fast path function can be used
> > instead of calling _pixman_run_fast_path can reduce a bit of overhead).
>
> Before going too far along the performance, make sure you are also
> checking with Søren's work in reducing overheads.
>
> Also I'm very interested in seeing what the profiles look like for
> pixman and cairo on ARM. Just knowledge of the behaviour of your target
> applications would be useful when thinking about how to tune cairo.

ARM seems to be faster at processing pixels (per MHz) than x86, probably
because its SIMD is more suited for this task (it supports long 8-bit
multiplications, shifts with rounding, shifts with insert). But ARM is slower
at running the rest of the code. That's why the time spent in the intermediate
layers shows up higher on ARM and looks like a problem. Anything that can
improve the situation is very much welcome.
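
To illustrate why those operations matter, here is a scalar C sketch of the
usual "multiply two 8-bit values and divide by 255" step used in pixel
blending, written only in terms of a widening multiply, a rounding shift and
a rounding narrowing add. The function names are hypothetical, and which
instructions the NEON fast paths actually use is not stated in this message.

```c
#include <stdint.h>

/* Rounding right shift by 8, as a per-lane "shift with rounding" would do. */
static uint16_t rounding_shift_right_8(uint16_t x)
{
    return (uint16_t)(((uint32_t)x + 128u) >> 8);
}

/* a * b / 255, rounded to nearest, without a division. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint16_t wide = (uint16_t)(a * b);          /* long multiply: 8 x 8 -> 16 bits */
    uint16_t t = rounding_shift_right_8(wide);  /* shift with rounding */
    /* Add with rounding and keep the high half of the 16-bit sum. */
    return (uint8_t)(((uint32_t)wide + t + 128u) >> 8);
}
```

When each of these three steps is a single SIMD instruction working on 8
pixels' worth of channel data, the whole multiply costs very few instructions.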

I can post some profiles, the problem is which use case to select :)
Do you have any preferences in particular?

--
Best regards,
Siarhei Siamashka


Re: pixman: New ARM NEON optimizations

by Siarhei Siamashka

On Thursday 05 November 2009, Soeren Sandmann wrote:
> Chris Wilson <chris@...> writes:
> > Excerpts from Siarhei Siamashka's message of Wed Nov 04 16:36:21 +0000 2009:

> > > In any case, simple functions which use only basic data types as
> > > arguments (nothing like pixman_image_t) and has no checks for clipping
> > > or whatever else can be used from the other libraries efficiently.
> >
> > +1. ;-)
> >
> > I could also use a pixman-core, though the overhead after applying
> > Søren's flags branch is not large enough to justify the effort, yet.
>
> The case where pixman has a lot of overhead is glyphs, which is why
> pixman needs explicit support for that.
>
> If you have other use cases than glyphs, I'd be interested in hearing
> about them.

It is quite easy to construct such a case. For example, a simple demo program
which renders a really large number of small particles (to get some nice
visual effect) using images with alpha transparency.

It's kind of a chicken-and-egg problem. Slow operations do not get proper
optimizations because they are not used in real applications. And application
developers probably avoid these slow operations just because ... they are
slow :)

As you noticed earlier, software RENDER extension implementation in xserver
suffers from creating and destroying temporary pixman_image_t structures for
each operation in fbComposite function (PicturePtr and pixman_image_t are
practically duplicates of each other). But this is not a good excuse to be
wasteful regarding CPU cycles in pixman too. If anything can be simplified and
optimized even a bit with relatively little effort, it probably should be
done. Or is it better to fix xserver first and then look at pixman performance
again?

--
Best regards,
Siarhei Siamashka

