|
View:
New views
10 Messages
—
Rating Filter:
Alert me
|
|
|
optimizing query with multiple aggregatesI have a reporting query that is taking nearly all of it's time in aggregate functions and I'm trying to figure out how to optimize it. The query takes approximately 170ms when run with "select *", but when run with all the aggregate functions the query takes 18 seconds. The slowness comes from our attempt to find distribution data using selects of the form:
SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END) repeated across many different x,y values and fields to build out several histograms of the data. The main culprit appears to be the CASE statement, but I'm not sure what to use instead. I'm sure other people have had similar queries and I was wondering what methods they used to build out data like this? Thanks for your help, Doug |
|
|
Re: optimizing query with multiple aggregatesOn Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@...> wrote:
> I have a reporting query that is taking nearly all of it's time in aggregate > functions and I'm trying to figure out how to optimize it. The query takes > approximately 170ms when run with "select *", but when run with all the > aggregate functions the query takes 18 seconds. The slowness comes from our > attempt to find distribution data using selects of the form: > > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END) > > repeated across many different x,y values and fields to build out several > histograms of the data. The main culprit appears to be the CASE statement, > but I'm not sure what to use instead. I'm sure other people have had > similar queries and I was wondering what methods they used to build out data > like this? have you tried: count(*) where field >= x AND field < y; ?? merlin -- Sent via pgsql-performance mailing list (pgsql-performance@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
|
|
Re: optimizing query with multiple aggregatesOn Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure <mmoncure@...> wrote:
> > On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@...> wrote: > > I have a reporting query that is taking nearly all of it's time in aggregate > > functions and I'm trying to figure out how to optimize it. The query takes > > approximately 170ms when run with "select *", but when run with all the > > aggregate functions the query takes 18 seconds. The slowness comes from our > > attempt to find distribution data using selects of the form: > > > > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END) > > > > repeated across many different x,y values and fields to build out several > > histograms of the data. The main culprit appears to be the CASE statement, > > but I'm not sure what to use instead. I'm sure other people have had > > similar queries and I was wondering what methods they used to build out data > > like this? > > have you tried: > > count(*) where field >= x AND field < y; > > ?? > > merlin Unless I'm misunderstanding you, that would require breaking each bin into a separate sql statement and since I'm trying to calculate more than 100 bins between the different fields any improvement in the aggregate functions would be overwhelmed by the cost of the actual query, which is about 170ms. Thanks, Doug -- Sent via pgsql-performance mailing list (pgsql-performance@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
|
|
Re: optimizing query with multiple aggregatesOn Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@...> wrote:
Use group by with an appropriate division/rounding to create the appropriate buckets, if they're all the same size. select round(field/100) as bucket, count(*) as cnt from foo group by round(field/100); -- - David T. Wilson david.t.wilson@... |
|
|
Re: optimizing query with multiple aggregatesSo you've got a query like:
SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN 1 ELSE 0 END) as zeroToTen,
SUM(CASE WHEN field >= 10 AND field < 20 THEN 1 ELSE 0 END) as tenToTwenty,
SUM(CASE WHEN field >= 20 AND field < 30 THEN 1 ELSE 0 END) as tenToTwenty,
... FROM bigtable
My guess is this forcing a whole bunch of if checks and your getting cpu bound. Could you try something like:
SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN count ELSE 0 END) as zeroToTen,
SUM(CASE WHEN field >= 10 AND field < 20 THEN count ELSE 0 END) as tenToTwenty,
SUM(CASE WHEN field >= 20 AND field < 30 THEN count ELSE 0 END) as tenToTwenty,
... FROM (SELECT field, count(*) FROM bigtable GROUP BY field)
which will allow a hash aggregate? You'd do a hash aggregate on the whole table which should be quick and then you'd summarize your bins.
This all supposes that you don't want to just query postgres's column statistics.
On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole <dougcole@...> wrote:
|
|
|
Re: optimizing query with multiple aggregatesOn Wed, Oct 21, 2009 at 03:51:25PM -0700, Doug Cole wrote:
> I have a reporting query that is taking nearly all of it's time in aggregate > functions and I'm trying to figure out how to optimize it. The query takes > approximately 170ms when run with "select *", but when run with all the > aggregate functions the query takes 18 seconds. The slowness comes from our > attempt to find distribution data using selects of the form: > > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END) > > repeated across many different x,y values and fields to build out several > histograms of the data. The main culprit appears to be the CASE statement, > but I'm not sure what to use instead. I'm sure other people have had > similar queries and I was wondering what methods they used to build out data > like this? > Thanks for your help, > Doug Hi Doug, Have you tried using the width_bucket() function? Here is a nice article describing its use for making histograms: http://quantmeditate.blogspot.com/2005/03/creating-histograms-using-sql-function.html Regards, Ken -- Sent via pgsql-performance mailing list (pgsql-performance@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
|
|
Re: optimizing query with multiple aggregatesOn Wed, Oct 21, 2009 at 10:21 PM, Doug Cole <dougcole@...> wrote:
> On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure <mmoncure@...> wrote: >> >> On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@...> wrote: >> > I have a reporting query that is taking nearly all of it's time in aggregate >> > functions and I'm trying to figure out how to optimize it. The query takes >> > approximately 170ms when run with "select *", but when run with all the >> > aggregate functions the query takes 18 seconds. The slowness comes from our >> > attempt to find distribution data using selects of the form: >> > >> > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END) >> > >> > repeated across many different x,y values and fields to build out several >> > histograms of the data. The main culprit appears to be the CASE statement, >> > but I'm not sure what to use instead. I'm sure other people have had >> > similar queries and I was wondering what methods they used to build out data >> > like this? >> >> have you tried: >> >> count(*) where field >= x AND field < y; >> >> ?? >> >> merlin > > Unless I'm misunderstanding you, that would require breaking each bin > into a separate sql statement and since I'm trying to calculate more > than 100 bins between the different fields any improvement in the > aggregate functions would be overwhelmed by the cost of the actual > query, which is about 170ms. Well, you might be able to use subselects to fetch all the results in a single query, but it might still be slow. ...Robert -- Sent via pgsql-performance mailing list (pgsql-performance@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
|
|
Re: optimizing query with multiple aggregatesOn 10/21/09 3:51 PM, "Doug Cole" <dougcole@...> wrote: > I have a reporting query that is taking nearly all of it's time in aggregate > functions and I'm trying to figure out how to optimize it. The query takes > approximately 170ms when run with "select *", but when run with all the > aggregate functions the query takes 18 seconds. The slowness comes from our > attempt to find distribution data using selects of the form: > > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END) > > repeated across many different x,y values and fields to build out several > histograms of the data. The main culprit appears to be the CASE statement, > but I'm not sure what to use instead. I'm sure other people have had similar > queries and I was wondering what methods they used to build out data like > this? You might be able to do this with plain aggregates. Define a function that generates your partitions that you can group by, then aggregate functions for the outputs In either case, rather than each result being a column in one result row, each result will be its own row. Each row would have a column that defines the type of the result (that you grouped on), and one with the result value. If each is just a sum, its easy. If there are lots of different calculation types, it would be harder. Potentially, you could wrap that in a subselect to pull out each into its own column but that is a bit messy. Also, in 8.4 window functions could be helpful. PARTITION BY something that represents your buckets perhaps? http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html This will generally force a sort, but shouldn't be that bad. The function used for the group by or partition by could just be a big case statement to generate a unique int per bucket, or a truncate/rounding function. It just needs to spit out a unique result for each bucket for the group or partition. > Thanks for your help, > Doug > -- Sent via pgsql-performance mailing list (pgsql-performance@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
|
|
Re: optimizing query with multiple aggregatesHello,
I didn't try it, but following should be slightly
faster:
COUNT( CASE WHEN field >= x AND field < y THEN true
END)
intead of SUM( CASE WHEN field >= x AND field < y THEN 1 ELSE 0
END)
HTH,
Marc
Mamin
From: pgsql-performance-owner@... [mailto:pgsql-performance-owner@...] On Behalf Of Nikolas Everett Sent: Thursday, October 22, 2009 4:48 AM To: Doug Cole Cc: pgsql-performance Subject: Re: [PERFORM] optimizing query with multiple aggregates SELECT SUM(CASE
WHEN field >= 0 AND field < 10 THEN 1 ELSE 0 END) as
zeroToTen,
SUM(CASE WHEN
field >= 10 AND field < 20 THEN 1 ELSE 0 END) as
tenToTwenty,
SUM(CASE WHEN field >= 20 AND field < 30 THEN 1 ELSE 0
END) as tenToTwenty,
...
FROM
bigtable
My guess is this
forcing a whole bunch of if checks and your getting cpu bound. Could you
try something like:
SELECT SUM(CASE
WHEN field >= 0 AND field < 10 THEN count ELSE 0 END) as
zeroToTen,
SUM(CASE WHEN
field >= 10 AND field < 20 THEN count ELSE 0 END) as
tenToTwenty,
SUM(CASE WHEN field >= 20 AND field < 30 THEN count
ELSE 0 END) as tenToTwenty,
...
FROM (SELECT
field, count(*) FROM bigtable GROUP BY field)
which will allow a hash
aggregate? You'd do a hash aggregate on the whole table which should be
quick and then you'd summarize your bins.
This all supposes that
you don't want to just query postgres's column statistics.
On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole <dougcole@...> wrote:
|
|
|
Re: optimizing query with multiple aggregatesOn Thu, Oct 22, 2009 at 6:22 AM, Kenneth Marshall <ktm@...> wrote:
> On Wed, Oct 21, 2009 at 03:51:25PM -0700, Doug Cole wrote: >> I have a reporting query that is taking nearly all of it's time in aggregate >> functions and I'm trying to figure out how to optimize it. The query takes >> approximately 170ms when run with "select *", but when run with all the >> aggregate functions the query takes 18 seconds. The slowness comes from our >> attempt to find distribution data using selects of the form: >> >> SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END) >> >> repeated across many different x,y values and fields to build out several >> histograms of the data. The main culprit appears to be the CASE statement, >> but I'm not sure what to use instead. I'm sure other people have had >> similar queries and I was wondering what methods they used to build out data >> like this? >> Thanks for your help, >> Doug > > Hi Doug, > > Have you tried using the width_bucket() function? Here is a nice > article describing its use for making histograms: > > http://quantmeditate.blogspot.com/2005/03/creating-histograms-using-sql-function.html > > Regards, > Ken > Thanks Ken, I ended up going with this approach - it meant I had to break it into a lot more queries, one for each histogram, but even with that added overhead I cut the time down from 18 seconds to right around 1 second. Doug -- Sent via pgsql-performance mailing list (pgsql-performance@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| Free embeddable forum powered by Nabble | Forum Help |