Speeding up queries with simplified geometries.

View: New views
3 Messages — Rating Filter:   Alert me  

Speeding up queries with simplified geometries.

by JDS-6 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I've downloaded the Berkley administrative boundaries shape files and now I'm trying to formulate queries that will give me the administrative regions that contain any given lat/long combination. Obviously it's blindingly slow if I just search through the geometries using ST_Within, even though I've got it indexed. The problem is related to the number of points in the boundaries. So for example Canada, which has over a million points, is slow to test against. So I've created another indexed column which holds the bounding boxes for every geometry and doing ST_Within using that geometry is very fast, though it does return strange results, for example the lat/lng of Toronto airport is inside the bounding box for Alaska! But that's OK because at least I've cut the number of geometries to a handful. Then I need to search through that handful testing against the full geometries to eliminate the false positives.

I was expecting to be able to do something like;-

ST_Within(my_test_geom,table.bounding_box)
AND
ST_Within(my_test_geom,table.the_full_geom)

or

my_test_geom && table.bounding_box AND  ST_Within(my_test_geom,table.the_full_geom)

But it still does a full scan on table.the_full_geom, which takes ages at sets the CPU to 100% while it's doing it.

I've tried to do sub-selects in the hope that by sub-selecting the entries where the bounding box contains the point it would then  compare the full geometries for only those records, but no luck.

If it comes to it I'm going to have to write a stored procedure that gets the handfull of records which pass the test using the bounding box and then test each of them individually against the full geom to get just the ones that pass.

Is there a better way to do this?

Ta

John Small


_______________________________________________
postgis-users mailing list
postgis-users@...
http://postgis.refractions.net/mailman/listinfo/postgis-users

Re: Speeding up queries with simplified geometries.

by Paul Ramsey-4 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Something is not right here. If you look into the definition of
ST_Within, you'll see this:

SELECT $1 && $2 AND _ST_Within($1,$2)

So using ST_Within includes an index clause. You say you've indexed
your table, what command did you use to do that? Did you use a GiST
index on the geometry column? You shouldn't need to break geometries
out into boxes, since that's what the index is already doing. (Using
smaller delegates does get your around tuple-toasting overhead,
though, which for very large objects can be considerable.)

How big are your geometries?
select npoints(the_geom) from thetable order by npoints limit 25;
What is your PostGIS version?
select postgis_full_version();

P.

(That said, for really really large geometries, you'll find that
breaking them up does make things faster, because the indexing code
memcpy's them out as part of the consistency check,

On Mon, Jul 6, 2009 at 10:06 AM, jds<jds340@...> wrote:

> I've downloaded the Berkley administrative boundaries shape files and now
> I'm trying to formulate queries that will give me the administrative regions
> that contain any given lat/long combination. Obviously it's blindingly slow
> if I just search through the geometries using ST_Within, even though I've
> got it indexed. The problem is related to the number of points in the
> boundaries. So for example Canada, which has over a million points, is slow
> to test against. So I've created another indexed column which holds the
> bounding boxes for every geometry and doing ST_Within using that geometry is
> very fast, though it does return strange results, for example the lat/lng of
> Toronto airport is inside the bounding box for Alaska! But that's OK because
> at least I've cut the number of geometries to a handful. Then I need to
> search through that handful testing against the full geometries to eliminate
> the false positives.
>
> I was expecting to be able to do something like;-
>
> ST_Within(my_test_geom,table.bounding_box)
> AND
> ST_Within(my_test_geom,table.the_full_geom)
>
> or
>
> my_test_geom && table.bounding_box AND
> ST_Within(my_test_geom,table.the_full_geom)
>
> But it still does a full scan on table.the_full_geom, which takes ages at
> sets the CPU to 100% while it's doing it.
>
> I've tried to do sub-selects in the hope that by sub-selecting the entries
> where the bounding box contains the point it would then  compare the full
> geometries for only those records, but no luck.
>
> If it comes to it I'm going to have to write a stored procedure that gets
> the handfull of records which pass the test using the bounding box and then
> test each of them individually against the full geom to get just the ones
> that pass.
>
> Is there a better way to do this?
>
> Ta
>
> John Small
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users@...
> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
>
_______________________________________________
postgis-users mailing list
postgis-users@...
http://postgis.refractions.net/mailman/listinfo/postgis-users

Parent Message unknown Re: Speeding up queries with simplified geometries.

by JDS-6 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

How big are your geometries?

I get ;-
 select ST_npoints(the_geom) from territory_geometries order by ST_npoints desc limit 25;
 st_npoints
------------
    2065217
    1968994
    1859352
    1625664
    1312561
    1199055
     743194
     619309
     613689
     592206
     567210
     566678
     562565
     562527
     540225
     522073
     510018
     457435
     450007
     420056
     412069
     412069
     394161
     380606
     374984
(25 rows)

As you can see some of the geometries are pretty big.

Version?
 postgis_full_version                              
----------------------------------------------------------------------------------
 POSTGIS="1.3.3" GEOS="3.0.0-CAPI-1.4.1" PROJ="Rel. 4.6.0, 21 Dec 2007" USE_STATS

Index?

CREATE INDEX terr_geom_gist  ON territory_geometries  USING gist  (the_geom);

and the same for the bounding box column

Just selecting using the bounding box column gives this explanation (I'm using the airports table to save typing in lng/lat all the time, and I know where the airport is and which admin regions it should be in)

explain analyse select t1.id from territories t1
join territory_geometries tg1 on (t1.id = tg1.territory_id and t1.territory_type > 2) ,airports a2
where ST_Within(a2.the_geom,tg1.bbox)
and a2.iata_code = 'YYZ';
                                                                                 QUERY PLAN                                                                                
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..83.02 rows=57 width=4) (actual time=0.684..6.917 rows=12 loops=1)
   ->  Nested Loop  (cost=0.00..16.68 rows=57 width=4) (actual time=0.659..6.496 rows=15 loops=1)
         Join Filter: _st_within(a2.the_geom, tg1.bbox)
         ->  Index Scan using airports_iata on airports a2  (cost=0.00..8.27 rows=1 width=100) (actual time=0.039..0.045 rows=1 loops=1)
               Index Cond: ((iata_code)::text = 'YYZ'::text)
         ->  Index Scan using index_territory_geometries_on_bbox on territory_geometries tg1  (cost=0.00..8.40 rows=1 width=456) (actual time=0.592..6.171 rows=15 loops=1)
               Index Cond: (a2.the_geom && tg1.bbox)
               Filter: (a2.the_geom && tg1.bbox)
   ->  Index Scan using territories_pkey on territories t1  (cost=0.00..1.15 rows=1 width=4) (actual time=0.019..0.021 rows=1 loops=15)
         Index Cond: (t1.id = tg1.territory_id)
         Filter: (t1.territory_type > 2)
 Total runtime: 7.070 ms
(12 rows)

Which is pretty fast, but as you can see I'm getting too many matches because some of the bounding boxes overlap, so I need to take the results of this and do a final match with the
real geom column to filter down to just 3 rows. But when I add the matching clause I get this;-

explain analyse select t1.id,t1.name,t1.territory_type from
territories t1 join territory_geometries tg1 on (t1.id = tg1.territory_id and t1.territory_type > 2), airports a2
where a2.iata_code = 'YYZ'
and a2.the_geom && tg1.bbox
and ST_Within(a2.the_geom,tg1.the_geom) ;
                                                                                  QUERY PLAN                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..17.85 rows=1 width=18) (actual time=84964.441..202308.802 rows=3 loops=1)
   ->  Nested Loop  (cost=0.00..16.69 rows=1 width=4) (actual time=84964.365..202308.474 rows=4 loops=1)
         Join Filter: ((a2.the_geom && tg1.the_geom) AND _st_within(a2.the_geom, tg1.the_geom))
         ->  Index Scan using airports_iata on airports a2  (cost=0.00..8.27 rows=1 width=100) (actual time=0.039..0.042 rows=1 loops=1)
               Index Cond: ((iata_code)::text = 'YYZ'::text)
         ->  Index Scan using index_territory_geometries_on_bbox on territory_geometries tg1  (cost=0.00..8.40 rows=1 width=44685) (actual time=0.634..3.817 rows=15 loops=1)
               Index Cond: (a2.the_geom && tg1.bbox)
               Filter: (a2.the_geom && tg1.bbox)
   ->  Index Scan using territories_pkey on territories t1  (cost=0.00..1.15 rows=1 width=18) (actual time=0.071..0.073 rows=1 loops=4)
         Index Cond: (t1.id = tg1.territory_id)
         Filter: (t1.territory_type > 2)
 Total runtime: 202308.905 ms
(12 rows)

I don't know why explain analyse reports 12 rows, when I actually run the thing only 3 rows are returned. Anyway, as you can see the query planner decides to do the match on the full geometry column tg1.the_geom, before the match on the reduced geometry tg1.bbox, which is the wrong way round. Crucially, it's not doing an index scan on the full geometry column even though there's an index on it and I've vacuum analysed it umpteen times. I've also added cached bounding boxes to the geometries in that column but it makes no difference.

Simplifying the query to exclude the filter on the bbox column and the other extraneous table I get

 explain analyse select tg.id from territory_geometries tg, airports a where ST_Within(a.the_geom,tg.the_geom) and a.iata_code = 'YYZ';
                                                                       QUERY PLAN                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..16.69 rows=57 width=4) (actual time=82110.289..209649.492 rows=4 loops=1)
   Join Filter: _st_within(a.the_geom, tg.the_geom)
   ->  Index Scan using airports_iata on airports a  (cost=0.00..8.27 rows=1 width=100) (actual time=32.846..32.852 rows=1 loops=1)
         Index Cond: ((iata_code)::text = 'YYZ'::text)
   ->  Index Scan using terr_geom_gist on territory_geometries tg  (cost=0.00..8.41 rows=1 width=44233) (actual time=998.460..4147.349 rows=15 loops=1)
         Index Cond: (a.the_geom && tg.the_geom)
         Filter: (a.the_geom && tg.the_geom)
 Total runtime: 209657.567 ms

This time it is using the gist index on the full geometry column, but it's still blindingly slow, over 200 seconds as compared with 7ms using the bbox column.

Anyway, I've got a workaround for the moment. I just code my client to do the lookup in two parts, the first using the bbox to get a reduced list, and the second to step through the reduced list to check against the full geometry. Messy but it works and I can move on. Though I would like to know how to get the query planner to do things the other way around so that it select on the bbox column first and then works on the full geometry.

Ta

John Small






_______________________________________________
postgis-users mailing list
postgis-users@...
http://postgis.refractions.net/mailman/listinfo/postgis-users