Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Very slow bitmap heap scan in Postgres

I have the following simple table that contains traffic measurement data:

CREATE TABLE "TrafficData"
(
  "RoadID" character varying NOT NULL,
  "DateID" numeric NOT NULL,
  "ExactDateTime" timestamp NOT NULL,
  "CarsSpeed" numeric NOT NULL,
  "CarsCount" numeric NOT NULL
)
CREATE INDEX "RoadDate_Idx" ON "TrafficData" USING btree ("RoadID", "DateID");

The column RoadID uniquely identifies the road whose data is being recorded, while DateID identifies the day of the year (1..365) of the data - basically a rounded off representation of ExactDateTime.

I have about 100.000.000 rows; there are 1.000 distinct values in the column "RoadID" and 365 distinct values in the column "DateID".

I then run the following query:

SELECT * FROM "TrafficData"
WHERE "RoadID"='Station_1'
AND "DateID">20100610 AND "DateID"<20100618;

This takes up to three mind-boggling seconds to finish, and I cannot for the life of me figure out WHY.

EXPLAIN ANALYZE gives me the following output:

Bitmap Heap Scan on "TrafficData"  (cost=104.84..9743.06 rows=2496 width=47) (actual time=35.112..2162.404 rows=2016 loops=1)
  Recheck Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
  ->  Bitmap Index Scan on "RoadDate_Idx"  (cost=0.00..104.22 rows=2496 width=0) (actual time=1.637..1.637 rows=2016 loops=1)
        Index Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
Total runtime: 2163.985 ms

My specs:

  • Windows 7
  • Postgres 9.0
  • 4GB RAM

I'd greatly appreciate any helpful pointers!

like image 369
TroutKing Avatar asked Oct 24 '10 16:10

TroutKing


2 Answers

The slow part is obviosly fetching the data from the tables, since the index access seems to be very fast. You might either optimize your RAM usage parameters (see http://wiki.postgresql.org/wiki/Performance_Optimization and http://www.varlena.com/GeneralBits/Tidbits/perf.html), or optimize the layout of the data in the table by issuing a CLUSTER command (see http://www.postgresql.org/docs/8.3/static/sql-cluster.html).

CLUSTER "TrafficData" USING "RoadDate_Idx";

should do it.

like image 160
Daniel Avatar answered Sep 28 '22 01:09

Daniel


Adding to Daniel's answer, the cluster operation is a one off process that rearranged the data on disk. The intent is to get your 2000 results rows from fewer disk blocks.

As this is dummy data, being used to find out how you can quickly query it, I'd recommend reloading it, in a pattern closer to how it will be loaded as it is generated. I imagine that the data is generated one day at a time, which will effectively result in strong correlation between DateID and the location on disk. If that is the case, then I'd either cluster by DateID, or split your test data into 365 separate loads, and reload it.

Without that, and having randomly generated data, you're most likely having to perform over 2000 seeks of your disk head.

I'd also check that anything else you're running on Windows 7 isn't adding time to those reads that you don't need, such as ensuring that the blocks read do not contain virus signatures, or concurrently performing an automatically scheduled disk defragmentation (resulting in the disk head hardly ever being anywhere close to where it was last time a database block was read).

like image 31
Stephen Denne Avatar answered Sep 28 '22 01:09

Stephen Denne