
PostgreSQL: speed up SELECT query in table with millions of rows

I have a table with > 4.5 million rows and my SELECT query is far too slow for my needs.

The table is created with:

CREATE TABLE all_legs (
    carrier TEXT,
    dep_hub TEXT,
    arr_hub TEXT,
    dep_dt TIMESTAMP WITH TIME ZONE,
    arr_dt TIMESTAMP WITH TIME ZONE,
    price_ct INTEGER,
    ... 5 more cols ...,
    PRIMARY KEY (carrier, dep_hub, arr_hub, dep_dt, arr_dt, ...3 other cols...)
)

When I SELECT all rows for a certain date, the query is too slow: it takes between 12 and 20 seconds. My aim is for it to take 1 second at most. I expect the query to return between 0.1% and 1% of the rows in the table.

The query is quite simple:

SELECT * FROM all_legs WHERE dep_dt::date = '2017-08-15' ORDER BY price_ct ASC

EXPLAIN ANALYZE returns:

Sort  (cost=197154.69..197212.14 rows=22982 width=696) (actual time=14857.300..14890.565 rows=31074 loops=1)
  Sort Key: price_ct
  Sort Method: external merge  Disk: 5256kB
  ->  Seq Scan on all_legs  (cost=0.00..188419.85 rows=22982 width=696) (actual time=196.738..14581.143 rows=31074 loops=1)
        Filter: ((dep_dt)::date = '2017-08-15'::date)
        Rows Removed by Filter: 4565249
Planning time: 0.572 ms
Execution time: 14908.274 ms

Note: I learned yesterday about this command, so I am still not able to fully understand all that is returned.

I have tried using index-only scans, as suggested here, by running the command CREATE index idx_all_legs on all_legs(dep_dt); but I did not notice any difference in running time. I also tried creating the index on all columns, as I want all columns returned.

Another thought was sorting all rows by dep_dt, so then the search of all rows fulfilling the condition should be much faster as they would not be scattered. Unfortunately, I don't know how to implement this.

Is there a way to make it as fast as I am aiming for?


Solution

As suggested in Laurenz's answer, adding an index with CREATE INDEX IF NOT EXISTS idx_dep_dt_price ON all_legs(dep_dt, price_ct); and changing the condition in the SELECT to WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00' has reduced the running time to 1/4 of what it was. That is a very good improvement, but it still means running times between 2 and 6 seconds.

Any additional ideas to reduce the running time even further would be appreciated.

J0ANMM asked Jul 26 '17 07:07


2 Answers

The index won't help.

Two solutions:

  1. Change the query to:

    WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00'
    

    Then the index can be used.

  2. Create an index on an expression:

    CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date));
    

    (or a different time zone) and change the query to

    WHERE (dep_dt AT TIME ZONE 'UTC')::date = '2017-08-15'
    

    The AT TIME ZONE is necessary because otherwise the result of the cast would depend on your current TimeZone setting.

The first solution is simpler, but the second has the advantage that you can add price_ct to the index like this:

CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date), price_ct);

Then you don't need a sort any more, and your query will be as fast as it can theoretically get.
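
Put together, the second approach might look like the sketch below. It assumes the departure day is defined in UTC, and the index name idx_all_legs_day_price is made up for illustration. Whether the planner really skips the sort should be checked with EXPLAIN on the actual data.

```sql
-- Expression index: entries are ordered by UTC calendar day first, then by price.
-- The index name is hypothetical.
CREATE INDEX IF NOT EXISTS idx_all_legs_day_price
    ON all_legs (((dep_dt AT TIME ZONE 'UTC')::date), price_ct);

-- Refresh statistics so the planner knows about the indexed expression.
ANALYZE all_legs;

-- The WHERE expression has to match the indexed expression exactly.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM all_legs
WHERE (dep_dt AT TIME ZONE 'UTC')::date = '2017-08-15'
ORDER BY price_ct ASC;
```

If the resulting plan shows an index scan on the new index with no Sort node above it, the rows are being read in price order straight from the index.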

Laurenz Albe answered Sep 22 '22 16:09


The index does not help because you use

WHERE dep_dt::date=constant

This seems fine to a beginner, but to the database, it looks like:

WHERE convert_timestamp_to_date(dep_dt)=constant

Here convert_timestamp_to_date() is an arbitrary function (I just came up with the name, don't look it up in the docs). In order to use the index on dep_dt, the DB would have to invert the function convert_timestamp_to_date into something like convert_date_to_timestamp_range (because a date corresponds to a range of timestamps, not just one timestamp), and then rewrite the WHERE as Laurenz did.

Since there are many such functions, the database developers didn't bother to maintain a huge table of how to invert them. Also it would only help for special cases. For example, if you specified a date range in your WHERE instead of a "=constant" then it would be yet another special case. It's your job to handle this ;)
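
To see the difference described above, the two forms of the predicate can be compared with EXPLAIN. This is only a sketch: it assumes a plain index on dep_dt already exists and that days are delimited in the session's time zone, which is also what the ::date cast uses.

```sql
-- Function applied to the column: a plain index on dep_dt cannot be used.
EXPLAIN
SELECT * FROM all_legs
WHERE dep_dt::date = '2017-08-15';

-- Hand-inverted form: the bare column is compared against a half-open range,
-- so an index on dep_dt becomes usable.
EXPLAIN
SELECT * FROM all_legs
WHERE dep_dt >= '2017-08-15'::timestamptz
  AND dep_dt <  '2017-08-15'::timestamptz + interval '1 day';
```

The first plan is expected to show a sequential scan with a filter, as in the question, while the second gives the planner the option of an index or bitmap scan on dep_dt.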

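Also, an index on (dep_dt, price_ct) won't speed up the sort: the first column is a timestamp, so within a single day the index entries are ordered by time of day, not by price. You'd need an index on (dep_dt::date, price_ct) to eliminate the sort (written with the AT TIME ZONE conversion from Laurenz's answer, since dep_dt is a TIMESTAMP WITH TIME ZONE).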

Now, which index to create? This depends...

If you also use timestamp range queries like "WHERE dep_dt BETWEEN ... AND ..." then the index on dep_dt needs to be on the original timestamp type. In this case, creating another index on the same column, but converted to date, would be unnecessary (all indexes have to be updated on writes, so unnecessary indexes slow down inserts/updates). However, if you use the index on (dep_dt::date, price_ct) lots and lots of times and eliminating the sort is really important for you, then it may make sense. It's a tradeoff.
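
One way to inform that tradeoff is to look at how often each index is actually scanned and how much space it takes. The query below is a sketch using PostgreSQL's pg_stat_user_indexes statistics view. The counters only reflect activity since the statistics were last reset, so treat the numbers as a hint rather than proof.

```sql
-- Scan counts and on-disk size for every index on all_legs.
SELECT indexrelname                                 AS index_name,
       idx_scan                                     AS scans,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'all_legs'
ORDER BY idx_scan DESC;
```

An index that stays at zero scans under a realistic workload is only costing write time and disk space.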

bobflux answered Sep 24 '22 16:09