 

Order by query on timestamp column is very slow

Tags:

postgresql

The query below takes around 15 seconds to return data, despite there being an index on the date column and id being the primary key.

select id from my_table order by insert_date offset 0 limit 1

The EXPLAIN ANALYZE output is below:

"Limit  (cost=1766417.72..1766417.72 rows=1 width=12) (actual time=32479.440..32479.441 rows=1 loops=1)"
"  ->  Sort  (cost=1766417.72..1797117.34 rows=12279848 width=12) (actual time=32479.437..32479.437 rows=1 loops=1)"
"        Sort Key: insert_date"
"        Sort Method: top-N heapsort  Memory: 25kB"
"        ->  Seq Scan on my_table  (cost=0.00..1705018.48 rows=12279848 width=12) (actual time=0.006..21338.401 rows=12108916 loops=1)"
"Total runtime: 32479.476 ms"

My table has a few other columns, but the insert_date column is defined as:

insert_date timestamp without time zone NOT NULL DEFAULT now(),

I have an index on that date column:

CREATE INDEX my_table_insert_date_indx
  ON my_table
  USING btree
  (insert_date)
TABLESPACE somexyz_idx_ts;

A few values from the postgresql.conf file:

shared_buffers = more than 1GB    ## just for an example
temp_buffers = more than 1GB
work_mem = more than 1GB
maintenance_work_mem = more than 1GB
dynamic_shared_memory_type = posix
default_statistics_target = 10000
autovacuum = on
random_page_cost = 2.0
cpu_index_tuple_cost = 0.0005

I am using PostgreSQL 9.3 right now.

UPDATE:

I just ran the following query a while ago:

select insert_date, count(*) from my_table group by insert_date

and the top few rows of the result are:

"2015-04-02 00:00:00";3718104
"2015-04-03 00:00:00";6410253
"2015-04-04 00:00:00";538247
"2015-04-05 00:00:00";1228877
"2015-04-06 00:00:00";131248

I have around 12 million records in that table, and the counts above add up to nearly that total.

I'm not sure, but could it be a problem that the index was created on a column with tons of duplicate values? If so, is there any way around it?
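
In case it helps, one way to see how the planner views that duplicate-heavy column is to look at pg_stats (just a sketch using the table and column names above; the output depends on the installation):

-- how many distinct values and how well-ordered the planner thinks the column is
select n_distinct, correlation
from pg_stats
where tablename = 'my_table'
  and attname = 'insert_date';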

Sabuj Hassan asked Oct 20 '22

1 Answer

Your query runs about 160000 times faster on my machine using both PostgreSQL 9.3 and 9.4. My machine is nothing special.

-- From PostgreSQL 9.4; 9.3 is similar.
show shared_buffers; -- 128MB
show temp_buffers; -- 8MB
show work_mem; -- 4MB
show maintenance_work_mem; -- 64MB
show dynamic_shared_memory_type; -- posix
show default_statistics_target; -- 100
show autovacuum; -- on
show random_page_cost; -- 4
show cpu_index_tuple_cost; -- 0.005

Preparation

Let's build a table. (You should have done this in your question.)

create table my_table (
  id serial primary key,
  insert_date timestamp not null
);

-- Round numbers of rows.
insert into my_table(insert_date)
select timestamp '2015-04-02 00:00:00'
from generate_series(1, 3000000) n;

insert into my_table(insert_date)
select timestamp '2015-04-03 00:00:00'
from generate_series(1, 6000000) n;

insert into my_table(insert_date)
select timestamp '2015-04-04 00:00:00'
from generate_series(1, 500000) n;

insert into my_table(insert_date)
select timestamp '2015-04-05 00:00:00'
from generate_series(1, 1200000) n;

insert into my_table(insert_date)
select timestamp '2015-04-06 00:00:00'
from generate_series(1, 131000) n;

Create an index, and update statistics.

create index on my_table (insert_date);
analyze my_table;

PostgreSQL 9.4

Now, what kind of execution plan do we get from your first query?

explain analyze 
select id from my_table order by insert_date offset 0 limit 1;
"Limit  (cost=0.43..0.48 rows=1 width=12) (actual time=0.014..0.014 rows=1 loops=1)"
"  ->  Index Scan using my_table_insert_date_idx on my_table  (cost=0.43..540656.27 rows=11200977 width=12) (actual time=0.012..0.012 rows=1 loops=1)"
"Planning time: 0.195 ms"
"Execution time: 0.032 ms"

PostgreSQL 9.3

explain analyze 
select id from my_table order by insert_date offset 0 limit 1;
"Limit  (cost=0.43..0.47 rows=1 width=12) (actual time=0.058..0.059 rows=1 loops=1)"
"  ->  Index Scan using my_table_insert_date_idx on my_table  (cost=0.43..339814.36 rows=10830995 width=12) (actual time=0.057..0.057 rows=1 loops=1)"
"Total runtime: 0.098 ms"

Your query

select id from my_table order by insert_date offset 0 limit 1;

is indeterminate. There are 3 million rows that have the lowest insert_date (the date that will appear first, according to the ORDER BY clause). You pick one of those 3 million. PostgreSQL doesn't guarantee you'll get the same id every time.
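
If you actually need a deterministic row, one sketch (the composite index below is my own addition, not something from your setup) is to break the tie on id and give the planner an index that covers both columns:

-- hypothetical composite index so the tie-broken ORDER BY can still walk an index
create index my_table_insert_date_id_idx on my_table (insert_date, id);

explain analyze
select id from my_table order by insert_date, id offset 0 limit 1;

With a composite index in place, the LIMIT 1 can still come straight off the index instead of sorting the ties.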

If you don't care which of the 3 million ids it returns, you can express the query differently (one sketch follows). But I don't think expressing it differently will give you a 160k-times speedup.
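
For what it's worth, one alternative phrasing (again, just a sketch) lets the existing index answer min(insert_date) first and then picks any matching id:

-- sketch: "give me any id from the earliest insert_date"
select id
from my_table
where insert_date = (select min(insert_date) from my_table)
limit 1;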

Some of the settings you included can be changed for a particular query. So you can do something like this.

-- Don't commit or rollback . . . 
begin transaction;
set local work_mem = '8 MB';

explain analyze 
select id from my_table order by insert_date offset 0 limit 1;
-- Displays the result. 

Commit or rollback manually.

commit;

Your work_mem setting returns to the value set on server start.

show work_mem; -- 4MB
Mike Sherrill 'Cat Recall' answered Oct 21 '22