There is table T(user, timestamp,...)
with 100 ml+ records (PostgreSQL 9.1).
A query of the form
SELECT *
FROM T
WHERE user='abcd'
ORDER BY timestamp
LIMIT 1
is using timestamp
index instead of user index when there are ~100000 user records.
Using timestamp index will always give poor results(20+ sec) as it eventually scans all records. Bypassing timestamp
index by changing query to use ORDER BY DATE(timestamp)
will result in query to resort to user index and give results which is less than 100 ms.
Why is postgresql ignoring user
index and is using timestamp
index instead (timestamp index will need to see all records)?
Are there any postgresql config params which can be altered to make the query use username index itself?
Nice question, I have came around this problem a while ago.
You should look into the number of user='abcd'
values in your stats like this:
SELECT attname, null_frac, ag_width, n_distinct,
most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE table_name='T';
My guess is — this value occurs quite often and you'll find it in the most_common_vals
output.
Picking the same element from the most_common_freqs
you'll get the ratio for the value, multiply it by total number of rows (can be obtained from pg_class
) to get number of rows that are estimated to have 'abcd'
value.
Planner assumes all values to have a linear distribution. In reality things are different of course. Also, currently there're no correlated stats (although some work is being done in this direction).
So, let's take user='abcd'
value, having 0.001
ratio (per question) in the corresponding most_common_freqs
entry. This means value will occur every 1000 rows (assuming linear distribution). It appears, that if we'll scan table in any way we'll hit our user='abcd'
in some 1000 rows. Sounds that it should be fast! Planner "thinks" the same and chooses index on timestamp
column.
But it is not. If we'll assume, that your table T
contains logs of user activity, and user='abcd'
was on vacation for the last 3 weeks, then this means we'll have to read quite a lot of rows from the timestamp
index (3 weeks worth of data) before we actually hit the row we want. Well, you as DBA know this, but planner assumes linear distribution.
You'll have to trick the planner to use what you need, as you have more knowledge of your data.
Use OFFSET 0
trick with subquery:
SELECT *
FROM
(
SELECT * FROM T WHERE user='abcd' OFFSET 0
)
ORDER BY timestamp
LIMIT 1;
This trick protects query from inlining, therefore inner part is executed on it's own.
Use CTE
(named subquery):
WITH s AS (
SELECT * FROM T WHERE user='abcd'
)
SELECT *
FROM s
ORDER BY timestamp
LIMIT 1;
Per documentation:
A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries.
Use count(*)
for aggrgated queries:
SELECT min(session_id), count(*) -- instead of simply `min(session_id)`
FROM T
WHERE user='abcd'
ORDER BY timestamp
LIMIT 1;
This is not really applicable, but I wanted to mention it.
And please, consider upgrading to 9.3.
P.S. More on row estiamtes in the docs of course.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With