I'm currently working on a complex sorting problem in Postgres 9.2 You can find the Source Code used in this Question(simplified) here: http://sqlfiddle.com/#!12/9857e/11
I have a Huge (>>20Mio rows) table containing various columns of different types.
CREATE TABLE data_table
(
id bigserial PRIMARY KEY,
column_a character(1),
column_b integer
-- ~100 more columns
);
Lets say i want to sort this table over 2 Columns (ASC). But i don't want to do that with a simply Order By, because later I might need to insert rows in the sorted output and the user probably only wants to see 100 Rows at once (of the sorted output).
To achieve these goals i do the following:
CREATE TABLE meta_table
(
id bigserial PRIMARY KEY,
id_data bigint NOT NULL -- refers to the data_table
);
--Function to get the Column A of the current row
CREATE OR REPLACE FUNCTION get_column_a(bigint)
RETURNS character AS
'SELECT column_a FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Function to get the Column B of the current row
CREATE OR REPLACE FUNCTION get_column_b(bigint)
RETURNS integer AS
'SELECT column_b FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Creating a index on expression:
CREATE INDEX meta_sort_index
ON meta_table
USING btree
(get_column_a(id_data), get_column_b(id_data), id_data);
And then I copy the Id's of the data_table to the meta_table:
INSERT INTO meta_table(id_data) (SELECT id FROM data_table);
Later I can add additional rows to the table with a similar simple insert.
To get the Rows 900000 - 900099 (100 Rows) i can now use:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
ORDER BY 1,2,3 OFFSET 900000 LIMIT 100;
(With an additional INNER JOIN on data_table if I want all the data.)
The Resulting Plan is:
Limit (cost=498956.59..499012.03 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..554396.21 rows=1000000 width=8)
This is a pretty efficient plan (Index Only Scans are new in Postgres 9.2).
But what is if I want to get Rows 20'000'000 - 20'000'099 (100 Rows)? Same Plan, much longer execution time. Well, to improve the Offset Performance (Improving OFFSET performance in PostgreSQL) I can do the following (Let's assume I saved every 100'000th Row away into another table).
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE (get_column_a(id_data), get_column_b(id_data), id_data ) >= (get_column_a(587857), get_column_b(587857), 587857 )
ORDER BY 1,2,3 LIMIT 100;
This runs much faster. The Resulting Plan is:
Limit (cost=0.51..61.13 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.51..193379.65 rows=318954 width=8)
Index Cond: (ROW((get_column_a(id_data)), (get_column_b(id_data)), id_data) >= ROW('Z'::bpchar, 27857, 587857))
So far everything works perfect and postgres does a great job!
Let's assume I want to change the Order of the 2nd Column to DESC.
But then I would have to change my WHERE Clause, because the > Operator compares both Columns ASC. The same query as above (ASC Ordering) could also be written as:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE
(get_column_a(id_data) > get_column_a(587857))
OR (get_column_a(id_data) = get_column_a(587857) AND ((get_column_b(id_data) > get_column_b(587857))
OR ( (get_column_b(id_data) = get_column_b(587857)) AND (id_data >= 587857))))
ORDER BY 1,2,3 LIMIT 100;
Now the Plan Changes and the Query becomes slow:
Limit (cost=0.00..1095.94 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..1117877.41 rows=102002 width=8)
Filter: (((get_column_a(id_data)) > 'Z'::bpchar) OR (((get_column_a(id_data)) = 'Z'::bpchar) AND (((get_column_b(id_data)) > 27857) OR (((get_column_b(id_data)) = 27857) AND (id_data >= 587857)))))
How can I use the efficient older plan with DESC-Ordering?
Do you have any better ideas how to solve the Problem?
(I already tried to declare a own Type with own Operator Classes, but that's too slow)
You need to rethink your approach. Where to begin? This is a clear example, basically of the limits, performance-wise, of the sort of functional approach you are taking to SQL. Functions are largely planner opaque, and you are forcing two different lookups on data_table
for every row retrieved because the stored procedure's plans cannot be folded together.
Now, far worse, you are indexing one table based on data in another. This might work for append-only workloads (inserts allowed but no updates) but it will not work if data_table can ever have updates applied. If the data in data_table ever changes, you will have the index return wrong results.
In these cases, you are almost always better off writing in the join as explicit, and letting the planner figure out the best way to retrieve the data.
Now your problem is that your index becomes a lot less useful (and a lot more intensive disk I/O-wise) when you change the order of your second column. On the other hand, if you had two different indexes on the data_table and had an explicit join, PostgreSQL could more easily handle this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With