I have a table that will grow by 10M of rows every year.
The table have 10 columns, call them c1, c2, c3, ..., c10.
I will use the WHERE clause, potentially on 8 of them.
To be more specific: every time I will query the table, there will ALWAYS be a WHERE clause on column c10 (it's a date, I could search for equality or range).
The other 7 possible searchable columns, will not follow any schema. I could search for:
... and all other possible combinations.
So, in a WHERE clause, c10 will always be present and the others could be present in any combination (even not present at all).
What indexing strategy could improve performance in this scenario? I think that the right thing to do, is to create a single index for every column. There could be any performance improvements using a multicolumn index?
From what I know, you will gain performance with a multicolumn index on (c1, c2, c3), only for queries that use c1, c2, c3 or c1, c2, or c1, in that order. But like I said, the only thing I can assume in my scenario, is that c10 will always be present in a WHERE clause (it can also be the first clause, if that helps)
It can also be a disadvantage: The rows are returned in the order they are stored by Postgres in memory or on disk — ie, no order in particular. Since sorts can be very expensive, this can tilt the balance in favour of a multi-column index if you want to sort a lot of rows.
A multicolumn index should be created when data satisfying multiple conditions is to be retrieved, such as when a complex-condition retrieval using the AND operator with multiple columns as the key is executed.
In Postgres, a B-Tree index is what you most commonly want Using an index is much faster than a sequential scan because it may only have to read a few pages as opposed to sequentially scanning thousands of them (when you're returning only a few records). If you run a standard CREATE INDEX it creates a B-tree for you.
B-Tree is the default index type for the CREATE INDEX command in PostgreSQL. It is compatible with all data types, and it can be used, for instance, to retrieve NULL values and work with caching. B-Tree is the most common index type, suitable for most cases.
To answer the question of what kind of index should we use, we can create a simple test. first, we create a database, table, and index.
CREATE DATABASE index_test;
CREATE TABLE single_column(a int, b int, c int);
CREATE TABLE multi_column(a int, b int, c int);
CREATE INDEX single_column_a_idx ON single_column (a);
CREATE INDEX single_column_b_idx ON single_column (b);
CREATE INDEX single_column_c_idx ON single_column (c);
CREATE INDEX multi_column_idx ON multi_column (a, b, c);
populate the table with random data.
-- this function will be used for random number generation
CREATE OR REPLACE FUNCTION random_in_range(INTEGER, INTEGER) RETURNS INTEGER AS $$
SELECT floor(($1 + ($2 - $1 + 1) * random()))::INTEGER;
$$ LANGUAGE SQL;
INSERT INTO single_column(a, b, c)
SELECT random_in_range(1, 100),
random_in_range(1, 100),
random_in_range(1, 100)
FROM generate_series(1, 1000000);
INSERT INTO multi_column(a, b, c)
SELECT random_in_range(1, 100),
random_in_range(1, 100),
random_in_range(1, 100)
FROM generate_series(1, 1000000);
run the test.
EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3;
EXPLAIN ANALYZE SELECT * FROM single_column WHERE b < 3;
EXPLAIN ANALYZE SELECT * FROM single_column WHERE c < 3;
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE a < 3;
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE b < 3;
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE c < 3;
EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3 AND b > 10 AND c <= 11;
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE a < 3 AND b > 10 AND c <= 11;
result
index_test=# EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on single_column (cost=3925.39..13926.49 rows=367608 width=12) (actual time=5.802..44.904 rows=20070 loops=1)
Recheck Cond: (a < 3)
Heap Blocks: exact=5269
-> Bitmap Index Scan on single_column_a_idx (cost=0.00..3833.49 rows=367608 width=0) (actual time=4.018..4.019 rows=20070 loops=1)
Index Cond: (a < 3)
Planning Time: 0.325 ms
Execution Time: 46.589 ms
(7 rows)
index_test=# EXPLAIN ANALYZE SELECT * FROM single_column WHERE b < 3;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on single_column (cost=3925.39..13926.49 rows=367608 width=12) (actual time=6.630..26.814 rows=19902 loops=1)
Recheck Cond: (b < 3)
Heap Blocks: exact=5296
-> Bitmap Index Scan on single_column_b_idx (cost=0.00..3833.49 rows=367608 width=0) (actual time=4.852..4.853 rows=19902 loops=1)
Index Cond: (b < 3)
Planning Time: 0.270 ms
Execution Time: 28.762 ms
(7 rows)
index_test=# EXPLAIN ANALYZE SELECT * FROM single_column WHERE c < 3;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on single_column (cost=3925.39..13926.49 rows=367608 width=12) (actual time=5.896..25.304 rows=19946 loops=1)
Recheck Cond: (c < 3)
Heap Blocks: exact=5274
-> Bitmap Index Scan on single_column_c_idx (cost=0.00..3833.49 rows=367608 width=0) (actual time=4.125..4.126 rows=19946 loops=1)
Index Cond: (c < 3)
Planning Time: 0.270 ms
Execution Time: 27.136 ms
(7 rows)
index_test=# EXPLAIN ANALYZE SELECT * FROM multi_column WHERE a < 3;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on multi_column (cost=8569.39..18570.49 rows=367608 width=12) (actual time=7.760..67.173 rows=19938 loops=1)
Recheck Cond: (a < 3)
Heap Blocks: exact=5267
-> Bitmap Index Scan on multi_column_idx (cost=0.00..8477.49 rows=367608 width=0) (actual time=6.008..6.008 rows=19938 loops=1)
Index Cond: (a < 3)
Planning Time: 0.564 ms
Execution Time: 68.630 ms
(7 rows)
index_test=# EXPLAIN ANALYZE SELECT * FROM multi_column WHERE b < 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..13481.03 rows=18667 width=12) (actual time=1.451..135.028 rows=19897 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on multi_column (cost=0.00..10614.33 rows=7778 width=12) (actual time=0.038..61.993 rows=6632 loops=3)
Filter: (b < 3)
Rows Removed by Filter: 326701
Planning Time: 1.123 ms
Execution Time: 136.128 ms
(8 rows)
index_test=# EXPLAIN ANALYZE SELECT * FROM multi_column WHERE c < 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..13627.63 rows=20133 width=12) (actual time=0.957..135.119 rows=19860 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on multi_column (cost=0.00..10614.33 rows=8389 width=12) (actual time=0.035..66.760 rows=6620 loops=3)
Filter: (c < 3)
Rows Removed by Filter: 326713
Planning Time: 0.225 ms
Execution Time: 136.239 ms
(8 rows)
index_test=# EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3 AND b > 10 AND c <= 11;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on single_column (cost=1424.66..5716.83 rows=2110 width=12) (actual time=21.694..26.123 rows=2000 loops=1)
Recheck Cond: ((a < 3) AND (c <= 11))
Filter: (b > 10)
Rows Removed by Filter: 230
Heap Blocks: exact=1833
-> BitmapAnd (cost=1424.66..1424.66 rows=2338 width=0) (actual time=20.981..20.983 rows=0 loops=1)
-> Bitmap Index Scan on single_column_a_idx (cost=0.00..230.43 rows=21067 width=0) (actual time=3.932..3.932 rows=20070 loops=1)
Index Cond: (a < 3)
-> Bitmap Index Scan on single_column_c_idx (cost=0.00..1192.92 rows=111000 width=0) (actual time=16.080..16.080 rows=110276 loops=1)
Index Cond: (c <= 11)
Planning Time: 1.812 ms
Execution Time: 26.742 ms
(12 rows)
index_test=# EXPLAIN ANALYZE SELECT * FROM multi_column WHERE a < 3 AND b > 10 AND c <= 11;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using multi_column_idx on multi_column (cost=0.42..642.38 rows=2071 width=12) (actual time=0.329..2.086 rows=1953 loops=1)
Index Cond: ((a < 3) AND (b > 10) AND (c <= 11))
Heap Fetches: 0
Planning Time: 0.176 ms
Execution Time: 2.165 ms
(5 rows)
single_column
table will always use the index in any condition.EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3; -- index used
EXPLAIN ANALYZE SELECT * FROM single_column WHERE b < 3; -- index used
EXPLAIN ANALYZE SELECT * FROM single_column WHERE c < 3; -- index used
EXPLAIN ANALYZE SELECT * FROM single_column WHERE a < 3 AND b > 10 AND c <= 11; -- index used
multi_column
table, the index will only be used if the column in the query is the same as the first column in the index definition.EXPLAIN ANALYZE SELECT * FROM multi_column WHERE a < 3; -- index used
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE b < 3; -- index not used
EXPLAIN ANALYZE SELECT * FROM multi_column WHERE c < 3; -- index not used
single_column
table can use indexes on multi column WHERE, multi_column
table is faster.multi_column
table can use indexes on single column WHERE, single_column
table is faster.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With