I am setting up a new PostgreSQL 9 database that will contain millions (or maybe billions) of rows, so I decided to partition the data using PostgreSQL inheritance.
I created a master table like this (simplified for the example):
CREATE TABLE mytable
(
user_id integer,
year integer,
CONSTRAINT pk_mytable PRIMARY KEY (user_id, year)
);
And 10 partition tables:
CREATE TABLE mytable_0 () INHERITS (mytable);
CREATE TABLE mytable_1 () INHERITS (mytable);
...
CREATE TABLE mytable_9 () INHERITS (mytable);
I know that rows will always be accessed from the application with a condition on a single user_id. So I would like to spread the data roughly equally over the 10 tables using a rule based on user_id.
To tune queries over the master table, my first idea was to use a modulus check constraint:
ALTER TABLE mytable_0 ADD CONSTRAINT mytable_user_id_check CHECK (user_id % 10 = 0);
ALTER TABLE mytable_1 ADD CONSTRAINT mytable_user_id_check CHECK (user_id % 10 = 1);
...
The problem is that when I query the master table "mytable" with a condition on user_id, the PostgreSQL planner checks all the tables and does not benefit from the check constraints:
EXPLAIN SELECT * FROM mytable WHERE user_id = 12345;
"Result (cost=0.00..152.69 rows=64 width=36)"
" -> Append (cost=0.00..152.69 rows=64 width=36)"
" -> Seq Scan on mytable (cost=0.00..25.38 rows=6 width=36)"
" Filter: (user_id = 12345)"
" -> Seq Scan on mytable_0 mytable (cost=0.00..1.29 rows=1 width=36)"
" Filter: (user_id = 12345)"
" -> Seq Scan on mytable_1 mytable (cost=0.00..1.52 rows=1 width=36)"
" Filter: (user_id = 12345)"
...
" -> Seq Scan on mytable_9 mytable (cost=0.00..1.52 rows=1 width=36)"
" Filter: (user_id = 12345)"
Whereas if I use a classic range CHECK constraint like this (and a data distribution that matches that rule):
ALTER TABLE mytable_0 ADD CONSTRAINT mytable_user_id_check CHECK (user_id BETWEEN 1 AND 10000);
ALTER TABLE mytable_1 ADD CONSTRAINT mytable_user_id_check CHECK (user_id BETWEEN 10001 AND 20000);
...
it scans only the tables that match the condition (mytable and mytable_1 in this example):
"Result (cost=0.00..152.69 rows=64 width=36)"
" -> Append (cost=0.00..152.69 rows=64 width=36)"
" -> Seq Scan on mytable (cost=0.00..25.38 rows=6 width=36)"
" Filter: (user_id = 12345)"
" -> Seq Scan on mytable_1 mytable (cost=0.00..1.52 rows=1 width=36)"
" Filter: (user_id = 12345)"
But such check constraints are difficult to maintain, because the range of user ids populated in the tables will change over the years: thousands at first, maybe millions or more in the near future...
What rule could I use to spread my data equally over the 10 tables, and still benefit from check constraints, so that a SELECT on the master table scans only the right table?
Thanks,
Nico
The limitation is with the planner rather than the partitioning itself. It's covered in the manual in some detail:
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
There are two things you mention, though, that need to be considered.
Firstly, you say all access will be through the primary key. This means you will gain no performance benefits from partitioning (at least not in normal usage). The index on each partition will be smaller, but PG needs to pick which partition to check first. Where you will gain is if you need to reindex or similar - you can reindex each partition separately.
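For example, with the naming from your question, each child can be rebuilt on its own (a trivial illustration; mytable_3 is just one of the ten children):
-- Rebuild the indexes of a single child table; the other partitions are untouched.
REINDEX TABLE mytable_3;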
Secondly, you say you might have anything from thousands to billions of rows. This leads me to two conclusions:
If you are going to partition, do it by range - say 100,000 rows or 1 million per partition. Add a cron-job to check the maximum ID used and create a new partition if required (once a day perhaps).
Personally, though, I'd leave it until I needed it. Maybe have a single partition as a catch-all if you think it's more likely than not you'll need them later.
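If you do go the range route, the daily job could call a maintenance function along these lines. This is only a rough sketch under assumptions of mine: 1 million user_ids per range, children named mytable_pN, and the same primary key as the parent; adjust to taste.
-- Sketch of a function a daily cron job could call, e.g. via
--   psql -c "SELECT create_partition_if_needed();"
-- It looks at the highest user_id and creates the covering range partition
-- if it does not exist yet.
CREATE OR REPLACE FUNCTION create_partition_if_needed() RETURNS void AS $$
DECLARE
    max_id  integer;
    part_no integer;
    lo      integer;
    hi      integer;
BEGIN
    SELECT COALESCE(MAX(user_id), 0) INTO max_id FROM mytable;
    part_no := max_id / 1000000;   -- range to make sure exists (one ahead at exact boundaries)
    lo := part_no * 1000000 + 1;
    hi := (part_no + 1) * 1000000;
    -- Only create the child if it is missing (ignores schemas; good enough for a sketch).
    IF NOT EXISTS (SELECT 1 FROM pg_class WHERE relname = 'mytable_p' || part_no) THEN
        EXECUTE format(
            'CREATE TABLE mytable_p%s (
                 CONSTRAINT pk_mytable_p%s PRIMARY KEY (user_id, year),
                 CONSTRAINT mytable_p%s_user_id_check CHECK (user_id BETWEEN %s AND %s)
             ) INHERITS (mytable)',
            part_no, part_no, part_no, lo, hi);
    END IF;
END;
$$ LANGUAGE plpgsql;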
The WHERE needs to be on the same expression as the CHECK, i.e., the query planner won't realize that user_id = 12345 allows the conclusion that user_id % 10 = 5. Try:
EXPLAIN SELECT * FROM mytable WHERE user_id = 12345 AND user_id % 10 = 5;
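In other words, the application has to add the redundant modulus predicate itself, computed from the same user_id value it is already filtering on. Roughly (the :uid placeholder is just whatever literal value the application substitutes; the name is mine):
-- Both predicates use the same value, so the second one filters nothing out,
-- but it is written on the same expression as the CHECK constraints, which
-- lets the planner exclude the children whose constraint contradicts it.
SELECT *
FROM mytable
WHERE user_id = :uid
  AND user_id % 10 = :uid % 10;
As far as I know, the value has to be visible to the planner (a literal, or something it can fold to a constant) for the exclusion to happen.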
That said, I would like to second Richard Huxton's answer in that you might want to postpone partitioning until you have more information on the size of the data set, the idea being to avoid premature optimization. Postgres can be very fast on rather large tables; it will take you quite far without partitioning.