Choosing the right partitioning rule

I am setting up a new PostgreSQL 9 database that will contain millions (or maybe billions) of rows, so I decided to partition the data using PostgreSQL inheritance.

I created a master table like this (simplified for the example):

CREATE TABLE mytable
(
  user_id integer,
  year integer,
  CONSTRAINT pk_mytable PRIMARY KEY (user_id, year)
);

And 10 partition tables:

CREATE TABLE mytable_0 () INHERITS (mytable);
CREATE TABLE mytable_1 () INHERITS (mytable);
...
CREATE TABLE mytable_9 () INHERITS (mytable);

I know that rows will always be accessed from the application using a unique user_id condition, so I would like to spread the data "quite" equally over the 10 tables using a rule based on user_id.

To tune queries over the master table, my first idea was to use a modulus check constraint:

ALTER TABLE mytable_0 ADD CONSTRAINT mytable_user_id_check CHECK (user_id % 10 = 0);
ALTER TABLE mytable_1 ADD CONSTRAINT mytable_user_id_check CHECK (user_id % 10 = 1);
...

The problem is that when I query the master table mytable with a condition on user_id, the PostgreSQL planner checks all the tables and does not benefit from the check constraints:

EXPLAIN SELECT * FROM mytable WHERE user_id = 12345;

"Result  (cost=0.00..152.69 rows=64 width=36)"
"  ->  Append  (cost=0.00..152.69 rows=64 width=36)"
"        ->  Seq Scan on mytable  (cost=0.00..25.38 rows=6 width=36)"
"              Filter: (user_id = 12345)"
"        ->  Seq Scan on mytable_0 mytable  (cost=0.00..1.29 rows=1 width=36)"
"              Filter: (user_id = 12345)"
"        ->  Seq Scan on mytable_1 mytable  (cost=0.00..1.52 rows=1 width=36)"
"              Filter: (user_id = 12345)"
...
"        ->  Seq Scan on mytable_9 mytable  (cost=0.00..1.52 rows=1 width=36)"
"              Filter: (user_id = 12345)"

Whereas if I use a classic CHECK constraint like this (and a data distribution that matches the rule):

ALTER TABLE mytable_0 ADD CONSTRAINT mytable_user_id_check CHECK (user_id BETWEEN 1 AND 10000);
ALTER TABLE mytable_1 ADD CONSTRAINT mytable_user_id_check CHECK (user_id BETWEEN 10001 AND 20000);
...

it will scan only the tables that match the condition (mytable and mytable_1 in this example):

"Result  (cost=0.00..152.69 rows=64 width=36)"
"  ->  Append  (cost=0.00..152.69 rows=64 width=36)"
"        ->  Seq Scan on mytable  (cost=0.00..25.38 rows=6 width=36)"
"              Filter: (user_id = 12345)"
"        ->  Seq Scan on mytable_1 mytable  (cost=0.00..1.52 rows=1 width=36)"
"              Filter: (user_id = 12345)"

But such check constraints are difficult to maintain, because the range of users populated in the tables will change over the years: thousands at first, maybe millions or more in the near future...

What rule could I use to partition my data equally over the 10 tables while still benefiting from a check constraint, so that a SELECT on the master table scans only the right partition?

Thanks,

Nico

Nicolas Payart


2 Answers

The limitation is with the planner rather than the partitioning itself. It's covered in the manual in some detail:

http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

There are two things you mention, though, that need to be considered.

Firstly, you say all access will be through the primary key. This means you will gain no performance benefit from partitioning (at least not in normal usage). The index on each partition will be smaller, but PG still needs to pick which partition to check first. Where you will gain is if you need to reindex or similar - you can reindex each partition separately.
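
A minimal sketch of that kind of per-partition maintenance, reusing the question's table names:

REINDEX TABLE mytable_0;   -- rebuild only this partition's indexes
VACUUM ANALYZE mytable_0;  -- refresh statistics for this partition alone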

Secondly, you say you might have anything from thousands to billions of rows. This leads me to two conclusions:

  1. Perhaps leave the decision until later - wait until you actually need to partition.
  2. You are unlikely to want exactly 10 partitions whether you end up with two thousand rows or two billion.

If you are going to partition, do it by range - say 100,000 or 1 million rows per partition. Add a cron job that checks the maximum ID in use and creates a new partition when required (once a day, perhaps).
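
A minimal sketch of what such a job could create (the partition name and range boundaries are illustrative, not taken from the question):

CREATE TABLE mytable_p3 (
  CONSTRAINT mytable_p3_user_id_check CHECK (user_id BETWEEN 2000001 AND 3000000)
) INHERITS (mytable);

-- Child tables do not inherit the parent's primary key or indexes,
-- so each new partition needs its own.
ALTER TABLE mytable_p3
  ADD CONSTRAINT pk_mytable_p3 PRIMARY KEY (user_id, year);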

Personally, though, I'd leave it until I needed it. Maybe have a single partition as a catch-all if you think it's more likely than not you'll need them later.

Richard Huxton


The WHERE clause needs to be on the same expression as the CHECK constraint, i.e. the query planner won't realize that user_id = 12345 allows the conclusion that user_id % 10 = 5. Try

EXPLAIN SELECT * FROM mytable WHERE user_id = 12345 AND user_id % 10 = 5;
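
If that produces the pruned plan, the application could always append the redundant modulus term itself. A sketch, assuming the planner folds the constant expression before checking the constraints (12345 is just the question's example value):

SELECT *
FROM mytable
WHERE user_id = 12345
  AND user_id % 10 = 12345 % 10;  -- folds to 5, so only mytable_5 (and the master) is scanned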

That said, I would like to second Richard Huxton's answer in that you might want to postpone partitioning until you have more information on the size of the data set, the idea being to avoid premature optimization. Postgres can be very fast on rather large tables; it will take you quite far without partitioning.

Hanno Fietz