Billions rows in PostgreSql: partition or not to partition?

What I have:

  • A simple server: one Xeon with 8 logical cores, 16 GB RAM, and an mdadm RAID 1 of two 7200 rpm drives.
  • PostgreSql
  • A lot of data to work with: up to 30 million rows are imported per day.
  • Time: complex queries may run for up to an hour.

Simplified schema of the table that will be very big:

id          | integer | not null default nextval('table_id_seq'::regclass)
url_id      | integer | not null
domain_id   | integer | not null
position    | integer | not null

The problem with the schema above is that I don't know exactly how to partition it. Data from all periods is going to be used (NO queries will have date filters).

I thought about partitioning on the "domain_id" field, but the problem is that it is hard to predict how many rows each partition would have (see the hash-partitioning sketch below).
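
For illustration, here is a minimal hash-partitioning sketch. It assumes a modern PostgreSQL (declarative partitioning needs version 10, hash partitioning version 11; in 2012 this would have required inheritance-based partitioning instead), and the table and partition names are invented. Hash partitioning spreads "domain_id" values across partitions evenly, so per-domain row counts don't need to be predicted:

CREATE TABLE positions (
    id        serial  NOT NULL,
    url_id    integer NOT NULL,
    domain_id integer NOT NULL,
    position  integer NOT NULL
) PARTITION BY HASH (domain_id);

-- rows are routed to a partition by hash(domain_id) modulo 8
CREATE TABLE positions_p0 PARTITION OF positions
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);
-- ...repeat for REMAINDER 1 through 7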

My main question is:

Does it make sense to partition the data if I don't use partition pruning and I am not going to delete old data?

What would the pros and cons be?

How will my import speed degrade if I don't do partitioning?
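
As background for the import-speed question: the fastest built-in bulk-load path in PostgreSQL is COPY, with or without partitioning. A minimal sketch, assuming a CSV staging file and the table name used in the sketch above:

-- the file path and table name are placeholders
COPY positions (url_id, domain_id, position)
FROM '/path/to/daily_batch.csv' WITH (FORMAT csv);
-- from a client without server filesystem access, use psql's \copy instead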

Another question related to normalization:

Should the "url" field be extracted into a separate table?

Pros of normalization

  • The table will have rows with an average size of only 20-30 bytes.
  • Joins on the "url_id" field should be much faster than joins on a "url" field.

Pros of denormalization

  • Data can be imported much, much faster, as I don't have to look up the "url" table before each insert (but see the get-or-create sketch below).
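
For reference, on PostgreSQL 9.5 and later the per-insert lookup can be folded into a single get-or-create statement. A sketch, assuming a "urls" table with a unique constraint on its "url" column:

-- returns the id whether the url is new or already present;
-- the no-op DO UPDATE is what makes RETURNING yield a row for existing urls
INSERT INTO urls (url)
VALUES ('http://example.com/page')
ON CONFLICT (url) DO UPDATE SET url = EXCLUDED.url
RETURNING id;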

Can anybody give me any advice? Thanks!

asked May 03 '12 by Oleg Golovanov



1 Answer

Partitioning is most useful if you are going to either have selection criteria in most queries which allow the planner to skip access to most of the partitions most of the time, or if you want to periodically purge all rows that are assigned to a partition, or both. (Dropping a table is a very fast way to delete a large number of rows!) I have heard of people hitting a threshold where partitioning helped keep indexes shallower and thereby boosted performance; but really that gets back to the first point, because you effectively move the first level of the index tree to another place -- it still has to happen.
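
To make the drop-a-table point concrete: purging a partition is a metadata operation, not a row-by-row delete. A sketch with assumed names, using modern declarative-partitioning syntax:

-- removes every row in that partition almost instantly by unlinking its files
DROP TABLE positions_2012_04;
-- contrast with a row-level purge, which must visit every matching row:
-- DELETE FROM positions WHERE ...;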

On the face of it, it doesn't sound like partitioning will help.

Normalization, on the other hand, may improve performance more than you expect; by keeping all those rows narrower, you can get more of them into each page, reducing overall disk access. I would do proper third-normal-form (3NF) normalization, and only deviate from that based on evidence that it would help. If you see a performance problem while you still have disk space for a second copy of the data, try creating a denormalized table and see how its performance compares to the normalized version.
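
A sketch of the normalized split described above (table and column names are assumed): each distinct URL is stored once, and the big table carries only a narrow 4-byte "url_id", which is what keeps its rows in the 20-30 byte range:

CREATE TABLE urls (
    id  serial PRIMARY KEY,
    url text   NOT NULL UNIQUE
);

-- the big table then references it with an integer column:
-- url_id integer NOT NULL REFERENCES urls (id)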

answered Sep 29 '22 by kgrittn