 

Why does my postgres table get much bigger under update?

Tags:

postgresql

I have a table clustered on two columns (point of sale and product ID). The only index is on those two columns, and it is the index the table is clustered on.

On a weekly basis, I update other (non-indexed) columns in the table. When I do that, the size of the table and its relations grows by roughly a factor of five. I then CLUSTER the table, and the size reverts to its pre-update value.

This seems strange to me. If I were updating the indexed columns, I'd expect some bloat that I'd need to deal with by vacuuming, but since the indexed columns are not modified by any of the updates, I don't understand why updating the table would lead to an increase in size.

Is this working as expected, or does this point to a problem in my configuration? Is there a way to stop this?

[Postgres 9.1 on Windows 7]

asked Jun 02 '14 by JamesF



1 Answer

Even when no indexed columns change, PostgreSQL still has to do an MVCC update: it writes a new row version ("tuple"), and a later VACUUM discards the old one. Otherwise it couldn't roll back a transaction that fails midway through, or recover from a crash. (PostgreSQL has no undo log; it keeps old row versions in the heap instead.)
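As a sketch of how to observe this (the table and column names here are hypothetical), compare the relation size before and after a bulk update that touches no indexed column:

```sql
-- Hypothetical table 'sales', indexed on (pos_id, product_id),
-- with a non-indexed column 'qty'.
SELECT pg_size_pretty(pg_total_relation_size('sales'));

UPDATE sales SET qty = qty + 1;   -- changes no indexed column

-- Old row versions stay in the heap until VACUUM, so the size still grows:
SELECT pg_size_pretty(pg_total_relation_size('sales'));
```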

A HOT (heap-only tuple) update can only happen when there is enough free space on the row's current page; otherwise the new version has to be written to a different page, which also means creating new index entries. After a CLUSTER the pages are packed full (fillfactor 100 by default), so even though you aren't updating indexed columns, PostgreSQL has to write the new row versions to fresh pages appended at the end of the table: there is simply nowhere to put them on the existing pages.
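You can check what fraction of your updates were HOT from the cumulative counters in pg_stat_user_tables (the table name is illustrative):

```sql
-- n_tup_hot_upd close to n_tup_upd means most updates stayed on-page.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'sales';
```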

I'd usually expect only a doubling in size, but if you run several update passes without VACUUM catching up in between, further growth is expected. Do all your updates in one pass, or VACUUM between passes.
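For example (hypothetical names again), splitting the weekly update into batches with a VACUUM in between lets later batches reuse the space freed by earlier ones instead of extending the table:

```sql
UPDATE sales SET qty = qty + 1 WHERE pos_id BETWEEN 1 AND 1000;
VACUUM sales;   -- reclaims the dead row versions from the first batch
UPDATE sales SET qty = qty + 1 WHERE pos_id BETWEEN 1001 AND 2000;
VACUUM sales;
```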

To make the updates faster at the cost of some disk space, use ALTER TABLE to set a FILLFACTOR below 100 on your table before you CLUSTER it. I suggest 45: enough room for one new version of each row, plus a little wiggle room. That makes the table roughly twice the size, but it cuts the churn of all that rewriting. It lets HOT updates occur, and it also speeds up updates because the relation doesn't have to be extended all the time.
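A minimal sketch, assuming the table and its clustering index are named sales and sales_pos_product_idx:

```sql
ALTER TABLE sales SET (fillfactor = 45);
-- CLUSTER rewrites the table, so the new fillfactor takes effect here:
CLUSTER sales USING sales_pos_product_idx;
```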

Best of all - try to find a way to avoid having to bulk update the whole table periodically.

answered Oct 12 '22 by Craig Ringer