I have a table clustered on two columns (point of sale and product ID). The only index is on those two columns, and it is the index the table is clustered on.
On a weekly basis, I update other columns in the table. When I do that, the size of the table and relations increases by about 5 times. I then cluster the table, and the size reverts to what it was pre-update.
This seems strange to me. If I were updating the indexed columns, I'd expect some bloat that I'd need to deal with by vacuuming, but since the indexed columns are not modified by any of the updates, I don't understand why updating the table would lead to an increase in size.
Is this working as expected, or does this point to a problem in my configuration? Is there a way to stop this?
[Postgres 9.1 on Windows 7]
PostgreSQL implements multiversioning by keeping the old version of the table row in the table: an UPDATE adds a new row version (“tuple”) of the row and marks the old version as invalid. In many respects, an UPDATE in PostgreSQL is not much different from a DELETE followed by an INSERT.
Even when no indexed columns change, PostgreSQL still has to do an MVCC update: it writes a new row version, and the old one is discarded later by vacuum. Otherwise it couldn't roll back the transaction if there was an error midway through or the server crashed. (PostgreSQL doesn't have an undo log; it uses the heap instead.)
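You can watch this happen on a single row. This is only a sketch: the table name sales and the columns pos_id, product_id and price are made up for illustration, so substitute your own.

```sql
-- Hypothetical names: sales, pos_id, product_id, price.
-- ctid is the physical location (page, item) of the current row version;
-- xmin is the ID of the transaction that created that version.
SELECT ctid, xmin, price
FROM sales
WHERE pos_id = 1 AND product_id = 42;

-- Update a column that is not part of any index.
UPDATE sales
SET price = price + 1
WHERE pos_id = 1 AND product_id = 42;

-- Same logical row, but ctid and xmin should now differ,
-- because the UPDATE wrote a new tuple instead of changing the old one.
SELECT ctid, xmin, price
FROM sales
WHERE pos_id = 1 AND product_id = 42;
```

The old tuple stays in the table until vacuum removes it, which is where the extra space comes from.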
HOT updates can only be done if there's enough free space in a page to avoid having to write the new row to a different page, where new index entries must then be created. So PostgreSQL still has to write new rows to new pages on the end of the table, even though you aren't updating indexed columns, because there's just nowhere to put the new row versions on the current pages.
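To check whether your updates are actually HOT, you can compare the update counters in the statistics view. Again a sketch, with sales standing in for your real table name:

```sql
-- 'sales' is a placeholder table name.
-- n_tup_upd     : total rows updated
-- n_tup_hot_upd : rows updated via HOT (no new index entries needed)
-- n_dead_tup    : estimated dead row versions waiting for vacuum
SELECT relname, n_tup_upd, n_tup_hot_upd, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'sales';
```

If n_tup_hot_upd stays far below n_tup_upd after your weekly run, most new row versions are landing on other pages, which matches the growth you're seeing.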
I'd usually only expect a doubling of space, but if you're doing a series of updates without vacuum catching up in between, then more increases would be expected. Try to do all your updates in one pass, or VACUUM between passes.
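Roughly, the two options look like this. The table sales and the columns price and stock are placeholders, and the SET expressions are only examples of whatever your weekly job computes.

```sql
-- Option 1: one pass, so each row gets only one new version.
UPDATE sales
SET price = price * 2,
    stock = 0;

-- Option 2: separate passes, with a VACUUM in between so the dead
-- versions from the first pass can be reused by the second.
UPDATE sales SET price = price * 2;
VACUUM sales;   -- must run outside a transaction block
UPDATE sales SET stock = 0;
VACUUM sales;
```

Note that a plain VACUUM marks the dead space as reusable; it generally doesn't shrink the file on disk, so it limits further growth rather than undoing growth that has already happened.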
To make the updates faster at the cost of some disk space, ALTER TABLE to set a non-100 FILLFACTOR on your table before you CLUSTER it. I suggest 45, enough room for one new version of each row plus a little wiggle space. That'll make the table twice the size but reduce the churn of all that rewriting. It'll let HOT updates occur and also speed up updates because there's no need to extend the relation all the time.
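Something along these lines, assuming the table is called sales (a placeholder) and has already been clustered once so PostgreSQL remembers which index to use:

```sql
-- 'sales' is a placeholder table name.
-- Leave roughly 55% of each page free for new row versions.
-- This only changes the setting; existing pages are not repacked yet.
ALTER TABLE sales SET (fillfactor = 45);

-- Rewrite the table so the new fillfactor takes effect.
-- Without USING, CLUSTER reuses the index the table was last clustered on.
CLUSTER sales;

-- Compare the total size (table plus indexes) before and after the weekly update.
SELECT pg_size_pretty(pg_total_relation_size('sales'));
```

Running the size query right after the CLUSTER and again after the weekly update should show whether the growth has dropped off.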
Best of all - try to find a way to avoid having to bulk update the whole table periodically.