
Postgres: one table with many columns or several tables with fewer columns?

My question relates to the innards of how Postgres works:

I have a table:


-- Astate is an enumerated type; W and S are other tables in the schema.
CREATE TABLE A (
   id SERIAL,
   name VARCHAR(32),
   type VARCHAR(32) NOT NULL,
   priority SMALLINT NOT NULL,
   x SMALLINT NOT NULL,
   y SMALLINT NOT NULL,
   start timestamp with time zone,
   "end" timestamp with time zone,  -- "end" is a reserved word, so it must be quoted
   state Astate NOT NULL,
   other_table_id1 bigint REFERENCES W,
   other_table_id2 bigint NOT NULL REFERENCES S,
   PRIMARY KEY(id)
);

with additional indexes on other_table_id1, state and other_table_id2.
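The index definitions aren't spelled out above; presumably they are ordinary single-column indexes along these lines (the index names here are made up for illustration):

CREATE INDEX a_other_table_id1_idx ON A (other_table_id1);
CREATE INDEX a_state_idx ON A (state);
CREATE INDEX a_other_table_id2_idx ON A (other_table_id2);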

The table is quite large and sees very many updates to the other_table_id1 and state columns, a few updates to the start and end columns, and none at all to the rest. (Astate is an enumerated type used for the state column.)

I'm wondering whether it makes sense to split the two most frequently updated columns out into a separate table. What I'm hoping to gain is performance when I'm looking up just that info, and cheaper updates, because (maybe?) reading and writing a shorter row is less costly. I need to weigh that against the cost of the joins that are (occasionally) needed to get all the data for a particular item at once.
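For concreteness, a minimal sketch of the kind of split I have in mind might look like this (the A_hot name and exact column choices are only for illustration):

-- Narrow table holding the frequently updated columns, sharing A's key;
-- state and other_table_id1 would be dropped from A itself.
CREATE TABLE A_hot (
   id integer PRIMARY KEY REFERENCES A(id),
   state Astate NOT NULL,
   other_table_id1 bigint REFERENCES W
);

-- Getting everything for one item then needs a join:
SELECT a.*, h.state, h.other_table_id1
FROM A a
JOIN A_hot h ON h.id = a.id
WHERE a.id = 123;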

At one point, I was under the impression that each column is stored separately. But I modified my thinking when I read somewhere that narrowing a column on one side of a table improves performance even for lookups on another column, because the whole row is stored together, so the overall row length would be shorter. So I'm now under the impression that all the data for a row is physically stored together on disk, which makes the proposed split sound helpful. When I currently write 4 bytes to update state, am I to believe I'm rewriting the 64 bytes of text (name, type) that never change?
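As a rough way to check the actual row width and update pattern rather than guessing, something like the following should work (pg_column_size over a whole-row value and the pg_stat_user_tables view are standard Postgres facilities; the results will of course depend on the real data):

-- average on-disk width of a row in A, in bytes
SELECT avg(pg_column_size(a.*)) AS avg_row_bytes FROM A a;

-- total updates on A, and how many were HOT updates
-- (HOT updates avoid touching the indexes)
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'a';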

I'm not very experienced with table "normalization" and not familiar with the internals of Postgres, so I'm looking for advice, and especially for best practices for estimating the tradeoff without having to do the work first and only then find out whether it was worthwhile. The change would require a fair bit of effort rewriting queries that have already been highly optimized, so I would rather go in with a good understanding of what result to expect. Thanks, M.

Asked Feb 02 '11 by Mayur Patel


People also ask

Does number of columns affect performance in PostgreSQL?

Yes, the number of columns will indirectly influence performance, and the data in those columns also affects speed.

How many columns should a Postgres table have?

There is a limit on how many columns a table can contain. Depending on the column types, it is between 250 and 1600.

Why is it better to have multiple separate tables?

In many cases, it may be best to split information into multiple related tables, so that there is less redundant data and fewer places to update.

How big is too big for a PostgreSQL table?

PostgreSQL normally stores its table data in blocks of 8KB. The number of these blocks per table is limited to a 32-bit value, giving a default maximum table size of 32TB.


1 Answer

There is a definite cost to updating a larger row.

A formula can help with this. If you do not split, your costs are

Cost = xU + yS

where:

U = an update of the entire row (table is not split)

S = cost of a select

x = number of updates, y = number of selects

Then, if you split it, you are trying to figure this:

Cost = gU1 + hU2 + xS1 + yS2

where

U1 = update of smaller table (lowest cost)

U2 = update of larger table (still cheaper than U, but costlier than U1)

S1 = select from smaller table

S2 = select from larger table

g,h,x,y = how often the individual actions occur

So if g >> h, it pays to break them up; if x >> y as well, it really pays.
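To make that concrete with made-up numbers: suppose the hot columns are updated a million times (g = 1,000,000) for every thousand updates of the rest of the row (h = 1,000), and most reads need only the hot columns (x >> y). Then nearly all of the write traffic pays the cheap U1 price instead of U, and the join overhead applies only to the comparatively rare full-row reads, so the split is a clear win. If instead g ≈ h and most reads need every column, the join on nearly every read can easily eat the savings on writes.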

EDIT: In response to comments, I would also point out that these costs matter far more if the database is under sustained load with no idle time. If instead the server is mostly idle, handling just 1 or 2 transactions per second with long stretches (where "long" means a few seconds) of inactivity, then, if it were me, I would not complicate my code, because the performance benefit would never show up as anything measurable.

Answered Sep 23 '22 by Ken Downs