My question relates to the innards of how Postgres works:
I have a table:
CREATE TABLE A (
    id SERIAL,
    name VARCHAR(32),
    type VARCHAR(32) NOT NULL,
    priority SMALLINT NOT NULL,
    x SMALLINT NOT NULL,
    y SMALLINT NOT NULL,
    start timestamp with time zone,
    "end" timestamp with time zone,   -- "end" is a reserved word and must be quoted
    state Astate NOT NULL,
    other_table_id1 bigint REFERENCES W,
    other_table_id2 bigint NOT NULL REFERENCES S,
    PRIMARY KEY (id)
);
with additional indexes on other_table_id1, state and other_table_id2.
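Roughly, those look like this (the actual index names differ and are only placeholders here):

CREATE INDEX a_other_table_id1_idx ON A (other_table_id1);
CREATE INDEX a_other_table_id2_idx ON A (other_table_id2);
CREATE INDEX a_state_idx           ON A (state);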
The table is quite large and sees very many updates to two columns: other_table_id1 and state. The start and end columns see a few updates, but the rest of the columns are immutable. (Astate is an enumerated type for the state column.)
I'm wondering whether it makes sense to split the two most frequently updated columns out into a separate table. What I'm hoping to gain is performance when I'm just looking up that info, and cheaper updates, because (maybe?) reading and writing the shorter row is less costly. I need to weigh that against the cost of the joins that will occasionally be needed to get all the data for a particular item at once.
At one point, I was under the impression that each column is stored separately. I later revised that thinking when I read somewhere that narrowing a column on one side of the table improves performance even when looking up data by another column, because the whole row is stored together, so the overall row length becomes shorter. So I'm now under the impression that all the data for a row is physically stored together on disk, which makes the proposed split sound helpful. When I currently write 4 bytes to update the state, am I to believe I'm also rewriting the 64 bytes of text (name, type) that never change?
I'm not very experienced with table normalization and not familiar with the internals of Postgres, so I'm looking for advice, and especially for best practices for estimating the tradeoff without having to do the work first and only then finding out whether it was worthwhile. The change would require a fair bit of effort rewriting queries that have already been highly optimized, so I would rather go in with a good understanding of what result to expect. Thanks, M.
Yes, the number of columns will indirectly influence performance. The data in those columns will also affect the speed.
There is a limit on how many columns a table can contain. Depending on the column types, it is between 250 and 1600.
In many cases, it may be best to split information into multiple related tables, so that there is less redundant data and fewer places to update.
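As a rough sketch of the split described in the question (the table name and exact column layout here are just placeholders), the two hot columns could live in their own narrow table keyed by the same id:

-- narrow "hot" table holding only the frequently updated columns
CREATE TABLE a_hot (
    id integer PRIMARY KEY REFERENCES A (id),
    state Astate NOT NULL,
    other_table_id1 bigint REFERENCES W
);

An update to state or other_table_id1 then rewrites only this short row, and a join on id reassembles the full record on the occasions when everything is needed at once.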
PostgreSQL normally stores its table data in blocks of 8KB. The number of these blocks per table is limited to an unsigned 32-bit integer (just over four billion), giving a maximum table size of 32TB.
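You can see the block size and how many blocks a given table currently occupies; for example (substitute your real table name for a):

SHOW block_size;

SELECT pg_relation_size('a') / current_setting('block_size')::int AS blocks,
       pg_size_pretty(pg_relation_size('a'))                      AS heap_size;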
There is a definite cost to updating a larger row. Because of PostgreSQL's MVCC design, an UPDATE writes a complete new version of the row rather than modifying the changed bytes in place, so yes: changing a 4-byte state value rewrites the whole row, including the text columns that never change.
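To get a feel for how much data each row version carries, pg_column_size over a whole-row reference gives a reasonable approximation of the per-row size:

-- approximate average size of one row version, in bytes
SELECT round(avg(pg_column_size(a.*))) AS avg_row_bytes FROM A AS a;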
A formula can help with this. If you do not split, your costs are
Cost = xU + yS
where:
U = cost of updating the entire (unsplit) row
S = cost of a select
x, y = how many updates and selects occur, respectively
Then, if you split it, you are comparing that against:
Cost = gU1 + hU2 + xS1 + yS2
where
U1 = cost of an update on the smaller table (cheaper, because the row is narrow)
U2 = cost of an update on the larger table
S1 = cost of a select from the smaller table
S2 = cost of a select from the larger table
g, h, x, y = how often each of those actions occurs
So if g >> h, i.e. almost all updates touch only the small table, it pays to break them up; and if on top of that x >> y, i.e. most selects also need only the small table, it really pays.
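To make that concrete with invented numbers (placeholders, not measurements): say g = 1,000,000 updates of the small table, h = 1,000 updates of the large one, and updating the narrow row costs U1 = 1 unit against U2 = 5 units for the wide row. Treating the unsplit per-update cost U as roughly U2:

unsplit updates: x*U  ≈ (g + h)*U2 = 1,001,000 * 5 = 5,005,000
split updates:   g*U1 + h*U2       = 1,000,000 + 5,000 = 1,005,000

The update side comes out roughly five times cheaper after the split; the select terms (xS1 + yS2 versus yS) then decide how much of that the occasional extra join gives back.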
EDIT: In response to comments, I would also point out that these costs become far more important if the database is under sustained load with no idle time. If instead the server is mostly idle, handling just 1 or 2 transactions per second with long stretches (where "long" means a few seconds) of inactivity, then, if it were me, I would not complicate my code, because the performance benefit would never show up as anything measurable.
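If you do want numbers before committing to the rewrite, one cheap sanity check on the existing table is to compare the buffer traffic of a hot update against a hot lookup (the id value 42 and the 'active' state below are just placeholders):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE A SET state = 'active' WHERE id = 42;
ROLLBACK;  -- EXPLAIN ANALYZE really executes the update, so roll it back

EXPLAIN (ANALYZE, BUFFERS)
SELECT state, other_table_id1 FROM A WHERE id = 42;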