My question relates to the innards of how Postgres works:
I have a table:
CREATE TABLE A (
    id SERIAL,
    name VARCHAR(32),
    type VARCHAR(32) NOT NULL,
    priority SMALLINT NOT NULL,
    x SMALLINT NOT NULL,
    y SMALLINT NOT NULL,
    start timestamp with time zone,
    "end" timestamp with time zone,   -- "end" is a reserved word and must be quoted
    state Astate NOT NULL,
    other_table_id1 bigint REFERENCES W,
    other_table_id2 bigint NOT NULL REFERENCES S,
    PRIMARY KEY (id)
);
with additional indexes on other_table_id1, state and other_table_id2.
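Roughly, those look like this (the actual index names differ and are only placeholders here):

CREATE INDEX a_other_table_id1_idx ON A (other_table_id1);
CREATE INDEX a_other_table_id2_idx ON A (other_table_id2);
CREATE INDEX a_state_idx           ON A (state);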
The table is quite large and sees very many updates to two columns: other_table_id1 and state. The start and end columns see a few updates, but the rest of the columns are immutable. (Astate is an enumerated type for the state column.)
I'm wondering whether it makes sense to split the two most frequently updated columns out into a separate table. What I'm hoping to gain is performance when I'm just looking up that info, and cheaper updates, because (maybe?) reading and writing the shorter row is less costly. I need to weigh that against the cost of the joins that will occasionally be needed to get all the data for a particular item at once.
At one point, I was under the impression that each column is stored separately. I later revised that thinking when I read somewhere that narrowing a column on one side of the table improves performance even when looking up data by another column, because the whole row is stored together, so the overall row length becomes shorter. So I'm now under the impression that all the data for a row is physically stored together on disk, which makes the proposed split sound helpful. When I currently write 4 bytes to update the state, am I to believe I'm also rewriting the 64 bytes of text (name, type) that never change?
I'm not very experienced with table normalization and not familiar with the internals of Postgres, so I'm looking for advice, and especially for best practices for estimating the tradeoff without having to do the work first and only then finding out whether it was worthwhile. The change would require a fair bit of effort rewriting queries that have already been highly optimized, so I would rather go in with a good understanding of what result to expect. Thanks, M.
Yes, the number of columns will indirectly influence performance. The data in those columns will also affect the speed.
There is a limit on how many columns a table can contain. Depending on the column types, it is between 250 and 1600.
In many cases, it may be best to split information into multiple related tables, so that there is less redundant data and fewer places to update.
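As a rough sketch of the split described in the question (the table name and exact column layout here are just placeholders), the two hot columns could live in their own narrow table keyed by the same id:

-- narrow "hot" table holding only the frequently updated columns
CREATE TABLE a_hot (
    id integer PRIMARY KEY REFERENCES A (id),
    state Astate NOT NULL,
    other_table_id1 bigint REFERENCES W
);

An update to state or other_table_id1 then rewrites only this short row, and a join on id reassembles the full record on the occasions when everything is needed at once.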
PostgreSQL normally stores its table data in blocks of 8KB. The number of these blocks per table is limited to an unsigned 32-bit integer (just over four billion), giving a maximum table size of 32TB.
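You can see the block size and how many blocks a given table currently occupies; for example (substitute your real table name for a):

SHOW block_size;

SELECT pg_relation_size('a') / current_setting('block_size')::int AS blocks,
       pg_size_pretty(pg_relation_size('a'))                      AS heap_size;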
There is a definite cost to updating a larger row. Because of PostgreSQL's MVCC design, an UPDATE writes a complete new version of the row rather than modifying the changed bytes in place, so yes: changing a 4-byte state value rewrites the whole row, including the text columns that never change.
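To get a feel for how much data each row version carries, pg_column_size over a whole-row reference gives a reasonable approximation of the per-row size:

-- approximate average size of one row version, in bytes
SELECT round(avg(pg_column_size(a.*))) AS avg_row_bytes FROM A AS a;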
A formula can help with this. If you do not split, your costs are
Cost = xU + yS
where:
U = cost of updating the entire (unsplit) row
S = cost of a select
x, y = how many updates and selects occur, respectively
Then, if you split it, you are comparing that against:
Cost = gU1 + hU2 + xS1 + yS2
where
U1 = cost of an update on the smaller table (cheaper, because the row is narrow)
U2 = cost of an update on the larger table
S1 = cost of a select from the smaller table
S2 = cost of a select from the larger table
g, h, x, y = how often each of those actions occurs
So if g >> h, i.e. almost all updates touch only the small table, it pays to break them up; and if on top of that x >> y, i.e. most selects also need only the small table, it really pays.
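To make that concrete with invented numbers (placeholders, not measurements): say g = 1,000,000 updates of the small table, h = 1,000 updates of the large one, and updating the narrow row costs U1 = 1 unit against U2 = 5 units for the wide row. Treating the unsplit per-update cost U as roughly U2:

unsplit updates: x*U  ≈ (g + h)*U2 = 1,001,000 * 5 = 5,005,000
split updates:   g*U1 + h*U2       = 1,000,000 + 5,000 = 1,005,000

The update side comes out roughly five times cheaper after the split; the select terms (xS1 + yS2 versus yS) then decide how much of that the occasional extra join gives back.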
EDIT: In response to comments, I would also point out that these costs become far more important if the database is under sustained load with no idle time. If instead the server is mostly idle, handling just 1 or 2 transactions per second with long stretches (where "long" means a few seconds) of inactivity, then, if it were me, I would not complicate my code, because the performance benefit would never show up as anything measurable.
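If you do want numbers before committing to the rewrite, one cheap sanity check on the existing table is to compare the buffer traffic of a hot update against a hot lookup (the id value 42 and the 'active' state below are just placeholders):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE A SET state = 'active' WHERE id = 42;
ROLLBACK;  -- EXPLAIN ANALYZE really executes the update, so roll it back

EXPLAIN (ANALYZE, BUFFERS)
SELECT state, other_table_id1 FROM A WHERE id = 42;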