 

Calculating and saving space in PostgreSQL

I have a table in pg like so:

CREATE TABLE t (
    a BIGSERIAL NOT NULL,               -- 8 b
    b SMALLINT,                         -- 2 b
    c SMALLINT,                         -- 2 b
    d REAL,                             -- 4 b
    e REAL,                             -- 4 b
    f REAL,                             -- 4 b
    g INTEGER,                          -- 4 b
    h REAL,                             -- 4 b
    i REAL,                             -- 4 b
    j SMALLINT,                         -- 2 b
    k INTEGER,                          -- 4 b
    l INTEGER,                          -- 4 b
    m REAL,                             -- 4 b
    CONSTRAINT a_pkey PRIMARY KEY (a)
);

The above adds up to 50 bytes per row. My experience is that I need another 40% to 50% for system overhead, without even any user-created indexes to the above. So, about 75 bytes per row. I will have many, many rows in the table, potentially upward of 145 billion rows, so the table is going to be pushing 13-14 terabytes. What tricks, if any, could I use to compact this table? My possible ideas below ...

Convert the REAL values to integers. If they can be stored as SMALLINT, that is a saving of 2 bytes per field.
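For example, a REAL that only ever holds values with a fixed number of decimal places and a limited range could be stored as a scaled SMALLINT. This is a sketch with hypothetical column names, not the actual schema:

```sql
-- Sketch: a REAL with at most 2 decimal places and a range within
-- roughly ±327.67 could be stored as round(value * 100) in a SMALLINT,
-- using 2 bytes instead of 4.
CREATE TABLE t_scaled (
    a BIGSERIAL PRIMARY KEY,
    d SMALLINT                  -- holds round(value * 100)
);

INSERT INTO t_scaled (d) VALUES (round(3.14 * 100));  -- stored as 314
SELECT a, d / 100.0 AS d_value FROM t_scaled;          -- read back as 3.14
```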

Convert the columns b .. m into an array. I don't need to search on those columns, but I do need to be able to return one column's value at a time. So, if I need column g, I could do something like

SELECT a, arr[5] FROM t;
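For reference, the array variant being considered might look like the following sketch. It assumes all values are coerced to one common element type such as REAL, which loses the per-column types:

```sql
CREATE TABLE t_arr (
    a   BIGSERIAL PRIMARY KEY,
    arr REAL[]          -- columns b .. m packed into a single array
);

-- PostgreSQL array subscripts are 1-based by default; the exact index
-- for a given column depends on the packing order chosen.
SELECT a, arr[6] FROM t_arr;
```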

Would I save space with the array option? Would there be a speed penalty?

Any other ideas?

asked Jun 03 '10 by punkish



4 Answers

"Column Tetris"

Actually, you can do something, but this needs deeper understanding. The keyword is alignment padding. Every data type has specific alignment requirements.

You can minimize space lost to padding between columns by ordering them favorably. The following (extreme) example would waste a lot of physical disk space:

CREATE TABLE t (
    e int2    -- 6 bytes of padding after int2
  , a int8
  , f int2    -- 6 bytes of padding after int2
  , b int8
  , g int2    -- 6 bytes of padding after int2
  , c int8
  , h int2    -- 6 bytes of padding after int2
  , d int8
);

To save 24 bytes per row, use instead:

CREATE TABLE t (
    a int8
  , b int8
  , c int8
  , d int8
  , e int2
  , f int2
  , g int2
  , h int2
);  -- 4 int2 occupy 8 bytes (MAXALIGN), no padding at the end


As a rule of thumb: if you put 8-byte columns first, then 4-byte, 2-byte, and 1-byte columns last, you can't go wrong.

boolean, uuid (!) and a few other types need no alignment padding. text, varchar and other "varlena" (variable length) types nominally require "int" alignment (4 bytes on most machines). But I observed no alignment padding in disk format (unlike in RAM). Eventually, I found the explanation in a note in the source code:

Note also that we allow the nominal alignment to be violated when storing "packed" varlenas; the TOAST mechanism takes care of hiding that from most code.

So "int" alignment is only enforced when the (possibly compressed) datum including a single leading length-byte exceeds 127 bytes. Then varlena storage switches to four leading bytes and requires "int" alignment.

Normally, you may save a couple of bytes per row at best playing "column tetris". None of this is necessary in most cases. But with billions of rows it can mean a couple of gigabytes easily.

You can test the actual column / row size with the function pg_column_size().
Some types occupy more space in RAM than on disk (compressed or "packed" format). You can get bigger results for constants (RAM format) than for table columns when testing the same value (or row of values vs. table row) with pg_column_size().
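For instance (`pg_column_size()` is a real built-in; the literal values are just illustrations):

```sql
-- Size of individual values:
SELECT pg_column_size(1::smallint);   -- 2
SELECT pg_column_size(1::integer);    -- 4
SELECT pg_column_size(1::bigint);     -- 8

-- Size of a whole table row as a composite value
-- (includes record overhead, so it is larger than the sum of the columns):
SELECT pg_column_size(t.*) FROM t LIMIT 1;
```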

Finally, some types can be compressed or "toasted" (stored out of line) or both.

Overhead per tuple (row)

4 bytes per row for the item identifier - not subject to above considerations.
And at least 24 bytes (23 + padding) for the tuple header. The manual on Database Page Layout:

There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap, an optional object ID field, and the user data.

For the padding between header and user data, you need to know MAXALIGN on your server - typically 8 bytes on a 64-bit OS (or 4 bytes on a 32-bit OS). If you are not sure, check out pg_controldata.

Run the following in your Postgres binary dir to get a definitive answer:

./pg_controldata /path/to/my/dbcluster 

The manual:

The actual user data (columns of the row) begins at the offset indicated by t_hoff, which must always be a multiple of the MAXALIGN distance for the platform.

So you typically get the storage optimum by packing data in multiples of 8 bytes.

There is nothing to gain in the example you posted. It's already packed tightly. 2 bytes of padding after the last int2, 4 bytes at the end. You could consolidate the padding to 6 bytes at the end, which wouldn't change anything.
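To make that concrete, a fully "tetrised" ordering of the question's table would look like the sketch below. Both layouts round up to the same MAXALIGN multiple (52 and 50 bytes of user data both pad to 56 on a typical 64-bit machine), so it buys nothing here:

```sql
CREATE TABLE t_ordered (
    a BIGSERIAL NOT NULL,       -- 8-byte column first
    d REAL, e REAL, f REAL,     -- then the 4-byte columns
    g INTEGER, h REAL, i REAL,
    k INTEGER, l INTEGER, m REAL,
    b SMALLINT, c SMALLINT,     -- 2-byte columns last
    j SMALLINT,
    CONSTRAINT t_ordered_pkey PRIMARY KEY (a)
);
-- 8 + 9*4 + 3*2 = 50 bytes of data, padded to 56 -- same as the original.
```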

Overhead per data page

Data page size is typically 8 KB. Some overhead / bloat at this level, too: Remainders not big enough to fit another tuple, and more importantly dead rows or a percentage reserved with the FILLFACTOR setting.

There are a couple of other factors for size on disk to take into account:

  • How many records can I store in 5 MB of PostgreSQL on Heroku?
  • Does not using NULL in PostgreSQL still use a NULL bitmap in the header?
  • Configuring PostgreSQL for read performance

Array types?

With an array type like you were evaluating, you would add 24 bytes of overhead for the type. Plus, array elements occupy space as usual. Nothing to gain there.
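You can verify the array overhead directly with `pg_column_size()`; for small arrays the header dominates:

```sql
-- A 3-element real[]: 24 bytes of array overhead + 3 * 4 bytes of data
SELECT pg_column_size(ARRAY[1,2,3]::real[]);

-- Compare with three standalone REAL values: just 3 * 4 = 12 bytes
SELECT pg_column_size(1::real) * 3;
```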

answered Sep 17 '22 by Erwin Brandstetter

I see nothing to gain (and something to lose) in storing several numeric fields in an array.

The size of each numerical type is clearly documented, you should simply use the smallest sized type compatible with your desired range-resolution; and that's about all you can do.

I'm not sure whether there is some byte-alignment requirement for the columns along a row; if there is, reordering the columns could alter the space used - but I don't think so.

BTW, there is a fixed overhead per row, about 23 bytes.

answered Sep 20 '22 by leonbloy


From this great documentation: https://www.2ndquadrant.com/en/blog/on-rocks-and-sand/

For a table you already have, or perhaps one you're making in development, named my_table, this query will give the optimal order left to right.

SELECT a.attname, t.typname, t.typalign, t.typlen
FROM pg_class c
JOIN pg_attribute a ON (a.attrelid = c.oid)
JOIN pg_type t ON (t.oid = a.atttypid)
WHERE c.relname = 'my_table'
  AND a.attnum >= 0
ORDER BY t.typlen DESC;
answered Sep 21 '22 by jboxxx


Here is a cool tool related to Erwin's column-reordering suggestion: https://github.com/NikolayS/postgres_dba

It has the exact command for that -- p1:

(screenshot omitted)

Then it automatically shows you the real potential for columns reordering on all of your tables:

(screenshot omitted)

answered Sep 19 '22 by Алексей Лещук