
Estimating size of Postgres indexes

I'm trying to get a better understanding of the tradeoffs involved in creating Postgres indexes. As part of that, I'd love to understand how much space indexes usually use. I've read through the docs, but can't find any information on this. I've been doing my own little experiments creating tables and indexes, but it would be amazing if someone could offer an explanation of why the size is what it is. Assume a common table like this with 1M rows, where each row has a unique id and a unique outstanding value.

CREATE TABLE account (
    id integer,
    active boolean NOT NULL,
    outstanding double precision NOT NULL
);

and the indexes created by

  • CREATE INDEX id_idx ON account(id)
  • CREATE INDEX outstanding_idx ON account(outstanding)
  • CREATE INDEX id_outstanding_idx ON account(id, outstanding)
  • CREATE INDEX active_idx ON account(active)
  • CREATE INDEX partial_id_idx ON account(id) WHERE active

What would you estimate the index sizes to be in bytes and more importantly, why?

asked Sep 06 '20 by pir

2 Answers

Since you did not specify the index type, I'll assume default B-tree indexes. Other types can be a lot different.

Here is a simplistic function to compute the estimated minimum size in bytes for an index on the given table with the given columns:

CREATE OR REPLACE FUNCTION f_index_minimum_size(_tbl regclass, _cols VARIADIC text[], OUT estimated_minimum_size bigint)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _missing_column text;
BEGIN

-- assert
SELECT i.attname
FROM   unnest(_cols) AS i(attname)
LEFT   JOIN pg_catalog.pg_attribute a ON a.attname = i.attname
                                     AND a.attrelid = _tbl
WHERE  a.attname IS NULL
INTO   _missing_column;

IF FOUND THEN
   RAISE EXCEPTION 'Table % has no column named %', _tbl, quote_ident(_missing_column);
END IF;


SELECT INTO estimated_minimum_size
       COALESCE(1 + ceil(reltuples/trunc((blocksize-page_overhead)/(4+tuple_size)))::int, 0) * blocksize -- AS estimated_minimum_size
FROM  (
   SELECT maxalign, blocksize, reltuples, fillfactor, page_overhead
        , (maxalign  -- up to 16 columns, else nullbitmap may force another maxalign step
         + CASE WHEN datawidth <= maxalign  THEN maxalign
                WHEN datawidth%maxalign = 0 THEN datawidth
                ELSE                            (datawidth + maxalign) - datawidth%maxalign END  -- add padding to the data to align on MAXALIGN
          ) AS tuple_size
   FROM  (
      SELECT c.reltuples, count(*)
           , 90 AS fillfactor
           , current_setting('block_size')::bigint AS blocksize
           , CASE WHEN version() ~ '64-bit|x86_64|ppc64|ia64|amd64|mingw32'  -- MAXALIGN: 4 on 32bits, 8 on 64bits
                  THEN 8 ELSE 4 END AS maxalign
           , 40 AS page_overhead  -- 24 bytes page header + 16 bytes "special space"
           -- avg data width without null values
           , sum(ceil((1-COALESCE(s.null_frac, 0)) * COALESCE(s.avg_width, 1024))::int) AS datawidth  -- ceil() because avg width has a low bias
      FROM   pg_catalog.pg_class     c
      JOIN   pg_catalog.pg_attribute a ON a.attrelid = c.oid
      JOIN   pg_catalog.pg_stats     s ON s.schemaname = c.relnamespace::regnamespace::text
                                      AND s.tablename  = c.relname
                                      AND s.attname    = a.attname
      WHERE  c.oid = _tbl
      AND    a.attname = ANY(_cols) --  all exist, verified above
      GROUP  BY 1
      ) sub1
   ) sub2;
END
$func$;

Call examples:

SELECT f_index_minimum_size('my_table', 'col1', 'col2', 'col3');

SELECT f_index_minimum_size('public.my_table', VARIADIC '{col1, col2, col3}');

db<>fiddle here

About VARIADIC parameters:

  • Return rows matching elements of input array in plpgsql function

Basically, all indexes use data pages, typically with a block size of 8 kb (rarely 4 kb). A B-tree index starts with one data page of overhead. Each additional data page has a fixed overhead of 40 bytes (currently): a 24-byte page header plus 16 bytes of "special space". Each page stores tuples as depicted in the manual. Each tuple consists of a tuple header (typically 8 bytes incl. alignment padding), possibly a null bitmap, the data (possibly incl. alignment padding between columns for multicolumn indexes), and possibly alignment padding up to the next multiple of MAXALIGN (typically 8 bytes). In addition, there is an ItemId of 4 bytes per tuple. Some space may be reserved initially for later additions with a fillfactor - 90 % by default for B-tree indexes.
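The same per-tuple arithmetic can be sketched in Python for the question's id_idx (1M four-byte integers). The constants below mirror the assumptions in the function above (8 kb blocks, 64-bit MAXALIGN of 8); remember the result is a minimum, and a real index is typically around 25 % bigger.

```python
from math import ceil

# Constants mirroring the SQL function above (assumed: 8 kb blocks, 64-bit build).
BLOCK_SIZE = 8192
PAGE_OVERHEAD = 40   # 24-byte page header + 16 bytes "special space"
MAXALIGN = 8         # 8 on 64-bit platforms, 4 on 32-bit
ITEM_ID = 4          # per-tuple line pointer

def btree_min_size(reltuples: int, datawidth: int) -> int:
    """Estimated minimum B-tree index size in bytes, as in the SQL function."""
    # Tuple header takes one MAXALIGN step; data is padded up to MAXALIGN.
    if datawidth <= MAXALIGN:
        padded = MAXALIGN
    elif datawidth % MAXALIGN == 0:
        padded = datawidth
    else:
        padded = datawidth + MAXALIGN - datawidth % MAXALIGN
    tuple_size = MAXALIGN + padded
    tuples_per_page = (BLOCK_SIZE - PAGE_OVERHEAD) // (ITEM_ID + tuple_size)
    # One extra page of overhead, then enough full pages for all tuples.
    return (1 + ceil(reltuples / tuples_per_page)) * BLOCK_SIZE

# id_idx: 4-byte int padded to 8, plus 8-byte header and 4-byte ItemId
# = 20 bytes per entry -> 407 entries per page.
print(btree_min_size(1_000_000, 4))    # 20144128 bytes, roughly 19 MiB
# id_outstanding_idx: 4 + 8 = 12 bytes of data, padded to 16 -> 28 bytes per entry.
print(btree_min_size(1_000_000, 12))   # 28164096 bytes, roughly 27 MiB
```

So even for a single integer column, each entry effectively costs 20 bytes, not 4 - alignment padding and per-tuple overhead dominate for narrow indexes.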

Important notes & disclaimers

  • The reported size is the estimated minimum size. An actual index will typically be bigger by around 25 % due to natural bloat from page splits. Plus, the calculation does not take possible alignment padding between multiple columns into account, which can add another couple of percent (or more in extreme cases). See:
  • Calculating and saving space in PostgreSQL
  • Estimations are based on column statistics in the view pg_stats, which is based on the system table pg_statistic. (Using the latter directly would be faster, but is only allowed for superusers.) In particular, the calculation is based on null_frac, the "fraction of column entries that are null", and avg_width, the "average width in bytes of column's entries", to compute an average data width - ignoring possible additional alignment padding for multicolumn indexes.
  • The default 90 % fillfactor is taken into account. (One might specify a different one.)
  • Up to 50 % bloat is typically natural for B-tree indexes and nothing to worry about.
  • Does not work for expression indexes.
  • No provision for partial indexes.
  • The function raises an exception if anything but existing plain column names is passed. Case-sensitive!
  • If the table is new (or whenever statistics may be out of date), be sure to run ANALYZE on the table before calling the function, to update (or even initiate!) the statistics.
  • Due to major optimizations, B-tree indexes in Postgres 12 waste less space and are typically closer to the reported minimum size.
  • Does not account for the deduplication introduced with Postgres 13, which can compact indexes with duplicate values.
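The average-data-width computation drawn from pg_stats can be sketched like this; the stats values below are hypothetical, and the 1024-byte fallback for a missing avg_width mirrors the COALESCE in the SQL above.

```python
from math import ceil

def avg_data_width(cols) -> int:
    """Average index-tuple data width from pg_stats-style numbers.

    cols: iterable of (null_frac, avg_width) pairs, one per indexed column.
    Mirrors sum(ceil((1 - null_frac) * COALESCE(avg_width, 1024))) in the
    SQL function; ceil() compensates for avg_width's low bias.
    """
    return sum(
        ceil((1 - (null_frac or 0)) * (avg_width if avg_width is not None else 1024))
        for null_frac, avg_width in cols
    )

# Hypothetical stats: a NOT NULL integer plus a double with 20 % NULLs.
print(avg_data_width([(0.0, 4), (0.2, 8)]))   # 4 + ceil(6.4) = 11
```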

Parts of the code are taken from ioguix' bloat estimation queries here:

  • https://github.com/ioguix/pgsql-bloat-estimation

More gory details in the Postgres source code here:

  • https://doxygen.postgresql.org/bufpage_8h_source.html
answered Oct 23 '22 by Erwin Brandstetter


You can calculate it yourself. Each index entry has an overhead of 8 bytes. Add the average size of your indexed data (in the internal binary format).

There is some more overhead, like page header and footer and internal index pages, but that doesn't account for much, unless your index rows are very wide.
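Applying this rule of thumb to the question's 1M-row table gives a quick lower bound; the sketch below is hypothetical and deliberately ignores the line pointers, page overhead, alignment, and fillfactor mentioned above, so actual indexes will be noticeably larger.

```python
ROWS = 1_000_000
ENTRY_OVERHEAD = 8  # per-entry overhead, per the answer above

def rough_index_bytes(rows: int, avg_data_bytes: int) -> int:
    """Back-of-the-envelope index size: per-entry overhead plus data width."""
    return rows * (ENTRY_OVERHEAD + avg_data_bytes)

print(rough_index_bytes(ROWS, 4))    # id_idx (integer):            12000000
print(rough_index_bytes(ROWS, 8))    # outstanding_idx (double):    16000000
print(rough_index_bytes(ROWS, 12))   # id_outstanding_idx (int+dp): 20000000
```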

answered Oct 23 '22 by Laurenz Albe