Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL partitioning with joined table - partition constraint not used in query plan

I have a large table in PostgreSQL 9.2 that I've partitioned as described in the manual. Well... almost! My real partition key is not in the partitioned table itself, but in a joined table, like this (simplified):

-- millions to tens of millions of rows
CREATE TABLE data
(
  slice_id integer NOT NULL,
  point_id integer NOT NULL,
  -- ... data columns ...,
  CONSTRAINT pk_data PRIMARY KEY (slice_id, point_id),
  CONSTRAINT fk_data_slice FOREIGN KEY (slice_id) REFERENCES slice (id)
  CONSTRAINT fk_data_point FOREIGN KEY (point_id) REFERENCES point (id)
)

-- hundreds to thousands of rows
CREATE TABLE slice
(
  id serial NOT NULL,
  partition_date timestamp without time zone NOT NULL,
  other_date timestamp without time zone NOT NULL,
  int_key integer NOT NULL
  CONSTRAINT pk_slice PRIMARY KEY (id)
)

-- about 40,000 rows
CREATE TABLE point
(
  -- ... similar to "slice" ...
)

The table to be partitioned (data) contains rows for every combination of point and slice, each of which has a compound key. I want to partition it on only one of the key columns, partition_date, which is part of slice. Of course, the check constraints on my child tables cannot include that directly, so instead I include the range of all slice.id values corresponding to that partition_date, like this:

ALTER TABLE data_part_123 ADD CONSTRAINT ck_data_part_123
    CHECK (slice_id >= 1234 AND slice_id <= 1278);

This all works fine for inserting data. However, queries do not use the above CHECK constraint. Eg.

SELECT *
FROM data d
JOIN slice s ON d.slice_id = s.id
WHERE s.partition_date = '2013-07-23'

I can see in the query plan that this still scans all child tables. I've tried rewriting the query in several ways, including a CTE and a sub-select, but that hasn't helped.

Is there any way I can get the planner to "understand" my partitioning scheme? I don't really want to duplicate the partition key millions of times in the data table.

Query plan looks like this:

Aggregate  (cost=539243.88..539243.89 rows=1 width=0)
  ->  Hash Join  (cost=8.88..510714.02 rows=11411945 width=0)
        Hash Cond: (d.slice_id = s.id)
        ->  Append  (cost=0.00..322667.41 rows=19711542 width=4)
              ->  Seq Scan on data d  (cost=0.00..0.00 rows=1 width=4)
              ->  Seq Scan on data_part_123 d  (cost=0.00..135860.10 rows=8299610 width=4)
              ->  Seq Scan on data_part_456 d  (cost=0.00..186807.31 rows=11411931 width=4)
        ->  Hash  (cost=7.09..7.09 rows=143 width=4)
              ->  Seq Scan on slice s  (cost=0.00..7.09 rows=143 width=4)
                    Filter: (partition_date = '2013-07-23 00:00:00'::timestamp without time zone)
like image 753
EM0 Avatar asked Feb 16 '23 20:02

EM0


1 Answers

The only way to achieve it is to make the query dynamic:

create function select_from_data (p_date date)
returns setof data as $function$

declare
    min_slice_id integer,
    max_slice_id integer;

begin
    select min(slice_id), max(slice_id)
    into min_slice_id, max_slice_id
    from slice
    where partition_date = p_date;

return query execute
    $dynamic$
        select *
        from data
        where slice_id between $1 and $2
    $dynamic$
    using min_slice_id, max_slice_id;

end;
$function$ language plpgsql;

This will build the query with the appropriate slice range for the given date and will plan it at run time when the planner will have the information it needs to check for the exact partitions.

To make the function more generic without loosing the planner's ability to get information at running time use the or parameter is null construct in the filter.

create function select_from_data (
    p_date date,
    value_1 integer default null,
    value_2 integer default null
)
returns setof data as $function$

declare
    min_slice_id integer,
    max_slice_id integer;

begin
    select min(slice_id), max(slice_id)
    into min_slice_id, max_slice_id
    from slice
    where partition_date = p_date;

return query execute
    $dynamic$
        select *
        from data
        where
            slice_id between $1 and $2
            and (some_col = $3 or $3 is null)
            and (another_col = $4 or $4 is null)
    $dynamic$
    using min_slice_id, max_slice_id, value_1, value_2;

end;
$function$ language plpgsql;

Now if some parameter was passed as null it will not interfere with the query.

like image 67
Clodoaldo Neto Avatar answered Feb 18 '23 09:02

Clodoaldo Neto