Best way to exclude outdated data from a search in PostgreSQL

I have a table containing the following columns:

  • an integer column named id
  • a text column named value
  • a timestamp column named creation_date

Currently, indexes have been created for the id and value columns.

I must search this table for a given value and want to make the search as fast as I can. But I don't really need to look through records that are older than one month. So, ideally, I would like to exclude them from the index.

What would be the best way to achieve this:

  1. Perform table partitioning. Only search through the subtable for the appropriate month.
  2. Create a partial index including only the recent records. Recreate it every month.
  3. Something else?

(P.S.: "the best solution" means the solution that is the most convenient, fast, and easy to maintain.)

asked Oct 04 '22 by boqapt

1 Answer

Partial index

A partial index would be perfect for that, or even a partial multicolumn index. But your condition

don't need to search value in records older than one month

is not stable. The condition of a partial index can only work with literals or IMMUTABLE functions, i.e., constant values. You mention "Recreate it every month", but that does not agree with your definition "older than one month". You see the difference, right?
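To illustrate with the test table defined below (the index name is only for demonstration): an attempt to use now() in the predicate fails outright, because now() is merely STABLE, not IMMUTABLE:

CREATE INDEX mytbl_recent_idx ON mytbl (value)
WHERE  creation_date >= now() - interval '1 month';
-- ERROR:  functions in index predicate must be marked IMMUTABLE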

If you should only need the current (or last) month, index recreation as well as the query itself become quite a bit simpler!

I'll go with your definition "not older than one month" for the rest of this answer. I have had to deal with situations like this before. The following solution worked best for me:

Base your index condition on a fixed timestamp and use the same timestamp in your queries to convince the query planner it can use the partial index. This kind of partial index will stay useful over an extended period of time; only its effectiveness deteriorates as new rows are added and older rows drop out of your time frame. The index will return more and more false positives that an additional WHERE clause has to eliminate from your query. Recreate the index to update its condition.

Given your test table:

CREATE TABLE mytbl (
   value text
  ,creation_date timestamp
);

Create a very simple IMMUTABLE SQL function:

CREATE OR REPLACE FUNCTION f_mytbl_start_ts()
  RETURNS timestamp AS
$func$
SELECT '2013-01-01 0:0'::timestamp
$func$ LANGUAGE sql IMMUTABLE;

Use the function in the condition of the partial index:

CREATE INDEX mytbl_start_ts_idx ON mytbl(value, creation_date)
WHERE (creation_date >= f_mytbl_start_ts());

value comes first. Explanation in this related answer on dba.SE.
Input from @Igor in the comments made me improve my answer: a partial multicolumn index should make ruling out false positives from the partial index faster. It is in the nature of the index condition that it grows increasingly outdated (but still a lot better than not having it).
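For comparison, a partial single-column index would be used as well, but every false positive would cost a visit to the heap to re-check creation_date, while the multicolumn index can rule them out in the index itself:

-- single-column alternative, shown for comparison only
CREATE INDEX mytbl_value_partial_idx ON mytbl (value)
WHERE  creation_date >= f_mytbl_start_ts();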

Query

A query like this will make use of the index and should be perfectly fast:

SELECT value
FROM   mytbl
WHERE  creation_date >= f_mytbl_start_ts()            -- !
AND    creation_date >= (now() - interval '1 month')
AND    value = 'foo';

The only purpose of the seemingly redundant WHERE clause: creation_date >= f_mytbl_start_ts() is to make the query planner use the partial index.
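If in doubt, check the query plan; it should show a scan on mytbl_start_ts_idx (possibly even an index-only scan):

EXPLAIN (ANALYZE, BUFFERS)
SELECT value
FROM   mytbl
WHERE  creation_date >= f_mytbl_start_ts()
AND    creation_date >= (now() - interval '1 month')
AND    value = 'foo';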

You can drop and recreate function and index manually.
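A sketch of that monthly routine, moving the cutoff to (say) the start of the previous month; note the same order of operations as in the trigger function further down:

BEGIN;
DROP INDEX IF EXISTS mytbl_start_ts_idx;

CREATE OR REPLACE FUNCTION f_mytbl_start_ts()
  RETURNS timestamp AS
$func$
SELECT '2013-03-01 0:0'::timestamp  -- new cutoff
$func$ LANGUAGE sql IMMUTABLE;

CREATE INDEX mytbl_start_ts_idx ON mytbl (value, creation_date)
WHERE  creation_date >= f_mytbl_start_ts();
COMMIT;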

Full automation

Or you can automate it in a bigger scheme with possibly lots of similar tables:

Disclaimer: This is advanced stuff. You need to know what you are doing and consider user privileges, possible SQL injection and locking issues with heavy concurrent load!

This "steering table" receives a line per table in your regime:

CREATE TABLE idx_control (
   tbl text primary key  -- plain, legal table names!
  ,start_ts timestamp
);

I would put all such meta objects in a separate schema.
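For instance (the schema name is an assumption):

CREATE SCHEMA meta;
ALTER TABLE idx_control SET SCHEMA meta;  -- then reference meta.idx_control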

For our example:

INSERT INTO idx_control(tbl, start_ts)
VALUES ('mytbl', '2013-01-01 0:0');

A "steering table" offers the additional benefit that you have an overview over all such tables and their respective settings in a central place and you can update some or all of them in sync.

Whenever you change start_ts in this table, the following trigger kicks in and takes care of the rest:

Trigger function:

CREATE OR REPLACE FUNCTION trg_idx_control_upaft()
  RETURNS trigger AS
$func$
DECLARE
   _idx  text := NEW.tbl || '_start_ts_idx';
   _func text := 'f_' || NEW.tbl || '_start_ts';
BEGIN

-- Drop old idx
EXECUTE format('DROP INDEX IF EXISTS %I', _idx);

-- Create / change function; Keep placeholder with -infinity for NULL timestamp
EXECUTE format('
CREATE OR REPLACE FUNCTION %I()
  RETURNS timestamp AS
$x$
SELECT %L::timestamp
$x$ LANGUAGE SQL IMMUTABLE', _func, COALESCE(NEW.start_ts, '-infinity'));

-- New index; a NULL timestamp removes the index condition:
IF NEW.start_ts IS NULL THEN 
   EXECUTE format('
   CREATE INDEX  %I ON %I (value, creation_date)', _idx, NEW.tbl);
ELSE
   EXECUTE format('
   CREATE INDEX  %I ON %I (value, creation_date)
   WHERE  creation_date >= %I()', _idx, NEW.tbl, _func);
END IF;

RETURN NULL;

END
$func$ LANGUAGE plpgsql;

Trigger:

CREATE TRIGGER upaft
AFTER UPDATE ON idx_control
FOR EACH ROW
WHEN (OLD.start_ts IS DISTINCT FROM NEW.start_ts)
EXECUTE PROCEDURE trg_idx_control_upaft();

Now, a simple UPDATE on the steering table calibrates index and function:

UPDATE idx_control
SET    start_ts = '2013-03-22 0:0'
WHERE  tbl = 'mytbl';

You can run a cron job or call this manually.
Queries using the index don't change.
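For the cron variant, the job only needs to issue one UPDATE per month; something like this (the cutoff calculation is an assumption):

-- e.g. executed monthly via: psql -d mydb -c '...'
UPDATE idx_control
SET    start_ts = date_trunc('month', now()) - interval '1 month'
WHERE  tbl = 'mytbl';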

-> SQLfiddle.
I updated the fiddle with a small test case of 10k rows to demonstrate that it works. PostgreSQL will even do an index-only scan for my example query. It won't get any faster than this.

answered Oct 13 '22 by Erwin Brandstetter