I need to add up lots of values over date ranges as quickly as possible using PostgreSQL; what's the best method?

Here's a simple example of what I'm trying to do:

CREATE TABLE daily_factors (
    factor_date date,
    factor_value numeric(3,1));

CREATE TABLE customer_date_ranges (
    customer_id int,
    date_from date,
    date_to date);

INSERT INTO
    daily_factors
SELECT
    t.factor_date,
    (random() * 10 + 30)::numeric(3,1)
FROM
    generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);

WITH customer_id AS (
    SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
    SELECT
        customer_id,
        (timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
    FROM
        customer_id)
INSERT INTO
    customer_date_ranges
SELECT
    d.customer_id,
    d.date_from,
    (d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
    date_from d;

So I'm basically making two tables:

  • a list of daily factors, one for every day from 1st Jan 2017 until today's date;
  • a list of 100,000 "customers", each with a date range somewhere between 1st Jan 2017 and today, some long, some short, basically random.

Then I want to add up the factors for each customer in their date range, and take the average value.

SELECT 
    cd.customer_id,
    AVG(df.factor_value) AS average_value
FROM 
    customer_date_ranges cd
    INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
    cd.customer_id;

Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?

The only index I could think of was this one:

CREATE INDEX performance_idx ON daily_factors (factor_date);

It makes only a small difference to the execution time: running this locally I see around 32 seconds with no index and around 28 seconds with it.

I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:

  • instead of using daily factors I could largely get away with monthly ones, but that brings the added complexity of whole months and partial months, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020"; it doesn't seem worth it for the complexity it adds;
  • I could pre-calculate every average I will ever need, but with 1,503 factors (and that increases with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero date ranges and that my maths is right). My real-world system also has an extra level of complexity, a second identifier with 20 possible values, so this would mean pre-calculating c.20 million numbers, and the total to store grows quadratically with the number of days (there's a rough sketch of what that table would look like just after this list);
  • I could take this work out of the database and do it in code (in memory), since it seems like a relational database might not be the best fit here?
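
Just to make the pre-calculation idea concrete, here's roughly what that table would look like (precalculated_averages is just a placeholder name I've made up); building it means aggregating every (date_from, date_to) pair, which is itself far more work than the original query:

CREATE TABLE precalculated_averages AS
SELECT
    f.factor_date AS date_from,
    t.factor_date AS date_to,
    AVG(df.factor_value) AS average_value
FROM
    daily_factors f
    -- every strictly later date paired with every earlier one: n * (n - 1) / 2 rows
    INNER JOIN daily_factors t ON t.factor_date > f.factor_date
    -- aggregate the factors that fall inside each candidate range
    INNER JOIN daily_factors df ON df.factor_date BETWEEN f.factor_date AND t.factor_date
GROUP BY
    f.factor_date, t.factor_date;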

Any other suggestions?

asked by Richard Hansell

1 Answer

The classic way to deal with this is to store running sums of factor_value, not (or in addition to) the individual values. Then you just look up the running sum at the two end points (actually at the end, and one day before the start) and take the difference. And of course divide by the count to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
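
Here's a minimal sketch of what that could look like in PostgreSQL; the table and column names (daily_factor_running, running_sum, running_count) are just placeholders of mine, and it assumes the factor table has exactly one row per day with no gaps:

CREATE TABLE daily_factor_running AS
SELECT
    factor_date,
    -- cumulative sum and count of every factor up to and including this date
    SUM(factor_value) OVER (ORDER BY factor_date) AS running_sum,
    COUNT(*) OVER (ORDER BY factor_date) AS running_count
FROM
    daily_factors;

CREATE UNIQUE INDEX ON daily_factor_running (factor_date);

SELECT
    cd.customer_id,
    -- sum over the range = running sum at date_to minus running sum the day
    -- before date_from; same for the count, and sum / count is the average
    (end_day.running_sum - COALESCE(prev_day.running_sum, 0))
        / (end_day.running_count - COALESCE(prev_day.running_count, 0)) AS average_value
FROM
    customer_date_ranges cd
    INNER JOIN daily_factor_running end_day ON end_day.factor_date = cd.date_to
    LEFT JOIN daily_factor_running prev_day ON prev_day.factor_date = cd.date_from - 1;

Each customer now costs two indexed single-row lookups instead of a scan over every day in the range, so the cost no longer depends on how long the range is. If there could be gaps in the dates, the exact-match joins would need to become "latest date on or before the boundary" lookups instead.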

answered by jjanes