
Recursive SQL statement (PostgreSQL 9.1.4)

PostgreSQL 9.1

Business situation

Every month, a new batch of accounts is handed over to a specific process. Each batch can be described by month, number of accounts, and total balance of accounts. The goal of the process is to recover some of the balance back from customers. Each batch is then tracked separately on a monthly basis (amount recovered in each month since the batch was transferred to the process).

Goal

My goal is to predict what amount will be recovered in the future.

Data definition

create table vintage_data (
    granularity date,           /* Month when accounts entered the process */
    distance_in_months integer, /* Distance in months from the date the accounts entered the process */
    entry_accounts integer,     /* Number of accounts that entered the process in a given month */
    entry_amount numeric,       /* Total amount for accounts that entered the process in a given month */
    recovery_amount numeric     /* Amount recovered in the Nth month on accounts that entered the process in a given month */
);

Sample data

insert into vintage_data values('2012-01-31',1,200,100000,1000);
insert into vintage_data values('2012-01-31',2,200,100000,2000);
insert into vintage_data values('2012-01-31',3,200,100000,3000);
insert into vintage_data values('2012-01-31',4,200,100000,3500);
insert into vintage_data values('2012-01-31',5,200,100000,3400);
insert into vintage_data values('2012-01-31',6,200,100000,3300);
insert into vintage_data values('2012-02-28',1,250,150000,1200);
insert into vintage_data values('2012-02-28',2,250,150000,1600);
insert into vintage_data values('2012-02-28',3,250,150000,1800);
insert into vintage_data values('2012-02-28',4,250,150000,1200);
insert into vintage_data values('2012-02-28',5,250,150000,1600);
insert into vintage_data values('2012-03-31',1,200,90000,1300);
insert into vintage_data values('2012-03-31',2,200,90000,1200);
insert into vintage_data values('2012-03-31',3,200,90000,1400);
insert into vintage_data values('2012-03-31',4,200,90000,1000);
insert into vintage_data values('2012-04-30',1,300,180000,1600);
insert into vintage_data values('2012-04-30',2,300,180000,1500);
insert into vintage_data values('2012-04-30',3,300,180000,4000);
insert into vintage_data values('2012-05-31',1,400,225000,2200);
insert into vintage_data values('2012-05-31',2,400,225000,6000);
insert into vintage_data values('2012-06-30',1,100,60000,1000);

Calculation Process

You can imagine the data as a triangular matrix (X values are to be forecasted):

distance_in_months                       1      2     3       4      5      6
granularity entry_accounts  entry_amount
2012-01-31  200             100000       1000   2000   3000   3500   3400   3300
2012-02-28  250             150000       1200   1600   1800   1200   1600   (X-1)
2012-03-31  200              90000       1300   1200   1400   1000   (X0)   (X4)
2012-04-30  300             180000       1600   1500   4000   (X1)   (X5)   (X8)
2012-05-31  400             225000       2200   6000   (X2)   (X6)   (X9)   (X11)
2012-06-30  100              60000       1000   (X3)   (X7)   (X10)  (X12)  (X13)
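
The same layout can be produced straight from the sample data with a plain conditional-aggregation pivot (missing cells come back as NULL; the first answer below builds the same matrix with crosstab() from the tablefunc module instead):

SELECT granularity, entry_accounts, entry_amount
     , sum(CASE WHEN distance_in_months = 1 THEN recovery_amount END) AS d1
     , sum(CASE WHEN distance_in_months = 2 THEN recovery_amount END) AS d2
     , sum(CASE WHEN distance_in_months = 3 THEN recovery_amount END) AS d3
     , sum(CASE WHEN distance_in_months = 4 THEN recovery_amount END) AS d4
     , sum(CASE WHEN distance_in_months = 5 THEN recovery_amount END) AS d5
     , sum(CASE WHEN distance_in_months = 6 THEN recovery_amount END) AS d6
FROM   vintage_data
GROUP  BY granularity, entry_accounts, entry_amount
ORDER  BY granularity;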

Algorithm

My goal is to forecast all the missing points (the future). To illustrate the process, this is the calculation for point X1:

1) Get row totals for the previous three months using distance up to 4:

2012-01-31  1000+2000+3000+3500=9500 (d4m3)
2012-02-28  1200+1600+1800+1200=5800 (d4m2)
2012-03-31  1300+1200+1400+1000=4900 (d4m1)

2) Get row totals for the previous three months using distance up to 3:

2012-01-31  1000+2000+3000=6000 (d3m3)
2012-02-28  1200+1600+1800=4600 (d3m2)
2012-03-31  1300+1200+1400=3800 (d3m1)

3) Calculate weighted average running rate for distance 3 and distance 4 (weighted by entry_amount):

(d4m3+d4m2+d4m1)/(100000+150000+90000) = (9500+5800+4900)/(100000+150000+90000) = 20200/340000 = 0.0594
(d3m3+d3m2+d3m1)/(100000+150000+90000) = (6000+4600+3800)/(100000+150000+90000) = 14400/340000 = 0.0424

4) Calculate the change between distance 3 and distance 4 (the common entry_amount denominator cancels out):

((d4m3+d4m2+d4m1)/(100000+150000+90000))/((d3m3+d3m2+d3m1)/(100000+150000+90000)) =
= (20200/340000)/(14400/340000) = 20200/14400 =
= 0.0594/0.0424 = 1.403 (PredictionRateForX1)

5) Calculate row totals for the predicted month using distance up to 3:

2012-04-30  1600+1500+4000=7100

6) Calculate the rate for the predicted month using entry_amount:

7100/180000 = 0.0394

7) Calculate the predicted rate for X1:

0.0394 * PredictionRateForX1 = 0.05534

8) Calculate the amount for X1:

(0.05534-0.0394)*180000 = 2869.2 (without intermediate rounding: 7100 * (20200/14400 - 1) ≈ 2859.7)
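
For reference, steps 1 through 8 collapse into a single plain query. A minimal sketch for X1 only, assuming the table definition above (the CTE names hist, rate and cur are illustrative):

WITH hist AS (   -- steps 1-2: row totals for the three previous months
    SELECT granularity
         , sum(CASE WHEN distance_in_months <= 4 THEN recovery_amount END) AS d4_total
         , sum(CASE WHEN distance_in_months <= 3 THEN recovery_amount END) AS d3_total
    FROM   vintage_data
    WHERE  granularity IN ('2012-01-31', '2012-02-28', '2012-03-31')
    GROUP  BY granularity
), rate AS (     -- steps 3-4: the common denominator cancels out
    SELECT sum(d4_total) / sum(d3_total) AS prediction_rate   -- 20200/14400
    FROM   hist
), cur AS (      -- steps 5-6: running rate of the predicted month up to distance 3
    SELECT sum(recovery_amount) / max(entry_amount) AS rate3  -- 7100/180000
         , max(entry_amount) AS entry_amount
    FROM   vintage_data
    WHERE  granularity = '2012-04-30'
)
SELECT (cur.rate3 * rate.prediction_rate - cur.rate3) * cur.entry_amount AS x1
FROM   cur, rate;   -- steps 7-8: ~2859.7 for the sample data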

Problem

The problem is how to calculate the rest of the matrix (from X-1 to X13) using a SQL statement. It is obvious that this will require some sort of recursive algorithm.

asked Jul 19 '12 by Tomas Greif


2 Answers

It's a big task; split it up to make it more manageable. I would put it all in a plpgsql function with RETURNS TABLE:

  1. Create a temporary table for your "Calculation Process" matrix using a crosstab query. You need the additional tablefunc module installed for that. Run once per database:

    CREATE EXTENSION tablefunc;
    
  2. Update the temp table field by field.

  3. Return the table.

The following demo is fully functional and tested with PostgreSQL 9.1.4. Building on the table definition provided in the question:

-- DROP FUNCTION f_forcast();

CREATE OR REPLACE FUNCTION f_forcast()
  RETURNS TABLE (
  granularity date
 ,entry_accounts numeric
 ,entry_amount numeric
 ,d1 numeric
 ,d2 numeric
 ,d3 numeric
 ,d4 numeric
 ,d5 numeric
 ,d6 numeric) AS
$BODY$
BEGIN

--== Create temp table with result of crosstab() ==--

CREATE TEMP TABLE matrix ON COMMIT DROP AS
SELECT *
FROM   crosstab (
        'SELECT granularity, entry_accounts, entry_amount
               ,distance_in_months, recovery_amount
         FROM   vintage_data
         ORDER  BY 1, 2',

        'SELECT DISTINCT distance_in_months
         FROM   vintage_data
         ORDER  BY 1')
AS tbl (
  granularity date
 ,entry_accounts numeric
 ,entry_amount numeric
 ,d1 numeric
 ,d2 numeric
 ,d3 numeric
 ,d4 numeric
 ,d5 numeric
 ,d6 numeric
 );

ANALYZE matrix; -- update statistics to help calculations


--== Calculations ==--

-- I implemented the first calculation for X1 and leave the rest to you.
-- Can probably be generalized in a loop or even a single statement.

UPDATE matrix m
SET    d4 = (
    SELECT (sum(x.d1) + sum(x.d2) + sum(x.d3) + sum(x.d4))
            /(sum(x.d1) + sum(x.d2) + sum(x.d3)) - 1
            -- removed redundant sum(entry_amount) from equation
    FROM  (
        SELECT *
        FROM   matrix a
        WHERE  a.granularity < m.granularity
        ORDER  BY a.granularity DESC
        LIMIT  3
        ) x
    ) * (m.d1 + m.d2 + m.d3)
WHERE m.granularity = '2012-04-30';

-- Next, update X2 ...


--== Return results ==--

RETURN QUERY
TABLE  matrix
ORDER  BY 1;

END;
$BODY$ LANGUAGE plpgsql;

Call:

SELECT * FROM f_forcast();

I have simplified quite a bit, removing some redundant steps in the calculation.
The solution employs a variety of advanced techniques. You need to know your way around PostgreSQL to work with this.
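
The comment inside the function notes the calculation "can probably be generalized in a loop or even a single statement". One possible shape of that generalization, as an untested sketch: a dynamic-SQL helper (the name f_fill_cell is mine, not part of the answer) that fills a single cell of the temp table matrix, so it must be called from inside f_forcast(), where that table exists:

-- Hypothetical helper: fill column d<n> of one month in the temp table "matrix"
CREATE OR REPLACE FUNCTION f_fill_cell(_granularity date, _d int)
  RETURNS void AS
$BODY$
DECLARE
   _sum_x text;  -- 'sum(x.d1) + ... + sum(x.d<n-1>)'
   _cur_m text;  -- 'm.d1 + ... + m.d<n-1>'
BEGIN
   SELECT string_agg('sum(x.d' || g || ')', ' + ')
        , string_agg('m.d' || g, ' + ')
   INTO   _sum_x, _cur_m
   FROM   generate_series(1, _d - 1) g;

   -- Same statement as the d4 UPDATE above, with the column lists built dynamically
   EXECUTE 'UPDATE matrix m
            SET    d' || _d || ' = (
                SELECT (' || _sum_x || ' + sum(x.d' || _d || '))
                      /(' || _sum_x || ') - 1
                FROM  (
                    SELECT *
                    FROM   matrix a
                    WHERE  a.granularity < m.granularity
                    ORDER  BY a.granularity DESC
                    LIMIT  3
                    ) x
                ) * (' || _cur_m || ')
            WHERE  m.granularity = $1'
   USING  _granularity;
END
$BODY$ LANGUAGE plpgsql;

A driver loop would call it in dependency order, distance ascending and months ascending within each distance, since a predicted cell feeds the predictions of later months at the same distance; f_fill_cell('2012-04-30', 4) reproduces the X1 update.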

answered Nov 01 '22 by Erwin Brandstetter


        --
        -- Rank the dates, and also fetch the fields that seem to depend on them.
        -- (This should have been done in the data model.)
        --
CREATE VIEW date_rank AS (
        SELECT uniq.granularity, uniq.entry_accounts, uniq.entry_amount
        , row_number() OVER(ORDER BY uniq.granularity) AS zrank -- deterministic rank by date
        FROM ( SELECT DISTINCT granularity, entry_accounts, entry_amount FROM vintage_data)
             AS uniq
        );

-- SELECT * FROM date_rank ORDER BY granularity;
        --
        -- transform to an x*y matrix, avoiding the date key and the slack columns
        --
CREATE VIEW matrix_data AS (
        SELECT vd.distance_in_months AS xxx
        , dr.zrank AS yyy
        , vd.recovery_amount AS val
        FROM vintage_data vd
        JOIN date_rank dr ON dr.granularity = vd.granularity
        );
-- SELECT * FROM matrix_data;

        --
        -- In order to perform the reversed transformation:
        -- make the view insertable.
        -- INSERTS to matrix_data will percolate back into the vintage_data table
        -- (don't try this at home ;-)
        --
CREATE RULE magic_by_the_plasser AS
        ON INSERT TO matrix_data
        DO INSTEAD (
        INSERT INTO vintage_data (granularity,distance_in_months,entry_accounts,entry_amount,recovery_amount)
        SELECT dr.granularity, new.xxx, dr.entry_accounts, dr.entry_amount, new.val
        FROM date_rank dr
        WHERE dr.zrank = new.yyy
                ;
        );

        --
        -- This CTE creates the weights for a Pascal-triangle
        --
-- EXPLAIN -- ANALYZE
WITH RECURSIVE pascal AS (
        WITH empty AS (
                --
                -- "cart" is a cathesian product of X*Y
                -- its function is similar to a "calendar table":
                -- filling in the missing X,Y pairs, making the matrix "square".
                -- (well: rectangular, but in the given case nX==nY)
                --
                WITH cart AS (
                        WITH mmx AS (
                                WITH xx AS ( SELECT MIN(xxx) AS x0 , MAX(xxx) AS x1 FROM matrix_data)
                                SELECT generate_series(xx.x0,xx.x1) AS xxx
                                FROM xx
                                )
                        , mmy AS (
                                WITH yy AS ( SELECT MIN(yyy) AS y0 , MAX(yyy) AS y1 FROM matrix_data)
                                SELECT generate_series(yy.y0,yy.y1) AS yyy
                                FROM yy
                                )
                        SELECT * FROM mmx
                        JOIN mmy ON (1=1) -- Cartesian product here!
                        )
                --
                -- The (x,y) pairs that are not present in the current matrix
                --
                SELECT * FROM cart ca
                WHERE NOT EXISTS (
                        SELECT *
                        FROM matrix_data nx
                        WHERE nx.xxx = ca.xxx
                        AND nx.yyy = ca.yyy
                        )
                )
        SELECT md.yyy AS src_y
                , md.xxx AS src_x
                , md.yyy AS dst_y
                , md.xxx AS dst_x
                -- The filled-in matrix cells have weight 1
                , 1::numeric AS weight
        FROM matrix_data md
        UNION ALL
        SELECT pa.src_y AS src_y
                , pa.src_x AS src_x
                , em.yyy AS dst_y
                , em.xxx AS dst_x
                -- the derived matrix cells inherit weight/2 from both their parents
                , (pa.weight/2) AS weight
        FROM pascal pa
        JOIN empty em
                ON ( em.yyy = pa.dst_y+1 AND em.xxx = pa.dst_x)
                OR ( em.yyy = pa.dst_y AND em.xxx = pa.dst_x+1 )
        )
INSERT INTO matrix_data(yyy,xxx,val)
SELECT pa.dst_y,pa.dst_x
        ,SUM(ma.val*pa.weight)
FROM pascal pa
JOIN matrix_data ma ON pa.src_y = ma.yyy AND pa.src_x = ma.xxx
        -- avoid the filled-in matrix cells (which map to themselves)
WHERE NOT (pa.src_y = pa.dst_y AND pa.src_x = pa.dst_x)
GROUP BY pa.dst_y,pa.dst_x
        ;
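
        --
        -- Worked example of the weights, for illustration: the empty cell
        -- (y=6, x=2) has exactly two filled neighbours, (y=5, x=2) = 6000 and
        -- (y=6, x=1) = 1000, each reaching it on one path of weight 1/2, so it
        -- receives (6000 + 1000) / 2 = 3500, i.e. the 2012-06-30 / distance 2
        -- row in the result below.
        --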

        --
        -- This will also get rid of the matrix_data view and the rule.
        --
DROP VIEW date_rank CASCADE;
-- SELECT * FROM matrix_data ;

SELECT * FROM vintage_data ORDER BY granularity, distance_in_months;

RESULT:

NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "vintage_data_pkey" for table "vintage_data"
CREATE TABLE
NOTICE:  ALTER TABLE / ADD UNIQUE will create implicit index "mx_xy" for table "vintage_data"
ALTER TABLE
INSERT 0 21
VACUUM
CREATE VIEW
CREATE VIEW
CREATE RULE
INSERT 0 15
NOTICE:  drop cascades to view matrix_data
DROP VIEW
 granularity | distance_in_months | entry_accounts | entry_amount |      recovery_amount      
-------------+--------------------+----------------+--------------+---------------------------
 2012-01-31  |                  1 |            200 |       100000 |                      1000
 2012-01-31  |                  2 |            200 |       100000 |                      2000
 2012-01-31  |                  3 |            200 |       100000 |                      3000
 2012-01-31  |                  4 |            200 |       100000 |                      3500
 2012-01-31  |                  5 |            200 |       100000 |                      3400
 2012-01-31  |                  6 |            200 |       100000 |                      3300
 2012-02-28  |                  1 |            250 |       150000 |                      1200
 2012-02-28  |                  2 |            250 |       150000 |                      1600
 2012-02-28  |                  3 |            250 |       150000 |                      1800
 2012-02-28  |                  4 |            250 |       150000 |                      1200
 2012-02-28  |                  5 |            250 |       150000 |                      1600
 2012-02-28  |                  6 |            250 |       150000 | 2381.25000000000000000000
 2012-03-31  |                  1 |            200 |        90000 |                      1300
 2012-03-31  |                  2 |            200 |        90000 |                      1200
 2012-03-31  |                  3 |            200 |        90000 |                      1400
 2012-03-31  |                  4 |            200 |        90000 |                      1000
 2012-03-31  |                  5 |            200 |        90000 | 2200.00000000000000000000
 2012-03-31  |                  6 |            200 |        90000 | 2750.00000000000000000000
 2012-04-30  |                  1 |            300 |       180000 |                      1600
 2012-04-30  |                  2 |            300 |       180000 |                      1500
 2012-04-30  |                  3 |            300 |       180000 |                      4000
 2012-04-30  |                  4 |            300 |       180000 | 2500.00000000000000000000
 2012-04-30  |                  5 |            300 |       180000 | 2350.00000000000000000000
 2012-04-30  |                  6 |            300 |       180000 | 2550.00000000000000000000
 2012-05-31  |                  1 |            400 |       225000 |                      2200
 2012-05-31  |                  2 |            400 |       225000 |                      6000
 2012-05-31  |                  3 |            400 |       225000 | 5000.00000000000000000000
 2012-05-31  |                  4 |            400 |       225000 | 3750.00000000000000000000
 2012-05-31  |                  5 |            400 |       225000 | 3050.00000000000000000000
 2012-05-31  |                  6 |            400 |       225000 | 2800.00000000000000000000
 2012-06-30  |                  1 |            100 |        60000 |                      1000
 2012-06-30  |                  2 |            100 |        60000 | 3500.00000000000000000000
 2012-06-30  |                  3 |            100 |        60000 | 4250.00000000000000000000
 2012-06-30  |                  4 |            100 |        60000 | 4000.00000000000000000000
 2012-06-30  |                  5 |            100 |        60000 | 3525.00000000000000000000
 2012-06-30  |                  6 |            100 |        60000 | 3162.50000000000000000000
(36 rows)

answered Nov 01 '22 by wildplasser