I have an updates table in Postgres 9.4.5 like this:
goal_id | created_at | status
--------+------------+-------
      1 | 2016-01-01 | green
      1 | 2016-01-02 | red
      2 | 2016-01-02 | amber
And a goals table like this:
id | company_id
---+-----------
 1 | 1
 2 | 2
I want to create a chart for each company that shows the state of all of their goals, per week.
I imagine this would require generating a series of the past 8 weeks, finding the most recent update for each goal that came before that week, and then counting the different statuses of the updates found.
What I have so far:
SELECT EXTRACT(year FROM generate_series) AS year,
       EXTRACT(week FROM generate_series) AS week,
       u.company_id,
       COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
       COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
       COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 MONTHS', NOW(), '1 week')
LEFT OUTER JOIN (
    SELECT DISTINCT ON (year, week)
           goals.company_id,
           updates.status,
           EXTRACT(week FROM updates.created_at) AS week,
           EXTRACT(year FROM updates.created_at) AS year,
           updates.created_at
    FROM updates
    JOIN goals ON goals.id = updates.goal_id
    ORDER BY year, week, updates.created_at DESC
) u ON u.week = week AND u.year = year
GROUP BY 1, 2, 3
But this has two problems. First, the join on u isn't working as I thought it would: it seems to join on every row returned from the inner query. Second, this only selects the most recent update from within that week; it should grab the most recent update from before that week if there were none during the week itself.
This is some pretty complicated SQL and I'd love some input on how to pull it off.
The goals table has around 1,000 rows at the moment and is growing by about 100 a week:
Table "goals"
Column | Type | Modifiers
-----------------+-----------------------------+-----------------------------------------------------------
id | integer | not null default nextval('goals_id_seq'::regclass)
company_id | integer | not null
name | text | not null
created_at | timestamp without time zone | not null default timezone('utc'::text, now())
updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
"goals_pkey" PRIMARY KEY, btree (id)
"entity_goals_company_id_fkey" btree (company_id)
Foreign-key constraints:
"goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT
The updates table has around 1,000 rows and is growing by about 100 a week:
Table "updates"
Column | Type | Modifiers
------------+-----------------------------+------------------------------------------------------------------
id | integer | not null default nextval('updates_id_seq'::regclass)
status | entity.goalstatus | not null
goal_id | integer | not null
created_at | timestamp without time zone | not null default timezone('utc'::text, now())
updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
"goal_updates_pkey" PRIMARY KEY, btree (id)
"entity_goal_updates_goal_id_fkey" btree (goal_id)
Foreign-key constraints:
"updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE
And the entity.goalstatus enum type:
Schema | Name | Internal name | Size | Elements | Access privileges | Description
--------+-------------------+---------------+------+----------+-------------------+-------------
entity | entity.goalstatus | goalstatus | 4 | green +| |
| | | | amber +| |
| | | | red | |
You need one data item per week and goal (before aggregating counts per company). That's a plain CROSS JOIN between generate_series() and goals. The (possibly) expensive part is to get the current state from updates for each. Like @Paul already suggested, a LATERAL join seems like the best tool. Do it only for updates, though, and use a faster technique with LIMIT 1.
And simplify date handling with date_trunc().
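A quick illustration of the date_trunc() part (the timestamp is arbitrary): it snaps any timestamp to the Monday 00:00 that starts its ISO week, which is what makes the generate_series() boundaries line up cleanly:

```sql
SELECT date_trunc('week', timestamp '2016-01-07 15:30:00');
-- 2016-01-04 00:00:00  (2016-01-07 was a Thursday; its ISO week began Monday, 2016-01-04)
```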
SELECT w_start
, g.company_id
, count(*) FILTER (WHERE u.status = 'green') AS green_count
, count(*) FILTER (WHERE u.status = 'amber') AS amber_count
, count(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(date_trunc('week', NOW() - interval '2 months')
, date_trunc('week', NOW())
, interval '1 week') w_start
CROSS JOIN goals g
LEFT JOIN LATERAL (
SELECT status
FROM updates
WHERE goal_id = g.id
AND created_at < w_start
ORDER BY created_at DESC
LIMIT 1
) u ON true
GROUP BY w_start, g.company_id
ORDER BY w_start, g.company_id;
To make this fast you need a multicolumn index:
CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);
Descending order for created_at is best, but not strictly necessary: Postgres can scan indexes backwards almost exactly as fast. (That does not hold when the sort order is inverted for only some of multiple columns, though.) Index the columns in that order. The third column, status, is only appended to allow fast index-only scans on updates.
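To illustrate why mixed directions are the exception (using a hypothetical table t with columns a and b, not part of your schema):

```sql
CREATE TABLE t (a int, b int);
CREATE INDEX t_a_b_idx ON t (a, b);

SELECT * FROM t ORDER BY a, b;            -- forward index scan
SELECT * FROM t ORDER BY a DESC, b DESC;  -- backward index scan, nearly as fast
SELECT * FROM t ORDER BY a, b DESC;       -- mixed directions: this index can't provide the order
```

That last case is why declaring created_at DESC in the multicolumn index is the exact match for the ORDER BY created_at DESC ... LIMIT 1 lookup.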
1k goals for 9 weeks (your interval of 2 months overlaps with at least 9 weeks) only require 9k index look-ups against the second table of only 1k rows. For small tables like this, performance shouldn't be much of a problem. But once you have a couple of thousand more rows in each table, performance will deteriorate with sequential scans.
w_start represents the start of each week, so counts are as of the start of the week. You can still extract year and week (or whatever else represents your week), if you insist:
EXTRACT(isoyear from w_start) AS year
, EXTRACT(week from w_start) AS week
Best with isoyear, like @Paul explained.
This seems like a good use for LATERAL joins:
SELECT EXTRACT(ISOYEAR FROM s) AS year,
EXTRACT(WEEK FROM s) AS week,
u.company_id,
COUNT(u.goal_id) FILTER (WHERE u.status = 'green') AS green_count,
COUNT(u.goal_id) FILTER (WHERE u.status = 'amber') AS amber_count,
COUNT(u.goal_id) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(w)
LEFT OUTER JOIN LATERAL (
SELECT DISTINCT ON (g.company_id, u2.goal_id) g.company_id, u2.goal_id, u2.status
FROM updates u2
INNER JOIN goals g
ON g.id = u2.goal_id
WHERE u2.created_at <= s.w
ORDER BY g.company_id, u2.goal_id, u2.created_at DESC
) u
ON true
WHERE u.company_id IS NOT NULL
GROUP BY year, week, u.company_id
ORDER BY u.company_id, year, week
;
Btw, I am extracting ISOYEAR, not YEAR, to ensure I get sensible results around the beginning of January. For instance, EXTRACT(YEAR FROM '2016-01-01 08:49:56.734556-08') is 2016 but EXTRACT(WEEK FROM '2016-01-01 08:49:56.734556-08') is 53!
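A quick way to see all three values side by side (the date literal is just for illustration):

```sql
SELECT EXTRACT(YEAR    FROM date '2016-01-01') AS year,     -- 2016
       EXTRACT(ISOYEAR FROM date '2016-01-01') AS isoyear,  -- 2015
       EXTRACT(WEEK    FROM date '2016-01-01') AS week;     -- 53
```

ISO year 2015 paired with week 53 is self-consistent; calendar year 2016 paired with week 53 is not.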
EDIT: You should test on your real data, but I feel like this ought to be faster:
SELECT year,
week,
company_id,
COUNT(goal_id) FILTER (WHERE last_status = 'green') AS green_count,
COUNT(goal_id) FILTER (WHERE last_status = 'amber') AS amber_count,
COUNT(goal_id) FILTER (WHERE last_status = 'red') AS red_count
FROM (
SELECT EXTRACT(ISOYEAR FROM s) AS year,
EXTRACT(WEEK FROM s) AS week,
u.company_id,
u.goal_id,
(array_agg(u.status ORDER BY u.created_at DESC))[1] AS last_status
FROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(t)
LEFT OUTER JOIN (
SELECT g.company_id, u2.goal_id, u2.created_at, u2.status
FROM updates u2
INNER JOIN goals g
ON g.id = u2.goal_id
) u
ON s.t >= u.created_at
WHERE u.company_id IS NOT NULL
GROUP BY year, week, u.company_id, u.goal_id
) x
GROUP BY year, week, company_id
ORDER BY company_id, year, week
;
Still no window functions though. :-) Also, you can speed it up a bit more by replacing (array_agg(...))[1] with a real first function. You'll have to define that yourself, but there are implementations on the Postgres wiki that are easy to Google for.
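For reference, a sketch along the lines of the wiki version (check the wiki for the current definition before relying on it):

```sql
-- State transition function: always keep the first value seen.
-- STRICT means NULL inputs are skipped when initializing the state;
-- IMMUTABLE lets the planner treat calls as pure.
CREATE OR REPLACE FUNCTION first_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT $1 $$;

CREATE AGGREGATE first (anyelement) (
  sfunc = first_agg,
  stype = anyelement
);
```

With that in place, first(u.status ORDER BY u.created_at DESC) replaces (array_agg(u.status ORDER BY u.created_at DESC))[1] in the query above.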