I have a large dataset where I want to sum a count where records have overlapping time. For example, given the data
[
{"id": 1, "name": 'A', "start": '2018-12-10 00:00:00', "end": '2018-12-20 00:00:00', count: 34},
{"id": 2, "name": 'B', "start": '2018-12-16 00:00:00', "end": '2018-12-27 00:00:00', count: 19},
{"id": 3, "name": 'C', "start": '2018-12-16 00:00:00', "end": '2018-12-20 00:00:00', count: 56},
{"id": 4, "name": 'D', "start": '2018-12-25 00:00:00', "end": '2018-12-30 00:00:00', count: 43}
]
You can see there are 2 periods where activities overlap. I want to return the total count of these 'overlaps' based on the activities involved in overlap. So the above would output something like:
[
{start:'2018-12-16', end: '2018-12-20', overlap_ids:[1,2,3], total_count: 109},
{start:'2018-12-25', end: '2018-12-27', overlap_ids:[2,4], total_count: 62},
]
The question is, how to go about generating this via a postgres query? Was looking into generate_series then working out what activity falls into each interval, but thats not quite right as the data is continuous - I really need to identify the exact overlapping time then do a sum on the overlapping activities.
EDIT Have added another example. As @SRack pointed out, since A,B,C overlap, this means B,C A,B and A,C also overlap. This doesn’t matter since the output I’m looking for is an array of date ranges that contain overlapping activities rather than all the unique combinations of overlaps. Also note the dates are timestamps, so will have millisecond precision and won’t necessarily all be at 00:00:00. If it helps, there would probably be a WHERE condition on the total count. For example only want to see results where total count > 100
You can do this by swapping the ranges if necessary up front. Then, you can detect overlap if the second range start is: less than or equal to the first range end (if ranges are inclusive, containing both the start and end times); or. less than (if ranges are inclusive of start and exclusive of end).
To calculate the number of days that overlap in two date ranges, you can use basic date arithmetic, together with the the MIN and MAX functions. Excel dates are just serial numbers, so you can calculate durations by subtracting the earlier date from the later date.
OVERLAPS provides functionally equivalent to the following: EXISTS ( expression INTERSECT expression ) OVERLAPS is one of the Entity SQL set operators. All Entity SQL set operators are evaluated from left to right. For precedence information for the Entity SQL set operators, see EXCEPT.
Postgres date range type There are many different range types in Postgresql and daterange is one of the types that represent the range of date. Let' view the records of employees whose hire date range between 1985-11-21 and 1989-06-02. SELECT * FROM employee WHERE '[1985-11-21, 1989-06-02]'::daterange @> hire_date.
demo:db<>fiddle (uses the old data set with the overlapping A-B-part)
Disclaimer: This works for day intervals not for timestamps. The requirement for ts came later.
SELECT
s.acts,
s.sum,
MIN(a.start) as start,
MAX(a.end) as end
FROM (
SELECT DISTINCT ON (acts)
array_agg(name) as acts,
SUM(count)
FROM
activities, generate_series(start, "end", interval '1 day') gs
GROUP BY gs
HAVING cardinality(array_agg(name)) > 1
) s
JOIN activities a
ON a.name = ANY(s.acts)
GROUP BY s.acts, s.sum
generate_series
generates all dates between start and end. So every date an activity exists gets one row with the specific count
HAVING
filters out the dates where only one activity existDISTINCT ON
Here's a version for timestamps:
demo:db<>fiddle
WITH timeslots AS (
SELECT * FROM (
SELECT
tsrange(timepoint, lead(timepoint) OVER (ORDER BY timepoint)),
lead(timepoint) OVER (ORDER BY timepoint) -- 2
FROM (
SELECT
unnest(ARRAY[start, "end"]) as timepoint -- 1
FROM
activities
ORDER BY timepoint
) s
)s WHERE lead IS NOT NULL -- 3
)
SELECT
GREATEST(MAX(start), lower(tsrange)), -- 6
LEAST(MIN("end"), upper(tsrange)),
array_agg(name), -- 5
sum(count)
FROM
timeslots t
JOIN activities a
ON t.tsrange && tsrange(a.start, a.end) -- 4
GROUP BY tsrange
HAVING cardinality(array_agg(name)) > 1
The main idea is to identify possible time slots. So I take every known time (both start and end) and put them into a sorted list. So I can take the first tow known times (17:00 from start A and 18:00 from start B) and check which interval is in it. Then I check it for the 2nd and 3rd, then for 3rd an 4th and so on.
In the first timeslot only A fits. In the second from 18-19 also B is fitting. In the next slot 19-20 also C, from 20 to 20:30 A isn't fitting anymore, only B and C. The next one is 20:30-22 where only B fits, finally 22-23 D is added to B and last but not least only D fits into 23-23:30.
So I take this time list and join it agains the activities table where the intervals intersect. After that its only a grouping by time slot and sum up your count.
unnest
. So I get all times into one column which can be simply orderedtsrange
NULL
value which is interpreted by tsrange
as infinity. So this would create an incredible wrong time slot. So we need to filter this row out.&&
operator checks if two range types overlap.HAVING
clauselower
). E.g. Take the 20-20:30 slot: It begins 20h but neither B nor C has its starting point there. Similar the end time.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With