Given each row that represents a task, with start time and end time, how can I calculate the number of running tasks (i.e. that started and not ended) at the time each task starts (including itself) using a window function with COUNT OVER? Is a window function even the right approach?
Example, given table tasks:
task_id start_time end_time
a 1 10
b 2 5
c 5 15
d 8 13
e 12 20
f 21 30
Calculate running_tasks:
task_id start_time end_time running_tasks
a 1 10 1 # a
b 2 5 2 # a,b
c 5 15 2 # a,c (b has ended)
d 8 13 3 # a,c,d
e 12 20 3 # c,d,e (a has ended)
f 21 30 1 # f (c,d,e have ended)
select task_id,start_time,end_time,running_tasks
from (select task_id,tm,op,start_time,end_time
,sum(op) over
(
order by tm,op
rows unbounded preceding
) as running_tasks
from (select task_id,start_time as tm,1 as op,start_time,end_time
from tasks
union all
select task_id,end_time as tm,-1 as op,start_time,end_time
from tasks
) t
)t
where op = 1
;
You can use a correlated subquery, which in this case is a self-join; no analytic functions are needed. After enabling standard SQL (uncheck "Use Legacy SQL" under "Show Options" in the UI) you can run this example:
WITH tasks AS (
SELECT
task_id,
start_time,
end_time
FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
('a', 1, 10),
('b', 2, 5),
('c', 5, 15),
('d', 8, 13),
('e', 12, 20),
('f', 21, 30)
])
)
SELECT
*,
(SELECT COUNT(*) FROM tasks t2
WHERE t.start_time >= t2.start_time AND
t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With