 

Finding number of concurrent events given start and end times

I have a massive (~10^9) set of events characterized by a start and end time. Given a time, I want to find how many of those events were ongoing at that time.

What sort of data structure would be helpful in this situation? The operations I need to be fast are:

  1. Inserting new events, e.g., {start: 100000 milliseconds, end: 100010 milliseconds}.
  2. Querying the number of concurrent events at a given time.

Update: Someone put a computational geometry flag on this, so I figure I should rephrase this in terms of computational geometry. I have a set of 1-dimensional intervals and I want to calculate how many of those intervals intersect with a given point. Insertion of new intervals must be fast.

asked May 16 '12 by Snowball


3 Answers

You're looking for an interval tree.

  • Construction: O(n log n), where n is the number of intervals
  • Query: O(m+log n), where m is the number of query results and n is the number of intervals
  • Space: O(n)
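For illustration, here is a minimal static (build-once) interval tree in Python that counts how many intervals contain a query point. The class and method names are my own, and this sketch only covers construction and point queries; a balanced, dynamic variant would be needed to also get the fast insertion the question asks for.

```python
import bisect

class IntervalTreeNode:
    """Static interval tree over closed intervals (start, end), start <= end."""

    def __init__(self, intervals):
        # Pick a center point (median of all endpoints) to split around.
        points = sorted(p for iv in intervals for p in iv)
        self.center = points[len(points) // 2]
        mid, left, right = [], [], []
        for s, e in intervals:
            if e < self.center:
                left.append((s, e))       # entirely left of center
            elif s > self.center:
                right.append((s, e))      # entirely right of center
            else:
                mid.append((s, e))        # overlaps the center point
        # Intervals overlapping the center, indexed two ways for fast counting.
        self.starts = sorted(s for s, _ in mid)
        self.ends = sorted(e for _, e in mid)
        self.left = IntervalTreeNode(left) if left else None
        self.right = IntervalTreeNode(right) if right else None

    def count(self, t):
        """Number of stored intervals [s, e] with s <= t <= e."""
        if t == self.center:
            return len(self.starts)       # every mid interval contains center
        if t < self.center:
            # Mid intervals all end >= center > t, so only starts matter.
            n = bisect.bisect_right(self.starts, t)
            return n + (self.left.count(t) if self.left else 0)
        # t > center: mid intervals all start <= center < t, so only ends matter.
        n = len(self.ends) - bisect.bisect_left(self.ends, t)
        return n + (self.right.count(t) if self.right else 0)
```

Because only a count is needed, each node can binary-search its sorted endpoint lists instead of enumerating matches, which drops the O(m) output term from the query.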
answered Sep 19 '22 by tskuzzy


Just to add to the other answers: depending on the length of time and the granularity desired, you could simply use an array of counters. For example, if the length of time is 24 hours and the desired granularity is 1 millisecond, there will be 86,400,000 cells in the array. With one 4-byte int per cell (enough to hold a count of 10^9), that is roughly 350 MB of RAM. Compare that with tree-based solutions, which would take at least (8+8+4+4)×10^9 = 24 GB of RAM for two pointers plus two ints per tree node (since 32 bits of addressable memory is insufficient, you need 64-bit pointers). You can use swap, but this will slow down some queries considerably.

You can also use this solution if you only care about the last 24 hours of data, for example, by using the array as a circular buffer. Besides the limitation on time and granularity, the other downside is that insertion time of an interval is proportional to the length of the interval, so if interval length is unbounded, you could be in trouble. Queries, on the other hand, are a single array lookup.
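As an illustrative sketch of this idea (the class name is mine, and the window size is a constructor parameter so the example stays small), the circular-buffer counter might look like:

```python
class ConcurrencyCounter:
    """Counter array over a fixed time window, used as a circular buffer.
    One cell per time unit (e.g. per millisecond for a 24-hour window)."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.counts = [0] * window_ms  # one counter per time unit

    def insert(self, start_ms, end_ms):
        # O(interval length): bump every cell the event covers.
        # end_ms is inclusive, i.e. an event counts as active at its end time.
        for t in range(start_ms, end_ms + 1):
            self.counts[t % self.window_ms] += 1

    def query(self, t_ms):
        # O(1): a single array lookup.
        return self.counts[t_ms % self.window_ms]
```

Note the trade-off the answer describes: insertion walks the whole interval, while a query is one array read. Reusing the buffer for a sliding window would also require zeroing cells as old data expires, which this sketch omits.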

answered Sep 19 '22 by Martin Hock


(Extending the answers by tskuzzy and Snowball)

A balanced binary search tree makes sense, except that the memory requirements would be excessive for your data set. A B-tree would be much more memory efficient, albeit more complicated unless you can use a library.

Keep two trees, one of start times and one of end times. To insert an event, add the start time to the tree of start times and the end time to the tree of end times. To query the number of active events at time T, search the start-time tree to find out how many start times are less than T, and search the end-time tree to find out how many end times are less than T. Subtract the number of end times from the number of start times, and that's the number of active events.

Insertions and queries should both take O(log N) time.

A few comments:

  • The way you have phrased the question, you only care about the number of active events, not which events were active. This means you do not need to keep track of which start time goes with which end time! This also makes it easier to avoid the "+M" term in the queries cited by previous answers.

  • Be careful about the exact semantics of your query. In particular, does an event count as active at time T if it starts at time T? If it ends at time T? The answers to these questions affect whether you use < or <= in certain places.

  • Do not use a "set" data structure, because you almost certainly want to allow and count duplicates. That is, more than one event might start and/or end at the same time. A set would typically ignore duplicates. What you are looking for instead is a "multiset" (sometimes called a "bag").

  • Many binary search trees do not support "number of elements < T" queries out of the box. But it is easy to add this functionality by storing a subtree size at each node (an "order-statistic" tree).
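The two-multiset approach above can be sketched with sorted Python lists and `bisect` standing in for the trees. Note that `insort` on a list is O(N) per insert because of array shifting; a B-tree or order-statistic tree gets insertion to O(log N), but the counting logic is identical.

```python
import bisect

class EventCounter:
    """Two sorted multisets: all start times and all end times.
    An event counts as active at both its start and its end time here."""

    def __init__(self):
        self.starts = []  # multiset of start times (duplicates allowed)
        self.ends = []    # multiset of end times

    def insert(self, start, end):
        # No need to remember which start pairs with which end.
        bisect.insort(self.starts, start)
        bisect.insort(self.ends, end)

    def active_at(self, t):
        # events that have started (start <= t) ...
        started = bisect.bisect_right(self.starts, t)
        # ... minus events that have already ended (end < t)
        ended = bisect.bisect_left(self.ends, t)
        return started - ended
```

The choice of `bisect_right` versus `bisect_left` is exactly the <= versus < question raised in the comments above; flip either one to change whether an event is active at its own boundary.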

answered Sep 19 '22 by Chris Okasaki