I have a massive (~109) set of events characterized by a start and end time. Given a time, I want to find how many on those events were ongoing at that time.
What sort of data structure would be helpful in this situation? The operations I need to be fast are:
{start: 100000 milliseconds, end: 100010 milliseconds}
.Update: Someone put a computational geometry flag on this, so I figure I should rephrase this in terms of computational geometry. I have a set of 1-dimensional intervals and I want to calculate how many of those intervals intersect with a given point. Insertion of new intervals must be fast.
You're looking for an interval tree.
O(n log n)
, where n
is the number of intervalsO(m+log n)
, where m
is the number of query results and n
is the number of intervalsO(n)
Just to add to the other answers, depending on the length of time and the granularity desired, you could simply have an array of counters. For example, if the length of time is 24 hours and the desired granularity is 1 millisecond, there will be 86,400,000 cells in the array. With one 4 byte int per cell (which is enough to hold 10^9), that will be less than 700 MB of RAM, versus tree-based solutions which would take at least (8+8+4+4)*10^9 = 24 GB of RAM for two pointers plus two ints per tree node (since 32 bits of addressable memory is insufficient, you'd need 64 bits per pointer). You can use swap, but this will slow down some queries considerably.
You can also use this solution if you only care about the last 24 hours of data, for example, by using the array as a circular buffer. Besides the limitation on time and granularity, the other downside is that insertion time of an interval is proportional to the length of the interval, so if interval length is unbounded, you could be in trouble. Queries, on the other hand, are a single array lookup.
(Extending the answers by tskuzzy and Snowball)
A balanced binary search tree makes sense, except that the memory requirements would be excessive for your data set. A B-tree would be much more memory efficient, albeit more complicated unless you can use a library.
Keep two trees, one of start times and one of end times. To insert an event, add the start time to the tree of start times and the end time to the tree of end times. To query the number of active events at time T, search the the start-time tree to find out how many start times are less than T, and search the end-time tree to find out how many end times are less than T. Subtract the number of end times from the number of start times, and that's the number of active events.
Insertions and queries should both take O(log N) time.
A few comments:
The way you have phrased the question, you only care about the number of active events, not which events were active. This means you do not need to keep track of which start time goes with which end time! This also makes it easier to avoid the "+M" term in the queries cited by previous answers.
Be careful about the exact semantics of your query. In particular, does an event count as active at time T if it starts at time T? If it ends at time T? The anwers to these questions affect whether you use < or <= in certain places.
Do not use a "set" data structure, because you almost certainly want to allow and count duplicates. That is, more than one event might start and/or end at the same time. A set would typically ignore duplicates. What you are looking for instead is a "multiset" (sometimes called a "bag").
Many binary search trees do not support "number of elements < T" queries out of the box. But it is easy to add this functionality by storing a size at each node.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With