Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to split overlapping date ranges

I have date range data in SQL DB table that has these three (only relevant) columns:

  • ID (int identity)
  • RangeFrom (date only)
  • RangeTo (date only)

For any given date range, there may be an arbitrary number of records that may overlap (completely or partially).

Conditions

  1. Every record with higher ID (newer record) takes precedence over older records that it may overlap (fully or partially)
  2. Ranges are at least 1 day long (RangeFrom and RangeTo differ by one day)

So for a given date range (not longer than ie. 5 years) I have to

  1. get all range records that fall into this range (either fully or partially)
  2. split these overlaps into non-overlapping ranges
  3. return these new non overlapping ranges

My take on it

Since there's a lot of complex data related to these ranges (lots of joins etc etc) and since processor + memory power is much more efficient than SQL DB engine I decided to rather load overlapping data from DB to my data layer and do the range chopping/splitting in memory. This give me much more flexibility as well as speed in terms of development and execution.

If you think this should be better handled in DB let me know.

Question

I would like to write the fastest and if at all possible also resource non-hungry conversion algorithm. Since I get lots of these records and they are related to various users I have to run this algorithm for each user and its set of overlapping ranges data.

What would be the most efficient (fast and non resource hungry) way of splitting these overlapping ranges?

Example data

I have records ID=1 to ID=5 that visually overlap in this manner (dates are actually irrelevant, I can better show these overlaps this way):

       6666666666666
                44444444444444444444444444         5555555555
          2222222222222            333333333333333333333            7777777
11111111111111111111111111111111111111111111111111111111111111111111

Result should look like:

111111166666666666664444444444444444444444333333333555555555511111117777777

Result actually looks like as if we'd be looking at these overlaps from the top and then get IDs that we see from this top-down view.

Result will actually get transformed into new range records, so old IDs become irrelevant. But their RangeFrom and RangeTo values (along with all related data) will be used:

111111122222222222223333333333333333333333444444444555555555566666667777777

This is of course just an example of overlapping ranges. It can be anything from 0 records to X for any given date range. And as we can see range ID=2 got completely overwritten by 4 and 6 so it became completely obsolete.

like image 679
Robert Koritnik Avatar asked Apr 19 '11 06:04

Robert Koritnik


People also ask

How do you calculate overlapping time ranges?

Overlap = min(A2, B2) - max(A1, B1) + 1. In other words, the overlap of two integer intervals is a difference between the minimum value of the two upper boundaries and the maximum value of the two lower boundaries, plus 1.

How do you know if two date ranges overlap in SQL?

You can do this by swapping the ranges if necessary up front. Then, you can detect overlap if the second range start is: less than or equal to the first range end (if ranges are inclusive, containing both the start and end times); or. less than (if ranges are inclusive of start and exclusive of end).

What are overlapping ranges?

If both ranges have at least one common point, then we say that they're overlapping. In other words, we say that two ranges and are overlapping if: On the other hand, non-overlapping ranges don't have any points in common.


1 Answers

How about an array of nullable integers

I've come up with an idea of my own:

  1. for the given date range I would create an in memory array of integers with as many items as there are days in the range.

  2. fill array with null values. All of them.

  3. order records by ID in reverse order

  4. flatten overlapped ranges by iterating over ordered records and do the following on each item:

    1. get item
    2. calculate start and end offset for array (days difference)
    3. set all array values between these two offsets to item ID but only when value is null
    4. continue to step 4.1
  5. you end up with an array of flattened ranges and filled with record IDs

  6. create new set of records and create each new record when ID in array changes. Each record should use data associated with the record ID as set in array

  7. Repeat the whole thing for next person and its set of overlapped ranges (don't forget to reuse the same array). = go back to step 2.

And that's it basically.

A 10 years given date range requires an array of approx. 3650 nullable integers, which I think is rather small memory footprint (each integer taking 4 bytes, but I don't know how much space occupies a nullable integer that has an int and bool but lets assume 8 bytes which totals at 3650*8 = 28.52k) and can be easily and rather fast manipulate in memory. Since I'm not saving date ranges, splitting or anything similar these are barely just assignment operations with an if that checks whether value has already been set.

A 10 year date range is a rare exaggeratet extreme. 75% of date ranges will be within 3 months or quarter of a year (90 days * 8 bytes = 720 bytes) and 99% will fall in a range of a whole year (365*8 = 2920 bytes = 2,85k)

I find this algorithm more than appropriate for flattening overlapped date ranges.

To half memory footprint I could use int instead of int? and set to -1 instead of null.

A premature iteration loop break possibility

I could as well keep a count of days that aren't set and when it reaches 0 I can easily break the loop, because all remaining ranges are fully overlapped hence they wouldn't set any more values in array. So this would even speed things up a bit when I would have lots of range records (which will be rather rare).

like image 142
Robert Koritnik Avatar answered Sep 29 '22 09:09

Robert Koritnik