I have date range data in an SQL DB table that has these three (only relevant) columns:

- ID (int identity)
- RangeFrom (date only)
- RangeTo (date only)

For any given date range there may be an arbitrary number of records that overlap (completely or partially). A higher ID (newer record) takes precedence over older records that it may overlap (fully or partially). Ranges are inclusive, so two adjacent, non-overlapping ranges' RangeFrom and RangeTo differ by one day.

Since there's a lot of complex data related to these ranges (lots of joins etc.) and since processor + memory power is much more efficient than the SQL DB engine, I decided to load the overlapping data from the DB into my data layer and do the range chopping/splitting in memory. This gives me much more flexibility as well as speed, in terms of both development and execution.
If you think this would be better handled in the DB, let me know.

I would like to write the fastest and, if at all possible, least resource-hungry conversion algorithm. I get lots of these records, related to various users, so I have to run this algorithm once per user over that user's set of overlapping ranges.

What would be the most efficient (fast and resource-light) way of splitting these overlapping ranges?
Say I have records ID=1 to ID=7 that overlap in this manner (the actual dates are irrelevant; the overlaps are easier to show visually):
6666666666666
44444444444444444444444444 5555555555
2222222222222 333333333333333333333 7777777
11111111111111111111111111111111111111111111111111111111111111111111
Result should look like:
111111166666666666664444444444444444444444333333333555555555511111117777777
The result is effectively what you'd see looking at these overlaps from the top: for each day, the top-down view shows the ID that wins.

The result will then be transformed into new range records, so the old IDs become irrelevant, but their RangeFrom and RangeTo values (along with all related data) will be used:
111111122222222222223333333333333333333333444444444555555555566666667777777
This is of course just an example of overlapping ranges; any given date range can hold anywhere from 0 to X records. And as we can see, range ID=2 got completely overwritten by 4 and 6, so it became completely obsolete.
Overlap = min(A2, B2) - max(A1, B1) + 1. In other words, the overlap of two inclusive integer intervals is the difference between the minimum of the two upper boundaries and the maximum of the two lower boundaries, plus one.

You can detect an overlap by swapping the two ranges up front, if necessary, so the first one starts no later than the second. The ranges then overlap if the second range's start is less than or equal to the first range's end (when ranges include both start and end), or strictly less than it (when ranges include the start but exclude the end).

Put differently, two ranges overlap if they have at least one point in common; non-overlapping ranges have no points in common.
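The overlap formula above can be sketched as two small helpers. This is a minimal illustration in Python (the question is presumably working in C#/.NET, given the mention of int?); the function names are my own:

```python
def overlap_length(a1, a2, b1, b2):
    """Length of the overlap of the inclusive integer intervals [a1, a2]
    and [b1, b2]. Zero or negative means the intervals do not overlap."""
    return min(a2, b2) - max(a1, b1) + 1

def overlaps(a1, a2, b1, b2):
    """True if the two inclusive intervals share at least one common point."""
    return overlap_length(a1, a2, b1, b2) > 0
```

For example, the inclusive intervals [1, 10] and [5, 12] share the six points 5 through 10, so `overlap_length(1, 10, 5, 12)` is 6.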
I've come up with an idea of my own. For the given date range:

1. Create an in-memory array of integers with as many items as there are days in the range.
2. Fill the array with null values. All of them.
3. Order the records by ID in reverse (newest first).
4. Flatten the overlapped ranges by iterating over the ordered records; for each record, write its ID into every array element that falls within the record's range and is still null (a non-null element has already been claimed by a newer record).
5. You end up with an array of flattened ranges, filled with record IDs.
6. Create the new set of records: start a new record whenever the ID in the array changes, and give each new record the data associated with the record ID set in the array.
7. Repeat the whole thing for the next person and their set of overlapped ranges (don't forget to reuse the same array) = go back to step 2.

And that's basically it.
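The whole day-array approach could be sketched roughly as follows. This is a minimal Python sketch under my own assumptions (records modeled as (id, from, to) tuples, result returned as tuples rather than DB rows); the original would presumably be C#/.NET:

```python
from datetime import date, timedelta

def flatten_ranges(records, range_from, range_to):
    """Flatten overlapping inclusive date ranges by painting newer (higher-ID)
    records into a one-slot-per-day array first, so older records can only
    claim days that are still empty; then merge runs of equal IDs.
    `records` is a list of (id, from_date, to_date) tuples."""
    days = (range_to - range_from).days + 1
    slots = [None] * days                       # one slot per day, all null

    # Newest first; a record only claims slots that are still null.
    for rec_id, d_from, d_to in sorted(records, key=lambda r: r[0], reverse=True):
        start = max((d_from - range_from).days, 0)
        end = min((d_to - range_from).days, days - 1)
        for i in range(start, end + 1):
            if slots[i] is None:
                slots[i] = rec_id

    # Emit a new record each time the ID in the array changes.
    result = []
    run_start = 0
    for i in range(1, days + 1):
        if i == days or slots[i] != slots[run_start]:
            if slots[run_start] is not None:    # skip uncovered gaps
                result.append((slots[run_start],
                               range_from + timedelta(days=run_start),
                               range_from + timedelta(days=i - 1)))
            run_start = i
    return result
```

For instance, with record 2 (Jan 3 - Jan 6) laid over record 1 (Jan 1 - Jan 10), the flattened output is three new ranges: 1 (Jan 1 - Jan 2), 2 (Jan 3 - Jan 6), 1 (Jan 7 - Jan 10).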
A 10-year date range requires an array of approx. 3,650 nullable integers, which I think is a rather small memory footprint. A plain integer takes 4 bytes; I don't know exactly how much space a nullable integer (an int plus a bool) occupies, but let's assume 8 bytes, which totals 3650 * 8 = 29,200 bytes, i.e. about 28.5 KB. That can be manipulated in memory easily and rather fast. Since I'm not saving date ranges, splitting or anything similar, these are barely more than assignment operations with an if that checks whether the value has already been set.

A 10-year date range is a rare, exaggerated extreme. 75% of date ranges will be within 3 months or a quarter of a year (90 days * 8 bytes = 720 bytes), and 99% will fall within a whole year (365 * 8 = 2,920 bytes, about 2.85 KB).
I find this algorithm more than appropriate for flattening overlapped date ranges.

To halve the memory footprint I could use int instead of int? and set elements to -1 instead of null.
I could also keep a count of days that aren't set yet; when it reaches 0 I can break the loop early, because all remaining ranges are fully overlapped and wouldn't set any more values in the array. This would speed things up a bit when there are lots of range records (which will be rather rare).
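The early-exit counter could look something like the sketch below. This is an illustrative Python fragment under my own assumptions (records already reduced to (id, start_index, end_index) day-index tuples, -1 marking unset slots as suggested above):

```python
def paint_with_early_exit(records, days):
    """Paint record IDs newest-first into `days` slots. Records are
    (id, start_index, end_index) tuples with inclusive day indices.
    Stops as soon as every slot is claimed, since any remaining (older)
    records are fully overlapped and could not change the array."""
    slots = [-1] * days            # -1 instead of null halves the footprint
    remaining = days               # count of slots not yet claimed
    for rec_id, start, end in sorted(records, key=lambda r: r[0], reverse=True):
        for i in range(max(start, 0), min(end, days - 1) + 1):
            if slots[i] == -1:
                slots[i] = rec_id
                remaining -= 1
        if remaining == 0:         # nothing left to claim: break early
            break
    return slots
```

The counter costs one decrement per claimed slot and one comparison per record, so the optimisation is essentially free even when it never triggers.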