Finding consecutive date pairs in SQL

My question resembles some that I found while searching, but those had solutions for slightly different problems and, importantly, solutions that don't work in SQL Server 2000.

I have a very large table with a lot of redundant data that I am trying to reduce to just the useful entries. It's a history table, and the way it works, if two entries are essentially duplicates and are consecutive when sorted by date, the later one can be deleted. The data from the earlier entry will be used when historical data is requested for a date between the effective date of that entry and the next non-duplicate entry.

The data looks something like this:

id     user_id effective_date important_value useless_value
1      1       1/3/2007       3               0
2      1       1/4/2007       3               1
3      1       1/6/2007       NULL            1
4      1       2/1/2007       3               0
5      2       1/5/2007       12              1
6      3       1/1/1899       7               0

With this sample set, we would consider two consecutive rows duplicates if the user_id and the important_value are the same. From this sample set, we would delete only the row with id=2, preserving the information from 1/3/2007, showing that the important_value changed on 1/6/2007, and then showing the relevant change again on 2/1/2007.
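For anyone who wants to try queries against the sample rows above, a minimal setup script might look like the following. The table name History comes from the pseudocode further down; the column types, the primary key, and the month/day/year reading of the dates are assumptions, not details taken from the real schema. It uses one INSERT per row because SQL Server 2000 does not accept multi-row VALUES lists.

```sql
-- Hypothetical schema: column types and the primary key are assumptions.
CREATE TABLE History (
    id              int      NOT NULL PRIMARY KEY,
    [user_id]       int      NOT NULL,
    effective_date  datetime NOT NULL,  -- datetime, since smalldatetime cannot hold 1899
    important_value int      NULL,
    useless_value   int      NOT NULL
)

-- Dates use the unambiguous yyyymmdd literal form (read as month/day/year from the sample).
INSERT INTO History (id, [user_id], effective_date, important_value, useless_value) VALUES (1, 1, '20070103', 3,    0)
INSERT INTO History (id, [user_id], effective_date, important_value, useless_value) VALUES (2, 1, '20070104', 3,    1)
INSERT INTO History (id, [user_id], effective_date, important_value, useless_value) VALUES (3, 1, '20070106', NULL, 1)
INSERT INTO History (id, [user_id], effective_date, important_value, useless_value) VALUES (4, 1, '20070201', 3,    0)
INSERT INTO History (id, [user_id], effective_date, important_value, useless_value) VALUES (5, 2, '20070105', 12,   1)
INSERT INTO History (id, [user_id], effective_date, important_value, useless_value) VALUES (6, 3, '18990101', 7,    0)
```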

My current approach is awkward and time-consuming, and I know there must be a better way. I wrote a script that uses a cursor to iterate through the user_id values (since that breaks the huge table up into manageable pieces) and creates a temp table of just the rows for that user. Then, to get consecutive entries, it joins the temp table to itself on the condition that there are no other entries in the temp table with a date between the two dates. In the pseudocode below, UDF_SameOrNull is a function that returns 1 if the two values passed in are the same or if they are both NULL.

WHILE (@@fetch_status <> -1)
BEGIN
  -- copy this user's rows into a temp table (INTO must come before FROM)
  SELECT * INTO #history FROM History WHERE user_id = @UserId

  --return entries to delete
  SELECT h2.id
  INTO #delete_history_ids
  FROM #history h1
  JOIN #history h2 ON
    h1.effective_date < h2.effective_date
    AND dbo.UDF_SameOrNull(h1.important_value, h2.important_value)=1
  WHERE NOT EXISTS (SELECT 1 FROM #history hx WHERE hx.effective_date > h1.effective_date and hx.effective_date < h2.effective_date)

  DELETE h1
  FROM History h1
  JOIN #delete_history_ids dh ON
    h1.id = dh.id 
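
  -- drop the temp tables so SELECT ... INTO can recreate them for the next user
  DROP TABLE #history
  DROP TABLE #delete_history_ids

  FETCH NEXT FROM UserCursor INTO @UserId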
END 

It also loops over the same set of duplicates until there are none, since taking out rows creates new consecutive pairs that are potentially dupes. I left that out for simplicity.

Unfortunately, I must use SQL Server 2000 for this task and I am pretty sure that it does not support ROW_NUMBER() for a more elegant way to find consecutive entries.

Thanks for reading. I apologize for any unnecessary backstory or errors in the pseudocode.

tedders asked Mar 19 '26 05:03


1 Answer

OK, I think I figured this one out, excellent question!

First, I made the assumption that the effective_date column will not be duplicated for a user_id. I think it can be modified to work if that is not the case - so let me know if we need to account for that.

The process basically takes the table of values and self-joins on equal user_id and important_value and a prior effective_date. Then we do one more self-join, a left join on user_id, that effectively checks whether the two joined records above are sequential by verifying that no record has an effective_date that falls between those two records.

It's just a select statement for now; it should select all records that are to be deleted. Once you verify that it is returning the correct data, simply change the select * to delete tcheck.

Let me know if you have questions.

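select 
    * 
from 
    History tcheck
    -- tprev: an earlier row for the same user with the same important_value
    inner join History tprev
        on  tprev.[user_id] = tcheck.[user_id]
            and (tprev.important_value = tcheck.important_value
                 -- "=" never matches NULLs, so treat two NULLs as equal explicitly
                 or (tprev.important_value is null and tcheck.important_value is null))
            and tprev.effective_date < tcheck.effective_date
    -- checkbtwn: any row for that user dated strictly between tprev and tcheck
    left join History checkbtwn
        on  tcheck.[user_id] = checkbtwn.[user_id]
            and checkbtwn.effective_date < tcheck.effective_date
            and checkbtwn.effective_date > tprev.effective_date
where
    -- no row in between, so tprev and tcheck are consecutive
    checkbtwn.[user_id] is null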
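
As a rough sketch of the delete form described above, the statement below swaps select * for delete tcheck and wraps it in a transaction so the result can be inspected before anything is committed. The transaction wrapper and the explicit both-NULL comparison (standing in for the question's UDF_SameOrNull) are additions for illustration, not part of the original query. Because each row is compared against its immediate predecessor in the unmodified table, a run of several duplicates should lose every row but the first in a single pass; on the question's sample data this should remove only id=2.

```sql
begin transaction

delete tcheck
from
    History tcheck
    inner join History tprev
        on  tprev.[user_id] = tcheck.[user_id]
            -- assumption: two NULLs count as "the same", as UDF_SameOrNull does
            and (tprev.important_value = tcheck.important_value
                 or (tprev.important_value is null and tcheck.important_value is null))
            and tprev.effective_date < tcheck.effective_date
    left join History checkbtwn
        on  tcheck.[user_id] = checkbtwn.[user_id]
            and checkbtwn.effective_date < tcheck.effective_date
            and checkbtwn.effective_date > tprev.effective_date
where
    checkbtwn.[user_id] is null

-- look at what is left before deciding
select * from History order by [user_id], effective_date

-- switch this to "commit transaction" once the remaining rows look right
rollback transaction
```

Like the select, this still assumes that effective_date is never duplicated within a user_id.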

Derek Kromm answered Mar 21 '26 21:03