Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trouble wrapping head around complex SQL delete query

Tags:

Situation

My goal is to have a yearly cronjob that deletes certain data from a database based on age. To my disposal I have the powers of Bash and MySQL. I started with writing a bash script but then it struck me that maybe, I could do everything with just a single SQL query.

I'm more a programmer by nature and I haven't had much experience with data structures so that's why I would like some help.

Tables / data structure

The relevant tables and columns for this query are as follows:

Registration:

+-----+-------------------+ | Id  | Registration_date | +-----+-------------------+ |   2 | 2011-10-03        |  |   3 | 2011-10-06        |  |   4 | 2011-10-07        |  |   5 | 2011-10-07        |  |   6 | 2011-10-10        |  |   7 | 2011-10-13        |  |   8 | 2011-10-14        |  |   9 | 2011-10-14        |  |  10 | 2011-10-17        | +-------------------------+  

AssociatedClient:

+-----------+-----------------+ | Client_id | Registration_id | +-----------+-----------------+ |         2 |               2 |  |         3 |               2 |  |         3 |               4 |  |         4 |               5 |  |         3 |               6 |  |         5 |               6 |  |         3 |               8 |  |         8 |               9 |  |         7 |              10 |  +-----------------------------+ 

Client: only Id is relevant here.

As you can see, this is a simple many-to-many relationship. A client can have multiple registrations to his name, and a registration can have multiple clients.

The goal

I need to delete all registrations and client data for clients who have not had a new registration in 5 years. Sounds simple, right?

The tricky part

The data should be kept if any other client on any registration from a specific client has a new registration within 5 years.

So imagine client A having 4 registrations with just him in them, and 1 registration with himself and client B. All 5 registrations are older than 5 years. If client B did not have a new registration in 5 years, everything should be deleted: client A registrations and record. If B did have a new registration within 5 years, all client A data should be kept, including his own old registrations.

What I've tried

Building my query, I got about this far:

DELETE * FROM `Registration` AS Reg WHERE TIMESTAMPDIFF(YEAR, Reg.`Registration_date`, NOW()) >= 5 AND      (COUNT(`Id`) FROM `Registration` AS Reg2      WHERE Reg2.`Id` IN (SELECT `Registration_id` FROM `AssociatedClient` AS Clients                          WHERE Clients.`Client_id` IN (SELECT `Client_id` FROM `AssociatedClient` AS Clients2                                                        WHERE Clients2.`Registration_id` IN -- stuck                #I need all the registrations from the clients associated with the first                # (outer) registration here, that are newer than 5 years.      ) = 0 -- No newer registrations from any associated clients 

Please understand that I have very limited experience with SQL. I realise that even what I got so far can be heavily optimised (with joins etc) and may not even be correct.

The reason I got stuck is that the solution I had in mind would work if I could use some kind of loop, and I only just realised that this is not something you easily do in an SQL query of this kind.

Any help

Is much appreciated.

like image 548
MarioDS Avatar asked Feb 26 '13 10:02

MarioDS


2 Answers

Begin by identifying the registrations of the other clients of a registration. Here's a view:

create view groups as  select   a.Client_id        , c.Registration_id from AssociatedClient as a  join AssociatedClient as b on a.Registration_id = b.Registration_id  join AssociatedClient as c on b.Client_id = c.Client_id; 

That gives us:

select Client_id     , min(Registration_id) as first     , max(Registration_id) as last     , count(distinct Registration_id) as regs     , count(*) as pals from  groups  group by Client_id; Client_id   first       last        regs        pals       ----------  ----------  ----------  ----------  ---------- 2           2           8           4           5          3           2           8           4           18         4           5           5           1           1          5           2           8           4           5          7           10          10          1           1          8           9           9           1           1          

You dont' need a view, of course; it's just for convenience. You could just use a virtual table. But inspect it carefully to convince yourself it produces the right range of "pal registrations" for each client. Note that the view does not reference Registration. That's significant because it produces the same results even after we use it to delete from Registration, so we can use it for the second delete statement.

Now we have a list of clients and their "pal registrations". What's the date of each pal's last registration?

select g.Client_id, max(Registration_date) as last_reg from groups as g join Registration as r on g.Registration_id = r.Id group by g.Client_id; g.Client_id  last_reg   -----------  ---------- 2            2011-10-14 3            2011-10-14 4            2011-10-07 5            2011-10-14 7            2011-10-17 8            2011-10-14 

Which ones have a latest date before a time certain?

select g.Client_id, max(Registration_date) as last_reg from groups as g join Registration as r on g.Registration_id = r.Id group by g.Client_id having max(Registration_date) < '2011-10-08'; g.Client_id  last_reg   -----------  ---------- 4            2011-10-07 

IIUC that would mean that client #4 should be deleted, and anything he registered for should be deleted. Registrations would be

select * from Registration where Id in (       select Registration_id from groups as g       where Client_id in (              select g.Client_id             from groups as g join Registration as r             on g.Registration_id = r.Id             group by g.Client_id             having max(Registration_date) < '2011-10-08'       ) ); Id          Registration_date ----------  ----------------- 5           2011-10-07        

And, sure enough, client #4 is in Registration #5, and is the only client subject to deletion by this test.

From there you can work out the delete statements. I think the rule is "delete the client and anything he registered for". If so, I'd probably write the Registration IDs to a temporary table, and write the deletes for both Registration and AssociatedClient by joining to it.

like image 168
James K. Lowden Avatar answered Sep 21 '22 21:09

James K. Lowden


You want to know all registrations that need to be kept. So your first query returns registrations within 5 previous years :

SELECT   Id FROM   Registration WHERE   Registration_date >= '2011-10-08' 

then all registrations with clients related to the previous query :

SELECT   a2.Registration_id as Id FROM   AssociatedClient AS a1   INNER JOIN AssociatedClient AS a2     ON a1.Client_id = a2.Client_id  WHERE   a1.Registration_id IN   (       SELECT       Id     FROM       Registration     WHERE       Registration_date >= '2011-10-08'  ) 

Then you have all registrations that you must not delete by combining the previous queries in an UNION, and you want all clients that are not part of this query :

SELECT   Client_id FROM   AssociatedClient WHERE   Registration_id NOT IN   (     SELECT       Id     FROM       Registration     WHERE       Registration_date >= '2011-10-08'     UNION     SELECT       a2.Registration_id as Id     FROM       AssociatedClient AS a1       INNER JOIN AssociatedClient AS a2         ON a1.Client_id = a2.Client_id      WHERE       a1.Registration_id IN       (           SELECT           Id         FROM           Registration         WHERE           Registration_date >= '2011-10-08'       )   ) 

you can see the results in this SQL fiddle

Then you can delete the lines of clients without registration correspondig to the criterias using the following query :

DELETE FROM   AssociatedClient WHERE   Client_id IN (<previous query>); 

and all registrations not present in AssociatedClient :

DELETE FROM   Registration WHERE   Id NOT IN (SELECT Registration_id FROM AssociatedClient) 
like image 37
psadac Avatar answered Sep 21 '22 21:09

psadac