My goal is to have a yearly cronjob that deletes certain data from a database based on age. To my disposal I have the powers of Bash and MySQL. I started with writing a bash script but then it struck me that maybe, I could do everything with just a single SQL query.
I'm more a programmer by nature and I haven't had much experience with data structures so that's why I would like some help.
Tables / data structure
The relevant tables and columns for this query are as follows:
Registration:
+-----+-------------------+ | Id | Registration_date | +-----+-------------------+ | 2 | 2011-10-03 | | 3 | 2011-10-06 | | 4 | 2011-10-07 | | 5 | 2011-10-07 | | 6 | 2011-10-10 | | 7 | 2011-10-13 | | 8 | 2011-10-14 | | 9 | 2011-10-14 | | 10 | 2011-10-17 | +-------------------------+
AssociatedClient:
+-----------+-----------------+ | Client_id | Registration_id | +-----------+-----------------+ | 2 | 2 | | 3 | 2 | | 3 | 4 | | 4 | 5 | | 3 | 6 | | 5 | 6 | | 3 | 8 | | 8 | 9 | | 7 | 10 | +-----------------------------+
Client: only Id is relevant here.
As you can see, this is a simple many-to-many relationship. A client can have multiple registrations to his name, and a registration can have multiple clients.
I need to delete all registrations and client data for clients who have not had a new registration in 5 years. Sounds simple, right?
The data should be kept if any other client on any registration from a specific client has a new registration within 5 years.
So imagine client A having 4 registrations with just him in them, and 1 registration with himself and client B. All 5 registrations are older than 5 years. If client B did not have a new registration in 5 years, everything should be deleted: client A registrations and record. If B did have a new registration within 5 years, all client A data should be kept, including his own old registrations.
Building my query, I got about this far:
DELETE * FROM `Registration` AS Reg WHERE TIMESTAMPDIFF(YEAR, Reg.`Registration_date`, NOW()) >= 5 AND (COUNT(`Id`) FROM `Registration` AS Reg2 WHERE Reg2.`Id` IN (SELECT `Registration_id` FROM `AssociatedClient` AS Clients WHERE Clients.`Client_id` IN (SELECT `Client_id` FROM `AssociatedClient` AS Clients2 WHERE Clients2.`Registration_id` IN -- stuck #I need all the registrations from the clients associated with the first # (outer) registration here, that are newer than 5 years. ) = 0 -- No newer registrations from any associated clients
Please understand that I have very limited experience with SQL. I realise that even what I got so far can be heavily optimised (with joins etc) and may not even be correct.
The reason I got stuck is that the solution I had in mind would work if I could use some kind of loop, and I only just realised that this is not something you easily do in an SQL query of this kind.
Is much appreciated.
Begin by identifying the registrations of the other clients of a registration. Here's a view:
create view groups as select a.Client_id , c.Registration_id from AssociatedClient as a join AssociatedClient as b on a.Registration_id = b.Registration_id join AssociatedClient as c on b.Client_id = c.Client_id;
That gives us:
select Client_id , min(Registration_id) as first , max(Registration_id) as last , count(distinct Registration_id) as regs , count(*) as pals from groups group by Client_id; Client_id first last regs pals ---------- ---------- ---------- ---------- ---------- 2 2 8 4 5 3 2 8 4 18 4 5 5 1 1 5 2 8 4 5 7 10 10 1 1 8 9 9 1 1
You dont' need a view, of course; it's just for convenience. You could just use a virtual table. But inspect it carefully to convince yourself it produces the right range of "pal registrations" for each client. Note that the view does not reference Registration
. That's significant because it produces the same results even after we use it to delete from Registration
, so we can use it for the second delete statement.
Now we have a list of clients and their "pal registrations". What's the date of each pal's last registration?
select g.Client_id, max(Registration_date) as last_reg from groups as g join Registration as r on g.Registration_id = r.Id group by g.Client_id; g.Client_id last_reg ----------- ---------- 2 2011-10-14 3 2011-10-14 4 2011-10-07 5 2011-10-14 7 2011-10-17 8 2011-10-14
Which ones have a latest date before a time certain?
select g.Client_id, max(Registration_date) as last_reg from groups as g join Registration as r on g.Registration_id = r.Id group by g.Client_id having max(Registration_date) < '2011-10-08'; g.Client_id last_reg ----------- ---------- 4 2011-10-07
IIUC that would mean that client #4 should be deleted, and anything he registered for should be deleted. Registrations would be
select * from Registration where Id in ( select Registration_id from groups as g where Client_id in ( select g.Client_id from groups as g join Registration as r on g.Registration_id = r.Id group by g.Client_id having max(Registration_date) < '2011-10-08' ) ); Id Registration_date ---------- ----------------- 5 2011-10-07
And, sure enough, client #4 is in Registration #5, and is the only client subject to deletion by this test.
From there you can work out the delete
statements. I think the rule is "delete the client and anything he registered for". If so, I'd probably write the Registration IDs to a temporary table, and write the deletes for both Registration
and AssociatedClient
by joining to it.
You want to know all registrations that need to be kept. So your first query returns registrations within 5 previous years :
SELECT Id FROM Registration WHERE Registration_date >= '2011-10-08'
then all registrations with clients related to the previous query :
SELECT a2.Registration_id as Id FROM AssociatedClient AS a1 INNER JOIN AssociatedClient AS a2 ON a1.Client_id = a2.Client_id WHERE a1.Registration_id IN ( SELECT Id FROM Registration WHERE Registration_date >= '2011-10-08' )
Then you have all registrations that you must not delete by combining the previous queries in an UNION
, and you want all clients that are not part of this query :
SELECT Client_id FROM AssociatedClient WHERE Registration_id NOT IN ( SELECT Id FROM Registration WHERE Registration_date >= '2011-10-08' UNION SELECT a2.Registration_id as Id FROM AssociatedClient AS a1 INNER JOIN AssociatedClient AS a2 ON a1.Client_id = a2.Client_id WHERE a1.Registration_id IN ( SELECT Id FROM Registration WHERE Registration_date >= '2011-10-08' ) )
you can see the results in this SQL fiddle
Then you can delete the lines of clients without registration correspondig to the criterias using the following query :
DELETE FROM AssociatedClient WHERE Client_id IN (<previous query>);
and all registrations not present in AssociatedClient :
DELETE FROM Registration WHERE Id NOT IN (SELECT Registration_id FROM AssociatedClient)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With