I have been struggling with deleting massive quantities of old data from a database. Each of 5 different tables has as many as 50M rows that need to be deleted. No single DELETE statement can handle that quantity of data, so I have to loop through, deleting a few rows at a time. My question is whether there is any noticeable performance gain from looping within a stored procedure instead of looping in the application code. For the specifics, I am using DB2 (9.7 CE) and coding in C#. For my stored procedure I use:
--#SET TERMINATOR ;
DROP PROCEDURE myschema.purge_orders_before;
--#SET TERMINATOR @
CREATE PROCEDURE myschema.purge_orders_before ( IN before_date TIMESTAMP )
  DYNAMIC RESULT SETS 1
P1: BEGIN
  DECLARE no_data SMALLINT DEFAULT 0;
  DECLARE deadlock_encountered SMALLINT DEFAULT 0;
  DECLARE deadlock_condition CONDITION FOR SQLSTATE '40001';
  DECLARE CONTINUE HANDLER FOR NOT FOUND
    SET no_data = 1;
  -- The deadlock_encountered variable is throw-away, but a continue
  -- handler needs to do something; i.e., it is not enough to just
  -- declare a handler, it has to have an action in its body.
  DECLARE CONTINUE HANDLER FOR deadlock_condition
    SET deadlock_encountered = 1;
  WHILE ( no_data = 0 ) DO
    DELETE FROM
      ( SELECT 1 FROM myschema.orders
        WHERE date < before_date
        FETCH FIRST 100 ROWS ONLY );
    COMMIT;
  END WHILE;
END P1
@
--#SET TERMINATOR ;
The approach was unceremoniously lifted from this thread. My programmatic approach is as follows:
public static void PurgeOrdersBefore( DateTime date ) {
    using ( OleDbConnection connection = DatabaseUtil.GetInstance( ).GetConnection( ) ) {
        connection.Open( );
        OleDbCommand command = new OleDbCommand( deleteOrdersBefore, connection );
        command.Parameters.Add( "@Date", OleDbType.DBTimeStamp ).Value = date;
        int rows = 0;
        int loopRows = 0;
        int loopIterations = 0;
        log.Info( "starting PurgeOrdersBefore loop" );
        while ( true ) {
            command.Transaction = connection.BeginTransaction( );
            loopRows = command.ExecuteNonQuery( );
            command.Transaction.Commit( );
            if ( loopRows <= 0 ) {
                break;
            }
            if ( log.IsDebugEnabled ) log.Debug( "purged " + loopRows + " rows in loop iteration " + loopIterations );
            loopIterations++;
            rows += loopRows;
        }
        if ( log.IsInfoEnabled ) log.Info( "purged " + rows + " orders in " + loopIterations + " loop iterations" );
    }
}
I performed a VERY primitive test in which I printed a timestamp at the start and finish and broke out of the loop after 10,000 rows in each case. The outcome of said test was that the stored procedure took slightly over 6 minutes to delete 10,000 rows and the programmatic approach took just under 5 minutes. Primitive as it was, the only conclusion I can draw is that there is likely to be very minimal difference in practice, and keeping the loop in the C# code allows for much more dynamic monitoring.
All that said, does anyone else have any input on the subject? Could you explain what kind of hidden benefits I might receive were I to use the stored procedure approach? In particular, if Serge Rielau keeps an eye on this site, I would love to hear what you have to say (it seems he is the ninja all the others refer to when it comes to DB2 questions like this...)
-------------- Edit ---------------------
How about an export of some sort followed by a LOAD REPLACE? Has anyone done that before? Is there an example that I could follow? What implications would that have?
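To make the question concrete, here is a rough sketch of what I imagine the export/reload cycle would look like from the DB2 command line. The cutoff timestamp, file path, and file format are placeholders, and note that LOAD ... REPLACE empties the table before loading, so this would need to be tested against a copy first:

```sql
-- Export only the rows worth keeping
EXPORT TO /tmp/orders_keep.ixf OF IXF
  SELECT * FROM myschema.orders
  WHERE date >= '2010-01-01-00.00.00.000000';

-- Reload the table from that file, replacing its entire contents
LOAD FROM /tmp/orders_keep.ixf OF IXF
  REPLACE INTO myschema.orders;

-- A LOAD can leave the table in set-integrity-pending state
-- if it has constraints, so re-validate afterward
SET INTEGRITY FOR myschema.orders IMMEDIATE CHECKED;
```

LOAD writes pages directly rather than logging each insert, which is what would make this dramatically faster than millions of individual deletes.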
If the number of records to delete is a large fraction of the total, it can be cheaper to copy the good records into a temporary table, empty the original table, and copy the temp table back. The optimal way to do this is not consistent across RDBMSes; for example, some support TRUNCATE and others do not.
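As an illustration of that copy-and-swap idea, here is a minimal sketch using SQLite via Python. SQLite and the table/column names are stand-ins for the real DB2 setup; in DB2 you would LOAD ... REPLACE (or TRUNCATE and re-insert) instead of dropping and renaming, and you would need to recreate indexes and constraints:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A toy orders table: 90 of 100 rows are "old" and due for deletion.
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, "2011-01-01" if i % 10 == 0 else "2009-01-01") for i in range(1, 101)],
)
conn.commit()

cutoff = "2010-01-01"

# 1. Copy only the rows worth keeping into a scratch table.
cur.execute(
    "CREATE TABLE orders_keep AS SELECT * FROM orders WHERE order_date >= ?",
    (cutoff,),
)

# 2. Drop the original and rename the scratch table into its place.
cur.execute("DROP TABLE orders")
cur.execute("ALTER TABLE orders_keep RENAME TO orders")
conn.commit()

remaining = cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)  # 10 rows survive the cutoff
```

The work done is proportional to the rows you keep, not the rows you delete, which is why this wins when most of the table is going away.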