Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Looping through DELETE in code vs in stored procedure

I have been struggling with deleting massive quantities of old data from a database. Each of 5 different tables has as many as 50M rows that need to be deleted. No single delete statement could handle that quantity of data, so I have to loop through deleting a few at a time. My question is to whether there is any noticeable performance increase in looping within a stored procedure instead of looping in the application code. Now for the specifics, I am using DB2 (9.7 CE), and coding in C#. For my stored procedure I use:

--#SET TERMINATOR ;
DROP PROCEDURE myschema.purge_orders_before;
--#SET TERMINATOR @
CREATE PROCEDURE myschema.purge_orders_before (IN before_date TIMESTAMP)
    DYNAMIC RESULT SETS 1
P1: BEGIN
    DECLARE no_data SMALLINT DEFAULT 0;
    DECLARE deadlock_encountered SMALLINT DEFAULT 0;
    DECLARE deadlock_condition CONDITION FOR SQLSTATE '40001';

    DECLARE CONTINUE HANDLER FOR NOT FOUND
        SET no_data = 1;

    -- The deadlock_encountered attribute is throw-away,
    -- but a continue handler needs to do something,
    -- i.e., it's not enough to just declare a handler,
    -- it has to have an action in its body.
    DECLARE CONTINUE HANDLER FOR deadlock_condition
        SET deadlock_encountered = 1;

    WHILE (no_data = 0 ) DO
        DELETE FROM 
            (SELECT 1 FROM myschema.orders WHERE date < before_date FETCH FIRST 100 ROWS ONLY );
        COMMIT;
    END WHILE;
END P1
@
--#SET TERMINATOR ;

Whose approach was unceremoniously lifted from this thread. My programmatic approach is as follows:

public static void PurgeOrdersBefore( DateTime date ) {
    using ( OleDbConnection connection = DatabaseUtil.GetInstance( ).GetConnection( ) ) {
        connection.Open( );
        OleDbCommand command = new OleDbCommand( deleteOrdersBefore, connection );
        command.Parameters.Add( "@Date", OleDbType.DBTimeStamp ).Value = date;
        int rows = 0;
        int loopRows = 0;
        int loopIterations = 0;

        log.Info( "starting PurgeOrdersBefore loop" );
        while ( true ) {
            command.Transaction = connection.BeginTransaction( );
            loopRows = command.ExecuteNonQuery( );
            command.Transaction.Commit( );
            if ( loopRows <= 0 ) {
                break;
            }
            if ( log.IsDebugEnabled ) log.Debug( "purged " + loopRows + " in loop iteration " + loopIterations );

            loopIterations++;
            rows += loopRows;
        }
        if ( log.IsInfoEnabled ) log.Info( "purged " + rows + " orders in " + loopIterations + " loop iterations" );
    }
}

I performed a VERY primitive test in which I printed a timestamp at the start and finish and broke out of the loop after 10,000 in each. The outcome of said test was that the stored procedure took slightly over 6 minutes to delete 10,000 rows and the programmatic approach took just under 5 minutes. Being as primitive as it was, I imagine the only conclusion I can draw is that their is likely going to be very minimal difference in practice and keeping the loop in the C# code allows for much more dynamic monitoring.

All that said, does anyone else have any input on the subject? Could you explain what kind of hidden benefits I might receive were I to use the stored procedure approach? In particular, if Serge Rielau keeps an eye on this site, I would love to hear what you have to say (seems to be that he is the ninja all the others refer to when it comes to DB2 nonsense like this...)

-------------- Edit ---------------------

How about an export of some sort followed by a LOAD REPLACE? Has anyone done that before? Is there an example that I could follow? What implications would that have?

like image 707
Lucas Avatar asked Mar 18 '26 01:03

Lucas


1 Answers

If the number of records to delete is a large fraction of the total, it can be cheaper to copy the good records into a temporary table, empty the original table, and copy the temp table back. The optimal way to do this is not consistent across RDBMSes; for example, some support TRUNCATE and others do not.

like image 123
Andrew Lazarus Avatar answered Mar 20 '26 14:03

Andrew Lazarus