
Batch commit on large INSERT operation in native SQL?

I have a couple of large tables (188m and 144m rows) I need to populate from views, but each view contains a few hundred million rows (pulling together pseudo-dimensionally modelled data into a flat form). The keys on each table are composite keys of over 50 bytes of columns. If the data were already in tables, I could always consider building a new table and swapping it in with sp_rename, but that isn't really an option.

If I do a single INSERT operation, the process uses a huge amount of transaction log space, typically filling it up and prompting a bunch of hassle with the DBAs. (And yes, this is probably a job the DBAs should handle/design/architect.)

I can use SSIS and stream the data into the destination table with batch commits (but this does require the data to be transmitted over the network, since we are not allowed to run SSIS packages on the server).

Are there any options other than dividing the process into multiple INSERT operations, using some kind of key to distribute the rows into different batches, and looping?

asked Oct 21 '09 by Cade Roux


2 Answers

Does the view have ANY kind of unique identifier / candidate key? If so, you could select those rows into a working table using:

SELECT key_columns INTO dbo.temp FROM dbo.HugeView;

(If it makes sense, maybe put this table into a different database, perhaps with SIMPLE recovery model, to prevent the log activity from interfering with your primary database. This should generate much less log anyway, and you can free up the space in the other database before you resume, in case the problem is that you have inadequate disk space all around.)
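If you do go that route, a minimal sketch might look like the following; the StagingDB name and the SIMPLE recovery choice are just illustrative, and key_columns stands in for whatever candidate key the view actually has:

CREATE DATABASE StagingDB;
ALTER DATABASE StagingDB SET RECOVERY SIMPLE;

-- stage only the keys; the wide rows stay in the view until the batched load
SELECT key_columns
INTO StagingDB.dbo.temp
FROM dbo.HugeView;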

Then you can do something like this, inserting 10,000 rows at a time, and backing up the log in between:

SET NOCOUNT ON;

DECLARE
    @batchsize INT,
    @ctr INT,
    @rc INT;

SELECT
    @batchsize = 10000,
    @ctr = 0;

WHILE 1 = 1
BEGIN
    -- Number the staged keys and insert only the slice that belongs
    -- to the current batch.
    WITH x AS
    (
        SELECT key_column, rn = ROW_NUMBER() OVER (ORDER BY key_column)
        FROM dbo.temp
    )
    INSERT dbo.PrimaryTable(a, b, c, etc.)
        SELECT v.a, v.b, v.c, etc.
        FROM x
        INNER JOIN dbo.HugeView AS v
        ON v.key_column = x.key_column
        WHERE x.rn > @batchsize * @ctr
        AND x.rn <= @batchsize * (@ctr + 1);

    -- Zero rows inserted means every batch has been processed.
    SET @rc = @@ROWCOUNT;
    IF @rc = 0
        BREAK;

    -- Back up the log after each batch so the space can be reused
    -- instead of growing for the duration of the whole load.
    BACKUP LOG PrimaryDB TO DISK = 'C:\db.bak' WITH INIT;

    SET @ctr = @ctr + 1;
END

That's all off the top of my head, so don't cut/paste/run it, but I think the general idea is there. For more details (and why I back up the log / checkpoint inside the loop, with a SIMPLE-recovery variant sketched just below the link), see this post on sqlperformance.com:

  • Break large delete operations into chunks
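If the database you're loading into (or the staging copy) is in the SIMPLE recovery model instead, you can't take log backups at all; in that case a CHECKPOINT in place of the BACKUP LOG serves the same purpose of letting log space be reused between batches, roughly:

    -- SIMPLE recovery variant: no log backup possible, so checkpoint
    -- so the inactive portion of the log can be reused.
    CHECKPOINT;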

Note that if you are taking regular database and log backups, you will probably want to take a full backup to start your log chain over again.
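For example (the database name and backup path here are just placeholders):

BACKUP DATABASE PrimaryDB TO DISK = 'C:\PrimaryDB_full.bak' WITH INIT;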

answered Sep 28 '22 by Aaron Bertrand


You could partition your data and insert it in a cursor loop. That would be nearly the same as SSIS batch inserting, but it runs on your server.

DECLARE @year INT, @month INT;
DECLARE c CURSOR FOR
    SELECT DISTINCT YEAR(DateCol), MONTH(DateCol) FROM whatever;
OPEN c;
FETCH NEXT FROM c INTO @year, @month;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- copy one month's worth of rows per iteration
    INSERT INTO yourtable(...)
    SELECT * FROM whatever
    WHERE YEAR(DateCol) = @year AND MONTH(DateCol) = @month;
    FETCH NEXT FROM c INTO @year, @month;
END
CLOSE c;
DEALLOCATE c;
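One caveat, assuming DateCol has a useful index: wrapping it in YEAR()/MONTH() in the WHERE clause prevents an index seek, so every pass scans the source. A half-open date range per batch is usually cheaper; a rough variant of the loop body, using the same placeholder names and the @year/@month values fetched from the cursor:

    -- compute the first day of the batch's month (DATEADD trick, SQL Server 2005+)
    DECLARE @start DATETIME;
    SET @start = DATEADD(MONTH, (@year - 1900) * 12 + (@month - 1), 0);

    INSERT INTO yourtable(...)
    SELECT * FROM whatever
    WHERE DateCol >= @start AND DateCol < DATEADD(MONTH, 1, @start);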
answered Sep 28 '22 by Arthur