
Updating 4 million records in SQL Server using list of record-ids as input

During a migration project, I'm faced with an update of 4 million records in our SQL Server.

The update is very simple: a boolean field needs to be set to true/1, and the input I have is a list of all the ids for which this field must be set (one id per line).

I'm not exactly an expert when it comes to sql tasks of this size, so I started out trying 1 UPDATE statement containing a "WHERE xxx IN ( {list of ids, separated by comma} )". First, I tried this with a million records. On a small dataset on a test-server, this worked like a charm, but in the production environment this gave an error. So, I shortened the length of the list of ids a couple of times, but to no avail.

The next thing I tried was to turn each id in the list into an UPDATE statement ("UPDATE yyy SET booleanfield = 1 WHERE id = '{id}'"). Somewhere, I read that it's good to have a GO every x number of lines, so I inserted a GO every 100 lines (using the excellent 'sed' tool, ported from unix).

So, I separated the list of 4 million update statements into parts of 250,000 each, saved them as sql files and started loading and running the first one in SQL Server Management Studio (2008). Do note that I also tried SQLCMD.exe, but this, to my surprise, ran about 10-20 times slower than Management Studio.

It took about 1.5 hours to complete and resulted in "Query completed with errors". The messages list, however, contained a nice list of "1 row(s) affected" and "0 row(s) affected", the latter for when the id was not found.

Next, I checked the number of updated records in the table using a COUNT(*) and found that there was a difference of a couple of thousand records between the number of update statements and the number of updated records.

I then thought that this might be due to the non-existent records, but when I subtracted the number of "0 row(s) affected" messages in the output, there was still a mysterious gap of 895 records.

My questions:

  1. Is there any way to find out a description and cause of the errors in "Query completed with errors"?

  2. How could the mysterious gap of 895 records be explained?

  3. What's a better, or the best, way to do this update? (I'm starting to think what I'm doing could be very inefficient and/or error-prone.)

RNobel asked Feb 09 '13 17:02



2 Answers

The best way to approach this task is by inserting the 4 million ids into a table. In fact, you can put them into a table with an identity column, by "bulk inserting" into a view.

create table TheIds (rownum int identity(1,1), id int);

create view v_TheIds as select id from TheIds;

bulk insert v_TheIds . . .
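As a rough sketch of what the load step might look like in full: the file path below is hypothetical, and the row terminator is an assumption about how the id file was written. It also assumes the ids are integers, as the table definition above does; the question quotes its ids, so they may in fact be strings.

```sql
-- Hypothetical path; the file holds one id per line, as described in the question.
bulk insert v_TheIds
from 'C:\migration\ids.txt'
with (
    rowterminator = '\n'   -- assumption: Windows line endings; try '0x0a' for Unix-style files
);

-- Sanity check: this should match the number of lines in the input file.
select count(*) as loaded_ids from TheIds;
```

Loading through the view means the file only has to supply the id column, while rownum is filled in by the identity property.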

With all the data in the database, you now have many more options. Try the update:

update t
    set booleanfield = 1
    where exists (select 1 from TheIds where TheIds.id = t.id)

You should also create an index on TheIds(id).
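A minimal sketch of that index, with an illustrative index name. The second statement is an optional extra, not part of the original suggestion: a clustered index on rownum may help if the update is later batched by rownum ranges.

```sql
-- Lets the exists lookup on TheIds.id use an index seek instead of a scan.
create index IX_TheIds_id on TheIds (id);

-- Optional (assumption): keeps rownum ranges physically contiguous for batching.
create unique clustered index IX_TheIds_rownum on TheIds (rownum);
```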

This is a large update, all executing as one transaction. That can have bad performance implications and start to fill the log. You can break it into smaller transactions using the rownum column:

update t
    set booleanfield = 1
    where exists (select 1 from TheIds where TheIds.id = t.id and TheIds.rownum < 1000)

The exists clause here is doing the equivalent of an inner join against TheIds (strictly speaking a semi-join, since each row of t is updated at most once). The major difference is that this correlated subquery syntax should work in other databases, whereas joins with updates are database-specific.

With the rownum column, you can select as many rows as you want for the update. So, you can put the update in a loop, if the overall update is too big:

where rownum < 100000
where rownum between 100000 and 199999
where rownum between 200000 and 299999

and so on. You don't have to do this, but you can if you want to batch the updates for some reason.
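The ranges above could be driven by a loop along these lines. This is only a sketch: the batch size and variable names are assumptions, and t stands for the target table as in the statements above. Outside an explicit transaction, each update commits on its own, which is what keeps the individual transactions small.

```sql
declare @batch int = 100000;   -- batch size, an assumption; tune to taste
declare @lo int = 1;           -- identity(1,1) starts numbering at 1
declare @max int = (select max(rownum) from TheIds);

while @lo <= @max
begin
    -- Update only the rows whose id falls in the current rownum slice.
    update t
        set booleanfield = 1
        where exists (select 1
                      from TheIds
                      where TheIds.id = t.id
                        and TheIds.rownum between @lo and @lo + @batch - 1);

    set @lo = @lo + @batch;    -- move on to the next slice
end;
```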

The key idea is to get the list of ids into a table in the database, so you can use the power of the database for the subsequent operations.
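Having the ids in a table also makes it possible to check the input itself. The two queries below are a sketch of such checks, with t again standing for the target table. Duplicate ids in the list are one possible contributor to a gap like the one in the question, because each repeated id would report "1 row(s) affected" while changing only one row. That is an assumption worth ruling out, not a confirmed cause.

```sql
-- Ids that occur more than once in the input list.
select id, count(*) as occurrences
from TheIds
group by id
having count(*) > 1;

-- Ids in the list that have no matching row in the target table.
select i.id
from TheIds i
where not exists (select 1 from t where t.id = i.id);
```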

Gordon Linoff answered Nov 15 '22 07:11


Warning: I have not been able to test it and I do not have a "playground database" which can hold that much data.

I am not sure about 1. and 2. but for 3. you should be better off leaving the limiting of the update to the DB:

UPDATE TOP(100000) yyy
SET booleanfield = 1
WHERE booleanfield = 0
GO

The documentation says that the rows picked by TOP in an UPDATE are not chosen in any particular order, but the WHERE clause still applies, so only rows having booleanfield = 0 are candidates. Run that query repeatedly until no more updates are reported.
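Rather than re-running the statement by hand, the repetition could be scripted roughly as follows. This is a sketch with an assumed variable name. Note that, as written, the statement touches every row where booleanfield = 0; limiting it to the ids from the input list would need an additional condition.

```sql
declare @rows int = 1;

while @rows > 0
begin
    update top (100000) yyy
    set booleanfield = 1
    where booleanfield = 0;

    set @rows = @@rowcount;   -- rows changed by the update just above; 0 ends the loop
end;
```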

Another option if the above does not work is to select the affected ids directly from the DB ... this looks odd and I have not tested it either, but I hope it works:

UPDATE yyy
SET booleanfield = 1
FROM (SELECT TOP 100000 id FROM yyy WHERE booleanfield = 0 ORDER BY id ASC) AS xxxx
WHERE yyy.id = xxxx.id;
GO

(I assume here that you have a unique key id in the table.) Run this query several (about 40) times until no more updates are reported.

Clemens Klein-Robbenhaar answered Nov 15 '22 06:11