Updating 4 million records in SQL Server using a list of record ids as input

During a migration project, I'm faced with an update of 4 million records in our SQL Server.

The update is very simple: a boolean field needs to be set to true/1, and the input I have is a list of all the ids for which this field must be set (one id per line).

I'm not exactly an expert when it comes to SQL tasks of this size, so I started out trying a single UPDATE statement containing a "WHERE xxx IN ( {list of ids, separated by comma} )". First, I tried this with a million records. On a small dataset on a test server this worked like a charm, but in the production environment it gave an error. So I shortened the list of ids a couple of times, but to no avail.

The next thing I tried was to turn each id in the list into an UPDATE statement ("UPDATE yyy SET booleanfield = 1 WHERE id = '{id}'"). Somewhere, I read that it's good to have a GO every x number of lines, so I inserted a GO every 100 lines (using the excellent 'sed' tool, ported from unix).

So, I split the list of 4 million update statements into parts of 250,000 each, saved them as SQL files, and started by loading and running the first one in SQL Server Management Studio (2008). Do note that I also tried SQLCMD.exe, but this, to my surprise, ran about 10-20 times slower than SQL Studio.

It took about 1.5 hours to complete and resulted in "Query completed with errors". The Messages list, however, contained a nice list of "1 row(s) affected" and "0 row(s) affected", the latter for when the id was not found.

Next, I checked the number of updated records in the table using a COUNT(*) and found that there was a difference of a couple of thousand records between the number of update statements and the number of updated records.

I then thought that this might be due to the non-existent records, but when I subtracted the number of "0 row(s) affected" messages in the output, there was still a mysterious gap of 895 records.

My questions:

1. Is there any way to find out a description and cause of the errors in "Query completed with errors"?
2. How could the mysterious gap of 895 records be explained?
3. What's a better, or the best, way to do this update? (I'm starting to think that what I'm doing could be very inefficient and/or error-prone.)

asked Feb 09 '13 17:02

RNobel

2 Answers

The best way to approach this task is by inserting the 4 million ids into a table. In fact, you can put them into a table with an identity column, by "bulk inserting" into a view.

create table TheIds (rownum int identity(1,1), id int);
go

create view v_TheIds as select id from TheIds;
go

bulk insert v_TheIds from . . .
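As a minimal sketch of the load step, assuming the ids sit in a plain text file with one id per line and no header row, the statement could look like this. The file path is hypothetical, and the row terminator may need adjusting to match how the file was written.

```sql
-- Load the id list through the view, so that only the id column is
-- supplied and rownum is filled in by the identity property.
-- 'C:\migration\ids.txt' is a hypothetical path, not from the question.
BULK INSERT v_TheIds
FROM 'C:\migration\ids.txt'
WITH (
    ROWTERMINATOR = '\n'   -- one id per line
);

-- Sanity check: this should match the number of lines in the input file.
SELECT COUNT(*) AS loaded_ids FROM TheIds;
```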

With all the data in the database, you now have many more options. Try the update:

update t
    set booleanfield = 1
    where exists (select 1 from TheIds where TheIds.id = t.id)

You should also create an index on TheIds(id).
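A sketch of that index, with a hypothetical index name:

```sql
-- Lets the lookup on TheIds.id in the update's subquery use an index seek
-- instead of scanning all 4 million rows. The name IX_TheIds_id is made up.
CREATE INDEX IX_TheIds_id ON TheIds (id);
```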

This is a large update, all executing as one transaction. That can have bad performance implications and start to fill the log. You can break it into smaller transactions using the rownum column:

update t
    set booleanfield = 1
    where exists (select 1 from TheIds where TheIds.id = t.id and TheIds.rownum < 1000)

The exists clause here does the work of an inner join against TheIds (strictly speaking, a semi-join). The major difference is that this correlated subquery syntax should also work in other databases, whereas the syntax for joins in update statements is database-specific.

With the rownum column, you can select as many rows as you want for the update. So, you can put the update in a loop, if the overall update is too big:

where rownum < 100000
where rownum between 100000 and 199999
where rownum between 200000 and 299999

and so on. You don't have to do this, but you can if you want to batch the updates for some reason.
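The ranges above could be driven by a loop like the following. This is only a sketch: the batch size of 100000 is an arbitrary choice, t stands for the table being updated as in the statements above, and it assumes the default autocommit mode so that each UPDATE runs as its own transaction.

```sql
DECLARE @batch int = 100000;   -- rows of TheIds handled per iteration (arbitrary)
DECLARE @start int = 1;        -- identity(1,1) starts numbering at 1
DECLARE @maxrow int;

SELECT @maxrow = MAX(rownum) FROM TheIds;

-- If TheIds is empty, @maxrow is NULL and the loop body never runs.
WHILE @start <= @maxrow
BEGIN
    UPDATE t
        SET booleanfield = 1
        WHERE EXISTS (SELECT 1
                      FROM TheIds
                      WHERE TheIds.id = t.id
                        AND TheIds.rownum BETWEEN @start AND @start + @batch - 1);

    SET @start = @start + @batch;   -- move on to the next range of rownum values
END;
```

Batching on rownum rather than on id keeps every batch the same size in terms of input ids, regardless of how the id values are distributed.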

The key idea is to get the list of ids into a table in the database, so you can use the power of the database for the subsequent operations.
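For example, once the ids are in a table, queries like these might help to investigate a mismatch between the number of input ids and the number of updated rows. They are a sketch only; whether duplicates or missing ids actually account for the gap in the question is not known. As above, t stands for the table being updated.

```sql
-- Ids that appear more than once in the input list.
SELECT id, COUNT(*) AS occurrences
FROM TheIds
GROUP BY id
HAVING COUNT(*) > 1;

-- Ids in the input list that have no matching row in the target table.
SELECT i.id
FROM TheIds AS i
WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = i.id);

-- Number of target rows that the update is expected to touch.
SELECT COUNT(*) AS rows_to_update
FROM t
WHERE EXISTS (SELECT 1 FROM TheIds WHERE TheIds.id = t.id);
```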

answered Nov 15 '22 07:11

Gordon Linoff

Warning: I have not been able to test it and I do not have a "playground database" which can hold that much data.

I am not sure about 1. and 2. but for 3. you should be better off leaving the limiting of the update to the DB:

UPDATE TOP(100000) yyy
SET booleanfield = 1
WHERE booleanfield = 0
GO

The documentation says that TOP picks its rows in no particular order, but I hope it only picks them from the rows having booleanfield = 0. Run that query repeatedly until no more updates are reported.
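Instead of re-running the statement by hand, the repetition could be scripted along these lines. This is an untested sketch that reuses the statement above unchanged, so like that statement it filters only on booleanfield = 0; restricting it to the listed ids would need an extra predicate against the id list.

```sql
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (100000) yyy
    SET booleanfield = 1
    WHERE booleanfield = 0;

    -- Read @@ROWCOUNT right away: it is reset by the next statement.
    SET @rows = @@ROWCOUNT;
END;
```

The loop ends after the first pass that updates zero rows.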

Another option if the above does not work is to select the affected ids directly from the DB ... this looks odd and I have not tested it either, but I hope it works:

UPDATE yyy
SET booleanfield = 1
FROM (SELECT TOP 100000 id FROM yyy WHERE booleanfield = 0 ORDER BY id ASC) AS xxxx
WHERE yyy.id = xxxx.id;
GO

(I assume here that you have a unique key id in the table.) Run this query several (about 40) times until no more updates are reported.

answered Nov 15 '22 06:11

Clemens Klein-Robbenhaar

Tags: sql, sql-server, tsql, sql-server-2008, data-migration
