Database : SQL Server 2005
Problem : Copy values from one column to another column in the same table with a billion+ rows.
test_table (int id, bigint bigid)
Tried 1 - a plain UPDATE query
update test_table set bigid = id
This fills up the transaction log and rolls back due to lack of transaction log space.
Tried 2 - a procedure along the following lines
set nocount on
set rowcount 500000
declare @rowcount int
declare @rowsupdated bigint
set @rowcount = 1
set @rowsupdated = 0
while @rowcount > 0
begin
    update test_table set bigid = id where bigid is null
    set @rowcount = @@rowcount
    set @rowsupdated = @rowsupdated + @rowcount
end
set rowcount 0
print @rowsupdated
The above procedure starts slowing down as it proceeds.
Tried 3 - Creating a cursor for update.
Cursors are generally discouraged in the SQL Server documentation, and this approach updates one row at a time, which is too time-consuming.
Is there an approach that can speed up the copying of values from one column to another? Basically I am looking for some 'magic' keyword or logic that will allow the update query to rip through the billion rows half a million at a time, sequentially.
Any hints or pointers will be much appreciated.
Click the tab for the table with the columns you want to copy and select those columns. From the Edit menu, click Copy. Click the tab for the table into which you want to copy the columns. Select the column you want to follow the inserted columns and, from the Edit menu, click Paste.
I'm going to guess that you are closing in on the 2.1 billion limit of an INT datatype on an artificial key column. Yes, that's a pain. Much easier to fix before the fact than after you've actually hit that limit and production is shut down while you are trying to fix it :)
Anyway, several of the ideas here will work. Let's talk about speed, efficiency, indexes, and log size, though.
The log blew up originally because it was trying to commit all 2 billion rows at once. The suggestions in other posts for "chunking it up" will work, but that may not totally resolve the log issue.
If the database is in SIMPLE mode, you'll be fine (the log will re-use itself after each batch). If the database is in FULL or BULK_LOGGED recovery mode, you'll have to run log backups frequently while your operation is running so that SQL Server can re-use the log space. This might mean increasing the frequency of the backups during this time, or just monitoring the log usage while running.
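A minimal sketch of checking the recovery model and backing up the log between batches (the database name BigDb and the backup path are assumptions, not from the original answer):
-- Check the current recovery model
select name, recovery_model_desc from sys.databases where name = 'BigDb'
-- In FULL or BULK_LOGGED recovery, a log backup between batches lets the log space be re-used
backup log BigDb to disk = 'D:\Backups\BigDb_log.trn'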
ALL of the WHERE bigid IS NULL answers will slow down as the table is populated, because there is (presumably) no index on the new BIGID field. You could, of course, just add an index on BIGID, but I'm not convinced that is the right answer.
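For reference, the index the answer mentions would look something like this; a sketch only (the index name is made up, building it on a billion-row table is itself a sizeable operation, and filtered indexes are not available in SQL Server 2005):
create index IX_test_table_bigid on dbo.test_table (bigid)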
The key (pun intended) is my assumption that the original ID field is probably the primary key, or the clustered index, or both. In that case, let's take advantage of that fact and do a variation of Jess' idea:
declare @counter bigint
set @counter = 1
while @counter < 2000000000 --or whatever
begin
    update test_table set bigid = id
    where id between @counter and (@counter + 499999) --BETWEEN is inclusive
    set @counter = @counter + 500000
end
This should be extremely fast, because of the existing indexes on ID.
The IS NULL check really wasn't necessary anyway, and neither is my (-1) on the interval. If we duplicate some rows between calls, that's not a big deal.
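As a small follow-up, the hard-coded 2,000,000,000 upper bound could be read from the table instead; a minimal sketch (assuming id values are positive):
declare @max_id bigint
select @max_id = max(id) from test_table
-- then loop with: while @counter <= @max_id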
Use TOP in the UPDATE statement:
UPDATE TOP (@row_limit) dbo.test_table
SET bigid = id
WHERE bigid IS NULL
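A minimal sketch of wrapping that TOP-based UPDATE in a batching loop; the @row_limit value of 500000 and the CHECKPOINT are assumptions, not part of the original answer:
declare @row_limit int
set @row_limit = 500000
while 1 = 1
begin
    update top (@row_limit) dbo.test_table
    set bigid = id
    where bigid is null
    if @@rowcount = 0 break
    checkpoint  -- in SIMPLE recovery, helps the log space get re-used between batches
end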