 

What's the most efficient way to update thousands of records

Tags: c#, sql, sql-server
We have a C# application which parses data from text files. We then have to update records in our SQL Server database based on the information in those files. What's the most efficient way to pass the data from the application to SQL Server?

We currently build a delimited string and then loop through it in a stored procedure to update the records. I am also testing a TVP (table-valued parameter). Are there any other options out there?

Our files contain thousands of records and we would like a solution that takes the least amount of time.

Asked by Anita on Oct 31 '22


1 Answer

Please do not use a DataTable, as that just wastes CPU and memory for no benefit (other than, possibly, familiarity). I have detailed a very fast and flexible approach in my answer to the following question, which is very similar to this one:

How can I insert 10 million records in the shortest time possible?

The example shown in that answer is for INSERT only, but it can easily be adapted to include UPDATE. Also, it uploads all rows in a single shot, but that can just as easily be adapted to count records and exit the IEnumerable method after X of them have been passed in, then close the file once there are no more records. This requires storing the file pointer (i.e. the stream) in a static variable that keeps getting passed to the IEnumerable method, so that reading can resume from the most recent position on the next call. I have a working example of this approach in the following answer; it used a SqlDataReader as input, but the technique is the same and requires very little modification:

How to split one big table that has 100 million data to multiple tables?
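
To make the streaming-TVP idea concrete, here is a minimal sketch (not the exact code from those answers). It assumes a user-defined table type dbo.ImportRowType (ID INT, Value VARCHAR(100)), a stored procedure dbo.ImportData whose @Rows parameter is used for the set-based UPDATE, and a pipe-delimited file; all of those names are illustrative only:

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using Microsoft.SqlServer.Server;

    static class TvpStreamer
    {
        // Yields one SqlDataRecord per line of the file; SqlClient pulls these
        // lazily, so the whole file is never materialized in memory.
        static IEnumerable<SqlDataRecord> GetFileRows(string path)
        {
            var record = new SqlDataRecord(
                new SqlMetaData("ID", SqlDbType.Int),
                new SqlMetaData("Value", SqlDbType.VarChar, 100));

            foreach (var line in File.ReadLines(path))
            {
                var fields = line.Split('|');       // delimiter is an assumption
                record.SetInt32(0, int.Parse(fields[0]));
                record.SetString(1, fields[1]);
                yield return record;                // streamed, not buffered
            }
        }

        static void Import(string connectionString, string path)
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("dbo.ImportData", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                var tvp = cmd.Parameters.AddWithValue("@Rows", GetFileRows(path));
                tvp.SqlDbType = SqlDbType.Structured;
                tvp.TypeName = "dbo.ImportRowType"; // assumed TVP type name
                conn.Open();
                cmd.ExecuteNonQuery();              // proc does the set-based UPDATE
            }
        }
    }

Inside the proc, the UPDATE is then a single set-based statement joining the destination table to @Rows.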

And for some perspective, 50k records is not even close to "huge". I have been uploading / merging / syncing data using the method described here on 4-million-row files that hit several tables with 10 million (or more) rows.


Things not to do:

  • Use a DataTable: as I said, if you are just filling it in order to use it with a TVP, it is a waste of CPU, memory, and time.
  • Make 1 update at a time in parallel (as suggested in a comment on the question): this is just crazy. Relational database engines are heavily tuned to work most efficiently with sets, not singleton operations. There is no way that 50k single-row updates will be more efficient than even 500 calls of 100 rows each. Doing them individually just guarantees more contention on the table, even if only row locks (that's 100k lock + unlock operations). It could be faster than a single 50k-row transaction that escalates to a table lock (as Aaron mentioned), but that is why you do it in smaller batches, just so long as small does not mean 1 row ;).
  • Set the batch size arbitrarily. Staying below 5000 rows helps reduce the chance of lock escalation, but don't just pick 200. Experiment with several batch sizes (100, 200, 500, 700, 1000) and try each one a few times; you will see what is best for your system. Just make sure that the batch size is configurable through the app.config file or some other means (a table in the DB, a registry setting, etc.) so that it can be changed without having to re-deploy code (see the batched sketch after this list).
  • SSIS (powerful, but very bulky and not fun to debug)
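
As a rough illustration of the batching and configurable-batch-size points above (and of the "exit the IEnumerable method after X records" idea described earlier), here is a hedged sketch. It passes the open StreamReader into the iterator explicitly rather than holding it in a static field, and the "TvpBatchSize" app.config key plus the dbo.* names are assumptions:

    using System;
    using System.Collections.Generic;
    using System.Configuration;   // requires a reference to System.Configuration
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using Microsoft.SqlServer.Server;

    static class BatchedTvpStreamer
    {
        // Yields at most batchSize records from an already-open reader so the
        // caller can invoke the stored procedure once per batch.
        static IEnumerable<SqlDataRecord> GetBatch(StreamReader reader, int batchSize)
        {
            var record = new SqlDataRecord(
                new SqlMetaData("ID", SqlDbType.Int),
                new SqlMetaData("Value", SqlDbType.VarChar, 100));

            string line;
            int count = 0;
            while (count < batchSize && (line = reader.ReadLine()) != null)
            {
                var fields = line.Split('|');       // assumes 2 pipe-delimited fields
                record.SetInt32(0, int.Parse(fields[0]));
                record.SetString(1, fields[1]);
                count++;
                yield return record;
            }
        }

        static void Import(string connectionString, string path)
        {
            // Batch size comes from config so it can be tuned without redeploying.
            int batchSize = int.Parse(
                ConfigurationManager.AppSettings["TvpBatchSize"] ?? "500");

            using (var reader = new StreamReader(path))
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                while (!reader.EndOfStream)
                {
                    using (var cmd = new SqlCommand("dbo.ImportData", conn))
                    {
                        cmd.CommandType = CommandType.StoredProcedure;
                        var tvp = cmd.Parameters.Add("@Rows", SqlDbType.Structured);
                        tvp.TypeName = "dbo.ImportRowType";
                        tvp.Value = GetBatch(reader, batchSize);
                        cmd.ExecuteNonQuery();      // one set-based call per batch
                    }
                }
            }
        }
    }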

Things which work, but are not nearly as flexible as a properly done TVP (i.e. passing in a method that returns IEnumerable<SqlDataRecord>). These are ok, but why dump the records into a temp table just to have to parse them into the destination when you can do it all inline? (A brief sketch of this route follows the list.)

  • BCP / OPENROWSET(BULK...) / BULK INSERT
  • .NET's SqlBulkCopy
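
For contrast, a minimal sketch of the SqlBulkCopy route (the dbo.ImportStaging table name is an assumption): you bulk-load into a staging table, and then a separate set-based call is still needed to apply the UPDATE to the real table, which is exactly the extra hop the TVP approach avoids:

    using System.Data;
    using System.Data.SqlClient;

    static class BulkCopyLoader
    {
        // Streams rows from any IDataReader into a staging table; a stored
        // procedure (not shown) would then UPDATE the destination from staging.
        static void LoadStaging(string connectionString, IDataReader rows)
        {
            using (var bulk = new SqlBulkCopy(connectionString))
            {
                bulk.DestinationTableName = "dbo.ImportStaging"; // assumed name
                bulk.BatchSize = 500;                            // rows per round trip
                bulk.WriteToServer(rows);
            }
        }
    }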
Answered by Solomon Rutzky on Nov 15 '22