
Parallel Bulk Inserting with SqlBulkCopy and Azure

I have an Azure app in the cloud with a SQL Azure database. I have a worker role which needs to do parsing + processing on a file (up to ~30 million rows), so I can't directly use BCP or SSIS.

I'm currently using SqlBulkCopy; however, this seems too slow, as I've seen load times of up to 4-5 minutes for 400k rows.

I want to run my bulk inserts in parallel; however, the articles on importing data in parallel/controlling lock behaviour say that parallel SqlBulkCopy requires the table to have no clustered index and a table lock (BU lock) to be specified. However, Azure tables must have a clustered index...

Is it even possible to use SqlBulkCopy in parallel on the same table in SQL Azure? If not, is there another API (that I can use in code) to do this?
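
A minimal sketch of the kind of single-threaded load described above, assuming the parsed rows end up in a DataTable (the connection string and table name are placeholders, not values from the question):

    // Straightforward SqlBulkCopy load of a parsed DataTable (placeholder names).
    using System.Data;
    using System.Data.SqlClient;

    static class Loader
    {
        public static void LoadRows(DataTable rows, string connectionString)
        {
            using (var bulk = new SqlBulkCopy(connectionString))
            {
                bulk.DestinationTableName = "dbo.ParsedRows"; // placeholder table name
                bulk.BulkCopyTimeout = 0;                     // don't time out on large loads
                bulk.WriteToServer(rows);                     // one sequential push to SQL Azure
            }
        }
    }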

asked Mar 01 '12 by kyliod

People also ask

Is bulk insert a single transaction?

The bulk insert operation is broken into batches; each batch is treated as its own transaction, so the whole operation isn't treated as a single transaction.
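
For the SqlBulkCopy API used in the question, the equivalent knob is the BatchSize property; combined with SqlBulkCopyOptions.UseInternalTransaction, each batch commits (or rolls back) in its own transaction. A minimal sketch with placeholder names:

    // Each BatchSize-sized batch is sent and committed in its own transaction,
    // so a failure rolls back only the current batch (placeholder names).
    using System.Data;
    using System.Data.SqlClient;

    static class BatchedLoader
    {
        public static void Load(DataTable rows, string connectionString)
        {
            using (var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.UseInternalTransaction))
            {
                bulk.DestinationTableName = "dbo.ParsedRows"; // placeholder
                bulk.BatchSize = 5000;                        // commit every 5,000 rows
                bulk.WriteToServer(rows);
            }
        }
    }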

What is codepage in bulk insert?

The CODEPAGE option is used when you need to load extended characters (values greater than 127). It lets you specify a code page for char, varchar, and text data types, for example ACP (convert from the ANSI/Microsoft Windows code page, ISO 1252, to the SQL Server code page) or OEM.

Why bulk insert is faster?

With BULK INSERT, only extent allocations are logged instead of the actual data being inserted, which provides much better performance than INSERT. The real advantage is reducing the amount of data written to the transaction log.


2 Answers

I don't see how you can run any faster than using SqlBulkCopy. On our project we can import 250K rows in about 3 mins, so your rate seems about right.

I don't think that doing it in parallel would help, even if it were technically possible. We only run one import at a time; otherwise SQL Azure starts timing out our requests.

In fact, sometimes even running a large GROUP BY query at the same time as the import isn't possible. SQL Azure does a lot of work to ensure quality of service; this includes timing out requests that take too long, use too many resources, etc.

So doing several large bulk inserts at the same time will probably cause one to time out.

answered Nov 12 '22 by Matt Warren


It is possible to run SqlBulkCopy in parallel against SQL Azure, even if you load the same table. You need to prepare your records in batches yourself before sending them to the SqlBulkCopy API. This will absolutely help with performance, and it allows you to retry a smaller batch of records when you get throttled for reasons outside of your control.
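
A hedged sketch of that approach: split the parsed rows into batches, load them with a few parallel SqlBulkCopy calls, and retry only the batch that fails when SQL Azure throttles. The batch size, degree of parallelism, table name, and retry policy below are illustrative placeholders, not values from this answer:

    // Batched, parallel SqlBulkCopy with a simple retry when a batch fails.
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.Threading;
    using System.Threading.Tasks;

    static class ParallelLoader
    {
        public static void Load(IEnumerable<DataTable> batches, string connectionString)
        {
            var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // keep concurrency modest on SQL Azure

            Parallel.ForEach(batches, options, batch =>
            {
                for (int attempt = 1; ; attempt++)
                {
                    try
                    {
                        using (var bulk = new SqlBulkCopy(connectionString))
                        {
                            bulk.DestinationTableName = "dbo.ParsedRows"; // placeholder
                            bulk.BulkCopyTimeout = 0;
                            bulk.WriteToServer(batch);
                        }
                        break; // this batch is done
                    }
                    catch (SqlException)
                    {
                        if (attempt >= 5) throw;                         // give up after a few tries
                        Thread.Sleep(TimeSpan.FromSeconds(5 * attempt)); // back off, then retry only this batch
                    }
                }
            });
        }
    }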

Take a look at my blog post comparing load times of various approaches; there is sample code as well. In separate tests I was able to cut the load time of a table in half.

This is the technique I am using for a couple of tools (Enzo Backup and Enzo Data Copy). It's not a simple thing to do, but when done properly you can improve load times significantly.

answered Nov 12 '22 by Herve Roggero