I have a data factory with a pipeline copy activity like this:
{
    "type": "Copy",
    "name": "Copy from storage to SQL",
    "inputs": [
        {
            "name": "storageDatasetName"
        }
    ],
    "outputs": [
        {
            "name": "sqlOutputDatasetName"
        }
    ],
    "typeProperties": {
        "source": {
            "type": "BlobSource"
        },
        "sink": {
            "type": "SqlSink"
        }
    },
    "policy": {
        "concurrency": 1,
        "retry": 3
    },
    "scheduler": {
        "frequency": "Month",
        "interval": 1
    }
}
The input data is approximately 90 MB in size, about 1.5 million rows, broken into approximately 20 block blob files of ~4.5 MB each in Azure Storage. Here's an example of the data (CSV):
A81001,1,1,1,2,600,3.0,0.47236654,141.70996,0.70854986
A81001,4,11,0,25,588,243.0,5.904582,138.87576,57.392536
A81001,7,4,1,32,1342,278.0,7.5578647,316.95795,65.65895
The sink is an Azure SQL Server of type S2, which is rated at 50 DTUs. I've created a simple table with sensible data types, and no keys, indexes or anything fancy, just columns:
CREATE TABLE [dbo].[Prescriptions](
    [Practice] [char](6) NOT NULL,
    [BnfChapter] [tinyint] NOT NULL,
    [BnfSection] [tinyint] NOT NULL,
    [BnfParagraph] [tinyint] NOT NULL,
    [TotalItems] [int] NOT NULL,
    [TotalQty] [int] NOT NULL,
    [TotalActCost] [float] NOT NULL,
    [TotalItemsPerThousand] [float] NOT NULL,
    [TotalQtyPerThousand] [float] NOT NULL,
    [TotalActCostPerThousand] [float] NOT NULL
)
The source, sink and data factory are all in the same region (North Europe).
According to Microsoft's 'Copy activity performance and tuning guide', for an Azure Storage source and an Azure SQL S2 sink, I should be getting about 0.4 MBps. By my calculation, that means 90 MB should transfer in about half an hour (is that right?).
For some reason it copies 70,000 rows very quickly, then seems to hang. Using SQL Server Management Studio I can see the count of rows in the database table is exactly 70,000 and hasn't increased at all in 7 hours. Yet the copy task is still running with no errors.
Any ideas why this is hanging at 70,000 rows? I can't see anything unusual about the 70,001st data row which would cause a problem. I've tried completely trashing the data factory and starting again, and I always get the same behaviour. I have another copy activity with a smaller table (8,000 rows), which completes in 1 minute.
To improve performance, you can use staged copy to compress the data on-premises so that it takes less time to move data to the staging data store in the cloud. Then you can decompress the data in the staging store before you load into the destination data store.
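In the copy activity's typeProperties, staged copy is switched on with the enableStaging and stagingSettings properties. A sketch, assuming a staging storage linked service named stagingStorageLinkedService and a container path (both names are placeholders, not from the question):

```json
"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlSink" },
    "enableStaging": true,
    "stagingSettings": {
        "linkedServiceName": "stagingStorageLinkedService",
        "path": "stagingcontainer/path",
        "enableCompression": true
    }
}
```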
To load data from Azure blob storage and save it in a table inside of your database, use the CREATE TABLE AS SELECT (CTAS) T-SQL statement.
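Note that CTAS is a feature of Azure SQL Data Warehouse (now Azure Synapse Analytics), not of Azure SQL Database like the S2 sink in the question. A sketch, assuming an external table [ext].[PrescriptionsExternal] has already been defined over the blobs (that name and the distribution choice are illustrative):

```sql
-- CTAS: create and populate the table in one statement from an
-- external table over the source blobs (Synapse / SQL DW only)
CREATE TABLE [dbo].[Prescriptions]
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM [ext].[PrescriptionsExternal];
```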
Just to answer my own question in case it helps anyone else:
The issue was with null values. The reason my run was hanging at 70,000 rows was that at row 76,560 of my blob source file, there was a null value in one of the columns. The Hive script I had used to generate this blob file had written the null value as '\N'. Also, my sink SQL table specified NOT NULL on that column, and the column was a FLOAT value.
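A row like that can be tracked down without Data Factory by scanning the source CSV for the Hive null marker. A minimal sketch in Python (the sample rows and the position of the '\N' are illustrative, not the actual data):

```python
import csv
import io

def find_null_rows(lines, marker="\\N"):
    """Return 1-based row numbers whose fields contain the Hive null marker."""
    bad = []
    for i, row in enumerate(csv.reader(lines), start=1):
        if marker in row:
            bad.append(i)
    return bad

# Hypothetical sample of the blob's CSV content; the third row contains
# Hive's '\N' marker, which SqlSink cannot coerce to a NOT NULL float.
sample = io.StringIO(
    "A81001,1,1,1,2,600,3.0,0.47236654,141.70996,0.70854986\n"
    "A81001,4,11,0,25,588,243.0,5.904582,138.87576,57.392536\n"
    "A81001,7,4,1,32,1342,\\N,7.5578647,316.95795,65.65895\n"
)

print(find_null_rows(sample))  # -> [3]
```

Running this over each of the ~20 blobs would have pointed straight at row 76,560.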
So I made two changes: added the following property to my blob dataset definition:
"nullValue": "\\N"
And made my SQL table column nullable. It now runs completely and doesn't hang! :)
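The column change is a one-line ALTER; TotalActCost is used as a stand-in here, since the answer doesn't name which float column held the '\N' value:

```sql
-- Allow NULLs in the affected float column (column name is illustrative)
ALTER TABLE [dbo].[Prescriptions]
    ALTER COLUMN [TotalActCost] [float] NULL;
```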
The problem is that the Data Factory did not error; it just got stuck. It would have been nice if the job had failed with a useful error message and told me which row of the data was the problem. I think that because the write batch size is 10,000 by default, this is why it got stuck at 70,000 rows and not at 76,560.
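For anyone who wants to control that batching, the batch size can be set explicitly on the sink (10,000 being the default mentioned above):

```json
"sink": {
    "type": "SqlSink",
    "writeBatchSize": 10000
}
```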