I have a data factory with a pipeline copy activity like this:
{
    "type": "Copy",
    "name": "Copy from storage to SQL",
    "inputs": [
        {
            "name": "storageDatasetName"
        }
    ],
    "outputs": [
        {
            "name": "sqlOutputDatasetName"
        }
    ],
    "typeProperties": {
        "source": {
            "type": "BlobSource"
        },
        "sink": {
            "type": "SqlSink"
        }
    },
    "policy": {
        "concurrency": 1,
        "retry": 3
    },
    "scheduler": {
        "frequency": "Month",
        "interval": 1
    }
}
The input data is approximately 90 MB in size, about 1.5 million rows, broken into approximately 20 block blob files of ~4.5 MB each in Azure Storage. Here's an example of the data (CSV):
A81001,1,1,1,2,600,3.0,0.47236654,141.70996,0.70854986
A81001,4,11,0,25,588,243.0,5.904582,138.87576,57.392536
A81001,7,4,1,32,1342,278.0,7.5578647,316.95795,65.65895
The sink is an Azure SQL Server of type S2, which is rated at 50 DTUs. I've created a simple table with sensible data types, and no keys, indexes or anything fancy, just columns:
CREATE TABLE [dbo].[Prescriptions](
    [Practice] [char](6) NOT NULL,
    [BnfChapter] [tinyint] NOT NULL,
    [BnfSection] [tinyint] NOT NULL,
    [BnfParagraph] [tinyint] NOT NULL,
    [TotalItems] [int] NOT NULL,
    [TotalQty] [int] NOT NULL,
    [TotalActCost] [float] NOT NULL,
    [TotalItemsPerThousand] [float] NOT NULL,
    [TotalQtyPerThousand] [float] NOT NULL,
    [TotalActCostPerThousand] [float] NOT NULL
)
The source, sink and data factory are all in the same region (North Europe).
According to Microsoft's 'Copy activity performance and tuning guide', for an Azure Storage source and an Azure SQL S2 sink, I should be getting about 0.4 MBps. By my calculation, that means 90 MB should transfer in about half an hour (is that right?).
For some reason it copies 70,000 rows very quickly, then seems to hang. Using SQL Server Management Studio I can see the count of rows in the database table is exactly 70,000 and hasn't increased at all in 7 hours. Yet the copy task is still running with no errors.
Any ideas why this is hanging at 70,000 rows? I can't see anything unusual about the 70,001st data row which would cause a problem. I've tried completely trashing the data factory and starting again, and I always get the same behaviour. I have another copy activity with a smaller table (8,000 rows), which completes in 1 minute.
To improve performance, you can use staged copy to compress the data on-premises so that it takes less time to move data to the staging data store in the cloud. Then you can decompress the data in the staging store before you load into the destination data store.
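In the copy activity's typeProperties, staged copy is switched on with the enableStaging and stagingSettings properties. A sketch, assuming a staging storage linked service named stagingStorageLinkedService and a container path (both names are placeholders, not from the question):

```json
"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlSink" },
    "enableStaging": true,
    "stagingSettings": {
        "linkedServiceName": "stagingStorageLinkedService",
        "path": "stagingcontainer/path",
        "enableCompression": true
    }
}
```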
To load data from Azure blob storage and save it in a table inside of your database, use the CREATE TABLE AS SELECT (CTAS) T-SQL statement.
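Note that CTAS is a feature of Azure SQL Data Warehouse (now Azure Synapse Analytics), not of Azure SQL Database like the S2 sink in the question. A sketch, assuming an external table [ext].[PrescriptionsExternal] has already been defined over the blobs (that name and the distribution choice are illustrative):

```sql
-- CTAS: create and populate the table in one statement from an
-- external table over the source blobs (Synapse / SQL DW only)
CREATE TABLE [dbo].[Prescriptions]
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM [ext].[PrescriptionsExternal];
```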
Just to answer my own question in case it helps anyone else:
The issue was with null values. The reason my run was hanging at 70,000 rows was that at row 76,560 of my blob source file, there was a null value in one of the columns. The Hive script I had used to generate this blob file had written the null value as '\N'. Also, my sink SQL table specified NOT NULL on that column, and the column was a FLOAT value.
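A row like that can be tracked down without Data Factory by scanning the source CSV for the Hive null marker. A minimal sketch in Python (the sample rows and the position of the '\N' are illustrative, not the actual data):

```python
import csv
import io

def find_null_rows(lines, marker="\\N"):
    """Return 1-based row numbers whose fields contain the Hive null marker."""
    bad = []
    for i, row in enumerate(csv.reader(lines), start=1):
        if marker in row:
            bad.append(i)
    return bad

# Hypothetical sample of the blob's CSV content; the third row contains
# Hive's '\N' marker, which SqlSink cannot coerce to a NOT NULL float.
sample = io.StringIO(
    "A81001,1,1,1,2,600,3.0,0.47236654,141.70996,0.70854986\n"
    "A81001,4,11,0,25,588,243.0,5.904582,138.87576,57.392536\n"
    "A81001,7,4,1,32,1342,\\N,7.5578647,316.95795,65.65895\n"
)

print(find_null_rows(sample))  # -> [3]
```

Running this over each of the ~20 blobs would have pointed straight at row 76,560.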
So I made two changes: added the following property to my blob dataset definition:
"nullValue": "\\N"
And made my SQL table column nullable. It now runs completely and doesn't hang! :)
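The column change is a one-line ALTER; TotalActCost is used as a stand-in here, since the answer doesn't name which float column held the '\N' value:

```sql
-- Allow NULLs in the affected float column (column name is illustrative)
ALTER TABLE [dbo].[Prescriptions]
    ALTER COLUMN [TotalActCost] [float] NULL;
```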
The problem is that the Data Factory did not error; it just got stuck. It would have been nice if the job had failed with a useful error message and told me which row of the data was the problem. I think that because the write batch size is 10,000 by default, this is why it got stuck at 70,000 rows and not at 76,560.
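For anyone who wants to control that batching, the batch size can be set explicitly on the sink (10,000 being the default mentioned above):

```json
"sink": {
    "type": "SqlSink",
    "writeBatchSize": 10000
}
```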