Batch Uploading Huge Sets of Images to Azure Blob Storage

I have about 110,000 images of various formats (jpg, png and gif) and sizes (2-40KB) stored locally on my hard drive. I need to upload them to Azure Blob Storage. While doing this, I need to set some metadata and the blob's ContentType, but otherwise it's a straight up bulk upload.

I'm currently using the following to upload one image at a time (parallelized over 5-10 concurrent Tasks).

static void UploadPhoto(Image pic, string filename, ImageFormat format)
{
    //convert image to bytes
    using(MemoryStream ms = new MemoryStream())
    {
        pic.Save(ms, format);
        ms.Position = 0;

        //create the blob, set metadata and properties
        var blob = container.GetBlobReference(filename);
        blob.Metadata["Filename"] = filename;
        blob.Properties.ContentType = MimeHandler.GetContentType(Path.GetExtension(filename));

        //upload!
        blob.UploadFromStream(ms);
        blob.SetMetadata();
        blob.SetProperties();
    }
}

I was wondering if there was another technique I could employ to handle the uploading, to make it as fast as possible. This particular project involves importing a lot of data from one system to another, and for customer reasons it needs to happen as quickly as possible.

asked Oct 10 '11 22:10 by Dusda



4 Answers

Okay, here's what I did. I tinkered around with running BeginUploadFromStream(), then BeginSetMetadata(), then BeginSetProperties() in an asynchronous chain, parallelized over 5-10 threads (a combination of ElvisLive's and knightpfhor's suggestions). This worked, but anything over 5 threads performed terribly, with each thread (working on a page of ten images at a time) taking upwards of 20 seconds to complete a page.

So, to sum up the performance differences:

  • Asynchronous: 5 threads, each running an async chain, each working on ten images at a time (paged for statistical reasons): ~15.8 seconds (per thread).
  • Synchronous: 1 thread, ten images at a time (paged for statistical reasons): ~3.4 seconds

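Okay, that's pretty interesting. One instance uploading blobs synchronously finished each page nearly 5x faster than each thread in the other approach. So even the best async balance of 5 threads nets essentially the same overall throughput: 5 threads x 10 images / 15.8 seconds is about 3.2 images per second, versus 10 images / 3.4 seconds, or about 2.9 images per second, for the single synchronous instance.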

So, I tweaked my image file importing to separate the images into folders containing 10,000 images each. Then I used Process.Start() to launch an instance of my blob uploader for each folder. I have 170,000 images to work with in this batch, so that means 17 instances of the uploader. When running all of those on my laptop, performance across all of them leveled out at ~4.3 seconds per set of ten images.

Long story short, instead of trying to get threading working optimally, I just run a blob uploader instance for every 10,000 images, all on the one machine at the same time. Total performance boost?

  • Async Attempts: 14-16 hours, based on average execution time when running it for an hour or two.
  • Synchronous with 17 separate instances: ~1 hour, 5 minutes.
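
As a rough sketch of the folder-per-process approach described above (not the author's actual code): the helper below splits a file list into batches and starts one uploader process per batch folder. The names BatchLauncher, Partition, LaunchUploaders and the uploaderExe argument are hypothetical, and the uploader is assumed to be a console app that takes a folder path as its only argument.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class BatchLauncher
{
    // Split file paths into consecutive batches of at most batchSize items,
    // preserving the original order.
    public static List<List<string>> Partition(IEnumerable<string> files, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        return files
            .Select((path, index) => new { path, index })
            .GroupBy(x => x.index / batchSize)
            .Select(g => g.Select(x => x.path).ToList())
            .ToList();
    }

    // Start one uploader process per batch folder, then wait for all of them.
    // Assumption: uploaderExe is a hypothetical console app taking one folder
    // argument, and folder paths carry no trailing backslash (it would escape
    // the closing quote).
    public static void LaunchUploaders(string uploaderExe, IEnumerable<string> batchFolders)
    {
        var processes = batchFolders
            .Select(folder => Process.Start(new ProcessStartInfo
            {
                FileName = uploaderExe,
                Arguments = "\"" + folder + "\"",
                UseShellExecute = false
            }))
            .ToList();

        foreach (var p in processes)
        {
            if (p == null) continue; // Process.Start can return null
            p.WaitForExit();
            p.Dispose();
        }
    }
}
```

With 170,000 paths and a batch size of 10,000, Partition yields 17 batches, which matches the 17 uploader instances mentioned above.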
answered Sep 30 '22 13:09 by Dusda


You should definitely upload in parallel over several streams (i.e. post multiple files concurrently), but before you run any experiment that (erroneously) shows there is no benefit, make sure you actually increase the value of ServicePointManager.DefaultConnectionLimit:

The maximum number of concurrent connections allowed by a ServicePoint object. The default value is 2.

With a default value of 2, you can have at most two outstanding HTTP requests against any destination.
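
A minimal sketch of raising the limit, assuming the classic .NET Framework HTTP stack. The class name ConnectionSetup and the value 64 are illustrative choices, not something the answer prescribes. The property only affects ServicePoint objects created after the change, so it should be set once at startup, before the first request to the storage endpoint.

```csharp
using System;
using System.Net;

static class ConnectionSetup
{
    // Raise the default per-host cap on concurrent HTTP connections and
    // return the value now in effect. Call this before issuing any request.
    public static int RaiseConnectionLimit(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
        return ServicePointManager.DefaultConnectionLimit;
    }
}
```

For example, calling ConnectionSetup.RaiseConnectionLimit(64) at the top of Main, before the blob client sends its first request, lets up to 64 uploads to the same storage account be in flight at once instead of 2.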

answered Sep 30 '22 14:09 by Remus Rusanu


As the files that you're uploading are pretty small, I think the code that you've written is probably about as efficient as you can get. Based on your comment it looks like you've tried running these uploads in parallel which was really the only other code suggestion I had.

I suspect that getting the greatest throughput will come down to finding the right number of threads for your hardware, your connection and your file size. You could try using the Azure Throughput Analyzer to make finding this balance easier.

Microsoft's Extreme Computing group also has benchmarks and suggestions on improving throughput. They focus on throughput from worker roles deployed on Azure, but they will give you an idea of the best you could hope for.

answered Sep 30 '22 12:09 by knightpfhor


You may want to increase ParallelOperationThreadCount as shown below. I haven't checked the latest SDK, but in 1.3 the limit was 64. Not setting this value resulted in fewer concurrent operations.

CloudBlobClient blobStorage = new CloudBlobClient(config.AccountUrl, creds);
// todo: set this in blob extensions
blobStorage.ParallelOperationThreadCount = 64;
answered Sep 30 '22 14:09 by makerofthings7