Moving Millions of items from one Storage Account to Another

I have somewhere in the neighborhood of 4.2 million images I need to move from North Central US to West US, as part of a large migration to take advantage of Azure VM support (for those who don't know, North Central US does not support them). The images are all in one container, split into about 119,000 directories.

I'm using the following method, which issues copies via the Copy Blob API:

public static void CopyBlobDirectory(
        CloudBlobDirectory srcDirectory,
        CloudBlobContainer destContainer)
{
    // get the SAS token to use for all blobs
    string blobToken = srcDirectory.Container.GetSharedAccessSignature(
        new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read |
                            SharedAccessBlobPermissions.Write,
            SharedAccessExpiryTime = DateTime.UtcNow + TimeSpan.FromDays(14)
        });

    var srcBlobList = srcDirectory.ListBlobs(
        useFlatBlobListing: true,
        blobListingDetails: BlobListingDetails.None).ToList();

    foreach (var src in srcBlobList)
    {
        var srcBlob = src as ICloudBlob;

        // Create appropriate destination blob type to match the source blob
        ICloudBlob destBlob;
        if (srcBlob.Properties.BlobType == BlobType.BlockBlob)
            destBlob = destContainer.GetBlockBlobReference(srcBlob.Name);
        else
            destBlob = destContainer.GetPageBlobReference(srcBlob.Name);

        // copy using src blob as SAS
        destBlob.BeginStartCopyFromBlob(new Uri(srcBlob.Uri.AbsoluteUri + blobToken), null, null);          
    }
}

The problem is, it's too slow. Waaaay too slow. At the rate it's taking just to issue the copy commands, it is going to take somewhere in the neighborhood of four days to get through everything. I'm not really sure what the bottleneck is (client-side connection limits, rate limiting on Azure's end, multithreading, etc.).

So, I'm wondering what my options are. Is there any way to speed things up, or am I just stuck with a job that will take four days to complete?

Edit: How I'm distributing the work to copy everything

//set up tracing
InitTracer();

//grab a set of photos to benchmark this
var photos = PhotoHelper.GetAllPhotos().Take(500).ToList();

//account to copy from
var from = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(
    "oldAccount",
    "oldAccountKey");
var fromAcct = new CloudStorageAccount(from, true);
var fromClient = fromAcct.CreateCloudBlobClient();
var fromContainer = fromClient.GetContainerReference("userphotos");

//account to copy to
var to = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(
    "newAccount",
    "newAccountKey");
var toAcct = new CloudStorageAccount(to, true);
var toClient = toAcct.CreateCloudBlobClient();

Trace.WriteLine("Starting Copy: " + DateTime.UtcNow.ToString());

//enumerate sub directories, then move them to blob storage
//note: it doesn't care how high I set the Parallelism to,
//console output indicates it won't run more than five or so at a time
var plo = new ParallelOptions { MaxDegreeOfParallelism = 10 };
Parallel.ForEach(photos, plo, (info) =>
{
    CloudBlobDirectory fromDir = fromContainer.GetDirectoryReference(info.BuildingId.ToString());

    var toContainer = toClient.GetContainerReference(info.Id.ToString());
    toContainer.CreateIfNotExists();

    Trace.WriteLine(info.BuildingId + ": Starting copy, " + info.Photos.Length + " photos...");

    BlobHelper.CopyBlobDirectory(fromDir, toContainer, info);
    //this monitors the container, so I can restart any failed
    //copies if something goes wrong (a sketch of this kind of
    //monitor is shown after this block)
    BlobHelper.MonitorCopy(toContainer);
});

Trace.WriteLine("Done: " + DateTime.UtcNow.ToString());
asked by Dusda

1 Answer

The async blob copy operation is very fast within the same data center (I recently copied a 30 GB VHD to another blob in about 1-2 seconds). Across data centers, the operation is queued up and runs on spare capacity with no SLA (see this article, which calls that out specifically).

To put that into perspective: I copied the same 30GB VHD across data centers and it took around 1 hour.

I don't know your image sizes, but assuming an average of 500 KB per image, you're looking at about 2,000 GB. In my example, I saw throughput of roughly 30 GB per hour. Extrapolating, that puts your 2,000 GB at roughly 2000 / 30 ≈ 67 hours. Again, no SLA; just a best guess.

Someone else suggested disabling Nagle's algorithm. That should help push the 4 million copy commands out faster and get them queued up sooner, but I don't think it will have any effect on the copy time itself.
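
For what it's worth, a minimal sketch of those client-side tweaks (disabling Nagle and raising the default connection limit via System.Net.ServicePointManager; the exact values are illustrative, not prescriptive) looks like this, applied once at startup before any blob clients are created:

using System.Net;

// Tune the client-side HTTP stack before issuing storage requests.
// Values here are illustrative; adjust for your own workload.
ServicePointManager.UseNagleAlgorithm = false;    // don't batch small requests
ServicePointManager.Expect100Continue = false;    // skip the 100-Continue handshake
ServicePointManager.DefaultConnectionLimit = 100; // default is only 2 per endpoint for client apps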

answered by David Makogon