I have somewhere in the neighborhood of 4.2 million images I need to move from North Central US to West US, as part of a large migration to take advantage of Azure VM support (for those who don't know, North Central US does not support them). The images are all in one container, split into about 119,000 directories.
I'm using the following from the Copy Blob API:
public static void CopyBlobDirectory(
    CloudBlobDirectory srcDirectory,
    CloudBlobContainer destContainer)
{
    // get the SAS token to use for all blobs
    string blobToken = srcDirectory.Container.GetSharedAccessSignature(
        new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read |
                          SharedAccessBlobPermissions.Write,
            SharedAccessExpiryTime = DateTime.UtcNow + TimeSpan.FromDays(14)
        });

    var srcBlobList = srcDirectory.ListBlobs(
        useFlatBlobListing: true,
        blobListingDetails: BlobListingDetails.None).ToList();

    foreach (var src in srcBlobList)
    {
        var srcBlob = src as ICloudBlob;

        // Create appropriate destination blob type to match the source blob
        ICloudBlob destBlob;
        if (srcBlob.Properties.BlobType == BlobType.BlockBlob)
            destBlob = destContainer.GetBlockBlobReference(srcBlob.Name);
        else
            destBlob = destContainer.GetPageBlobReference(srcBlob.Name);

        // copy using src blob as SAS
        destBlob.BeginStartCopyFromBlob(new Uri(srcBlob.Uri.AbsoluteUri + blobToken), null, null);
    }
}
The problem is, it's too slow. Waaaay too slow. At the rate the copy commands are being issued, it is going to take somewhere in the neighborhood of four days. I'm not really sure what the bottleneck is (client-side connection limit, rate limiting on Azure's end, multithreading, etc.).
So, I'm wondering what my options are. Is there any way to speed things up, or am I just stuck with a job that will take four days to complete?
Edit: How I'm distributing the work to copy everything
//set up tracing
InitTracer();
//grab a set of photos to benchmark this
var photos = PhotoHelper.GetAllPhotos().Take(500).ToList();
//account to copy from
var from = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(
"oldAccount",
"oldAccountKey");
var fromAcct = new CloudStorageAccount(from, true);
var fromClient = fromAcct.CreateCloudBlobClient();
var fromContainer = fromClient.GetContainerReference("userphotos");
//account to copy to
var to = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(
"newAccount",
"newAccountKey");
var toAcct = new CloudStorageAccount(to, true);
var toClient = toAcct.CreateCloudBlobClient();
Trace.WriteLine("Starting Copy: " + DateTime.UtcNow.ToString());
//enumerate sub directories, then move them to blob storage
//note: it doesn't care how high I set the Parallelism to,
//console output indicates it won't run more than five or so at a time
var plo = new ParallelOptions { MaxDegreeOfParallelism = 10 };
Parallel.ForEach(photos, plo, (info) =>
{
    CloudBlobDirectory fromDir = fromContainer.GetDirectoryReference(info.BuildingId.ToString());
    var toContainer = toClient.GetContainerReference(info.Id.ToString());
    toContainer.CreateIfNotExists();
    Trace.WriteLine(info.BuildingId + ": Starting copy, " + info.Photos.Length + " photos...");
    BlobHelper.CopyBlobDirectory(fromDir, toContainer, info);
    //this monitors the container, so I can restart any failed
    //copies if something goes wrong
    BlobHelper.MonitorCopy(toContainer);
});
Trace.WriteLine("Done: " + DateTime.UtcNow.ToString());
The async blob copy operation is going to be very fast within the same data center (recently I copied a 30 GB VHD to another blob in about 1-2 seconds). Across data centers, the operation is queued up and runs on spare capacity with no SLA (see this article which calls that out specifically).
To put that into perspective: I copied the same 30GB VHD across data centers and it took around 1 hour.
I don't know your image sizes, but assuming a 500 KB average image size, you're looking at about 2,000 GB. In my example, I saw throughput of 30 GB in about an hour. Extrapolating, that puts your 2,000 GB of data at about 2000/30 ≈ 67 hours. Again, no SLA. Just a best guess.
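Since the real average image size is unknown, here is a minimal sketch of the same back-of-the-envelope estimate that you can rerun with your own numbers. The class and method names are made up for illustration, the 500 KB average and 30 GB/hour throughput are the assumptions from above, and using the exact 4.2 million count gives about 2,100 GB rather than the rounded 2,000 GB.

```csharp
using System;

public static class CopyEstimate
{
    // Estimated hours = total data size / observed throughput.
    // Uses decimal units: 1 GB = 1,000,000 KB.
    public static double EstimateHours(long blobCount, double avgBlobKb, double gbPerHour)
    {
        double totalGb = blobCount * avgBlobKb / 1000000.0;
        return totalGb / gbPerHour;
    }

    public static void Main()
    {
        // 4.2 million images, assumed 500 KB each, assumed 30 GB/hour (one observation, no SLA)
        double hours = EstimateHours(4200000, 500, 30);
        Console.WriteLine("Estimated copy time: {0:F0} hours", hours);
    }
}
```

With those inputs it works out to roughly 70 hours. Treat the throughput figure as a single data point, not a guarantee.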
Someone else suggested disabling Nagle's algorithm. That should help push the 4 million copy commands out faster and get them queued up sooner. I don't think it will have any effect on the copy time itself.
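For reference, a minimal sketch of what that client-side change might look like in a .NET Framework app. The class name and the limit of 100 are assumptions for illustration, not tuned values. I've also raised the outbound connection limit, since the default for client apps is low and may be part of why you don't see more than a handful of requests in flight. Both settings only affect connections created after they are set, so call this before creating any blob clients.

```csharp
using System;
using System.Net;

public static class CopyClientSetup
{
    // Call once at startup, before the first storage request is made.
    public static void Configure()
    {
        // Send small requests (like copy commands) immediately instead of batching them.
        ServicePointManager.UseNagleAlgorithm = false;

        // Allow more concurrent connections per endpoint (100 is an assumed starting point).
        ServicePointManager.DefaultConnectionLimit = 100;
    }
}
```

This only speeds up issuing the copy commands. The cross-data-center copy itself still runs on Azure's side at whatever rate spare capacity allows.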