I have been tasked with downloading around 100 million rows of data from Azure Table Storage. The important thing here is speed.
The process we are using downloads 10,000 rows at a time from Azure Table Storage and processes them into a local instance of SQL Server. While processing the rows, it deletes them from the Azure table 100 rows at a time. The work is threaded, with 8 threads each downloading 10,000 rows at a time.
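For reference, a minimal sketch of the per-thread loop described above, written against the modern azure-data-tables and pyodbc Python packages (the original setup predates these and used the .NET client). The connection strings, table and column names, and the way partitions are assigned to threads are all assumptions for illustration:

from concurrent.futures import ThreadPoolExecutor

import pyodbc
from azure.data.tables import TableClient

AZURE_CONN = "<storage-connection-string>"   # placeholder
SQL_CONN = "<local-sql-server-connection>"   # placeholder

def drain_partition(partition_key: str) -> None:
    table = TableClient.from_connection_string(AZURE_CONN, table_name="CommandLogTable")
    sql = pyodbc.connect(SQL_CONN)
    cursor = sql.cursor()

    # Pull this partition's entities; the service pages results, and 1,000
    # entities is the maximum page size per request.
    entities = list(table.query_entities(
        query_filter=f"PartitionKey eq '{partition_key}'",
        results_per_page=1000,
    ))

    # Load the batch into the local SQL Server instance (schema is assumed).
    cursor.executemany(
        "INSERT INTO CommandLog (PartitionKey, RowKey) VALUES (?, ?)",
        [(e["PartitionKey"], e["RowKey"]) for e in entities],
    )
    sql.commit()

    # Delete from the Azure table 100 rows at a time; an entity group
    # transaction allows at most 100 operations, all in the same partition.
    for i in range(0, len(entities), 100):
        table.submit_transaction([("delete", e) for e in entities[i:i + 100]])

# 8 worker threads, one partition per task (the partition list is assumed).
partitions = ["p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7"]
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(drain_partition, partitions))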
The only problem is that, according to our calculations, it will take around 40 days to download and process the roughly 100 million rows we have stored. Does anyone know a faster way to accomplish this task?
A side question: during the download process, Azure will send back XML that simply contains no data. It doesn't send back an error; it just sends this:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="azure-url/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
<title type="text">CommandLogTable</title>
<id>azure-url/CommandLogTable</id>
<updated>2010-07-12T19:50:55Z</updated>
<link rel="self" title="CommandLogTable" href="CommandLogTable" />
</feed>
Does anyone else have this problem and have a fix for it?
When querying entities out of Azure, the fastest possible way is to query with both the PartitionKey and RowKey. These are the only indexed fields in Table Storage, and any query that uses both will be returned in a matter of a few milliseconds.
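As a rough illustration only, a point query with the Python azure-data-tables SDK looks like this (the connection string and key values are placeholders; the original question would have used the .NET storage client):

from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    "<storage-connection-string>",   # placeholder
    table_name="CommandLogTable",
)

# Supplying both keys makes this a point lookup against the index rather
# than a partition or full-table scan.
entity = table.get_entity(partition_key="2010-07", row_key="00042")   # placeholder keys
print(entity)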
An entity in Azure Table Storage can be up to 1 MB in size (up to 2 MB in Azure Cosmos DB). A property is a name-value pair, and each entity can include up to 252 properties to store data.
Aside from suggestions about bandwidth limits, you could easily be running into storage account limits, as each table partition is limited to roughly 500 transactions per second.
Further, there's an optimization in play (Nagle's algorithm) that can actually slow things down for small reads, such as your 1K data reads. There's a blog post about disabling Nagling, which could potentially speed up your reads considerably, especially if you're running directly in an Azure service without Internet latency in the way.
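For illustration, this is what disabling Nagle's algorithm looks like at the raw socket level in Python (the .NET storage client exposes it as a client setting instead; the host here is a placeholder):

import socket

# Nagle's algorithm coalesces small writes into larger packets, which adds
# latency to small request/response exchanges such as 1K entity reads.
sock = socket.create_connection(("example.table.core.windows.net", 443))   # placeholder host
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # 1 = Nagle off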