Short version: Can we read from dozens or hundreds of table partitions in a multi-threaded manner to increase performance by orders of magnitude?
Long version: We're working on a system that stores millions of rows in Azure Table storage. We partition the data into small partitions, each containing about 500 records, which represents a day's worth of data for a unit.
Since Azure doesn't have a "sum" feature, to pull a year's worth of data we either have to use some pre-caching, or sum the data ourselves in an Azure web or worker role.
Assuming the following:
- Reading one partition doesn't affect the performance of reading another
- Reading a partition is bottlenecked by network speed and server retrieval time
We can then take a guess that if we wanted to quickly sum a lot of data on the fly (one year, 365 partitions), we could use a massively parallel algorithm and it would scale almost perfectly with the number of threads. For example, we could use the .NET parallel extensions with 50+ threads and get a HUGE performance boost.
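A minimal sketch of that fan-out, using the modern Azure.Data.Tables SDK with async I/O in place of raw threads (since the work is latency-bound, tasks waiting on I/O are cheaper than dedicated threads). The "unit-yyyy-ddd" partition key format and the numeric "Value" column are assumptions standing in for your actual schema:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;

class YearSummer
{
    // Fans out one query per daily partition and sums a numeric column.
    // The key format and the "Value" property are hypothetical; substitute
    // your real partition scheme and column name.
    static async Task<double> SumYearAsync(TableClient table, string unit, int year)
    {
        var tasks = Enumerable.Range(1, 365).Select(async day =>
        {
            string pk = $"{unit}-{year}-{day:D3}";
            double partial = 0;
            await foreach (TableEntity e in
                table.QueryAsync<TableEntity>(x => x.PartitionKey == pk))
            {
                partial += e.GetDouble("Value") ?? 0;
            }
            return partial;
        });

        double[] partials = await Task.WhenAll(tasks); // all 365 queries in flight
        return partials.Sum();
    }
}
```

In practice you would likely cap the fan-out rather than launch all 365 queries at once, for the throttling reasons discussed below.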
We're working on setting up some experiments, but I wanted to see if this has been done before. Since the .NET side is basically idle waiting on high-latency operations, this seems perfect for multi-threading.
With Azure Table storage, throughput is limited to 20,000 operations per second per storage account, while Azure Cosmos DB supports throughput of up to 10 million operations per second.
Currently, Azure Tables support at most 255 properties (columns) on a single entity (row): 252 user-defined name-value pairs plus the three system properties (PartitionKey, RowKey, Timestamp). The maximum entity size is 1 MB, whereas an entity in Azure Cosmos DB can be up to 2 MB. Azure Table storage itself is a relatively low-priced, scalable NoSQL store, suitable for holding large amounts of data while keeping costs low.
There are limits on the number of transactions that can be performed against a storage account, and against a particular partition or storage server, in a given time period (historically somewhere around 500 requests per second per partition). So in that sense there is a practical limit to the number of requests you could execute in parallel (before it begins to look like a DoS attack).
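One hedged way to stay under those limits is to cap the number of in-flight requests with a semaphore instead of firing every query at once. The cap of 25 below is an illustrative guess to tune experimentally, not a documented figure:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottledFanOut
{
    // Runs the supplied partition queries with at most maxParallel in
    // flight, so 365 partitions don't hit the account as a single burst.
    static async Task<double[]> RunAsync(
        IEnumerable<Func<Task<double>>> partitionQueries, int maxParallel = 25)
    {
        using var gate = new SemaphoreSlim(maxParallel);
        var tasks = partitionQueries.Select(async query =>
        {
            await gate.WaitAsync();        // wait for a free slot
            try { return await query(); }
            finally { gate.Release(); }    // free the slot for the next query
        }).ToList();                       // materialize so all wrappers start now

        return await Task.WhenAll(tasks);
    }
}
```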
Also, in implementation, I would be wary of the concurrent connection limits imposed on the client, such as by System.Net.ServicePointManager. I am not sure whether the Azure storage client is subject to those limits; if it is, they might require adjustment.
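If you are on the .NET Framework HTTP stack, the relevant knobs look roughly like this; the values are illustrative, and newer .NET (Core) HTTP clients do not impose the old two-connections-per-host default:

```csharp
using System.Net;

static class HttpTuning
{
    // Call once at startup, before any storage requests are issued.
    // These settings only affect the HttpWebRequest-based stack.
    static void Configure()
    {
        ServicePointManager.DefaultConnectionLimit = 100; // default was 2 per host
        ServicePointManager.Expect100Continue = false;    // skip 100-continue round trip
        ServicePointManager.UseNagleAlgorithm = false;    // lower latency for small payloads
    }
}
```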