How do we increase the speed of this query?
We have approximately 100 consumers executing the following query within the span of 1-2 minutes. Each of these runs represents one run of a consumption function.
TableQuery<T> translationsQuery = new TableQuery<T>()
    .Where(
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
            TableOperators.Or,
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)
        )
    );
This query will yield approximately 5000 results.
Full code:
public static async Task<IEnumerable<T>> ExecuteQueryAsync<T>(this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
{
    var items = new List<T>();
    TableContinuationToken token = null;

    do
    {
        TableQuerySegment<T> seg = await table.ExecuteQuerySegmentedAsync(query, token);
        token = seg.ContinuationToken;
        items.AddRange(seg);
    } while (token != null);

    return items;
}
public static IEnumerable<Translation> Get<T>(string sourceParty, string destinationParty, string wildcardSourceParty, string tableName) where T : ITableEntity, new()
{
    var acc = CloudStorageAccount.Parse(Environment.GetEnvironmentVariable("conn"));
    var tableClient = acc.CreateCloudTableClient();
    var table = tableClient.GetTableReference(Environment.GetEnvironmentVariable("TableCache"));
    var sourceDestinationPartitionKey = $"{sourceParty.ToLowerTrim()}-{destinationParty.ToLowerTrim()}";
    var anySourceDestinationPartitionKey = $"{wildcardSourceParty}-{destinationParty.ToLowerTrim()}";

    TableQuery<T> translationsQuery = new TableQuery<T>()
        .Where(
            TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
                TableOperators.Or,
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)
            )
        );

    var over1000Results = table.ExecuteQueryAsync(translationsQuery).Result.Cast<Translation>();
    return over1000Results.Where(x => x.expireAt > DateTime.Now)
                          .Where(x => x.effectiveAt < DateTime.Now);
}
During these executions, when there are 100 consumers, the requests cluster and form spikes. During these spikes, the requests often take over 1 minute.
How do we increase the speed of this query?
When querying entities out of Azure Table Storage, the fastest possible way to do that is with both the PartitionKey and RowKey. These are the only indexed fields in Table Storage, and any query that utilises both of them will be returned in a matter of a few milliseconds.
Windows Azure storage partitions. All Windows Azure storage abstractions (Blob, Table, Queue) are built upon the same stack (whitepaper here). While there's much more to say about it, the reason it scales is its partitioning logic. Whenever you store something in Windows Azure storage, it is located on some partition in the system.
In Table Storage, you have to decide on the PartitionKey yourself. In essence, you are responsible for the throughput you'll get on your system. If you put every entity in the same partition (by using the same partition key), you'll be limited to the throughput and capacity of a single storage partition.
When you specify both PartitionKey and RowKey, Table Storage will immediately know which partition to query and can simply do an ID lookup on RowKey within that partition. Less fast but still fast enough is querying by PartitionKey alone: Table Storage will still know which partition to query.
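To make that concrete, here is a minimal sketch of a point query using the same SDK the question uses; the key values and table variable are hypothetical:
// A point query: PartitionKey + RowKey identifies exactly one entity inside
// one partition, so the service does a direct lookup. The key values below
// are illustrative only.
TableOperation retrieve = TableOperation.Retrieve<Translation>(
    "sourceparty-destinationparty", // PartitionKey
    "some-row-key");                // RowKey

TableResult result = await table.ExecuteAsync(retrieve);
var translation = (Translation)result.Result; // null when no entity matches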
There are 3 things you can consider:
1. First of all, get rid of the Where clauses that you apply to the query result in memory. It's better to include as many conditions in the query itself as possible (and if your tables have any indexed fields, include those too). For now, you can change your query as below:
var translationsQuery = new TableQuery<T>()
    .Where(TableQuery.CombineFilters(
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
            TableOperators.Or,
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)
        ),
        TableOperators.And,
        TableQuery.CombineFilters(
            // Note: Table Storage stores DateTime values in UTC, so DateTime.UtcNow
            // is usually the safer choice for these comparisons.
            TableQuery.GenerateFilterConditionForDate("effectiveAt", QueryComparisons.LessThan, DateTime.Now),
            TableOperators.And,
            TableQuery.GenerateFilterConditionForDate("expireAt", QueryComparisons.GreaterThan, DateTime.Now))
    ));
Because you have a large amount of data to retrieve, it's better to run your queries in parallel. So you should replace the do/while loop inside the ExecuteQueryAsync method with a Parallel.ForEach that I wrote based on Stephen Toub's Parallel.While; this will reduce the query execution time. It's a good choice because you can remove the .Result call when you invoke this method, but it has a small limitation that I'll talk about after this part of the code:
public static IEnumerable<T> ExecuteQueryAsync<T>(this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
{
    var items = new List<T>();
    TableContinuationToken token = null;

    Parallel.ForEach(new InfinitePartitioner(), (ignored, loopState) =>
    {
        TableQuerySegment<T> seg = table.ExecuteQuerySegmented(query, token);
        token = seg.ContinuationToken;
        items.AddRange(seg);

        if (token == null) // It's better to change this constraint by looking at https://www.vivien-chevallier.com/Articles/executing-an-async-query-with-azure-table-storage-and-retrieve-all-the-results-in-a-single-operation
            loopState.Stop();
    });

    return items;
}
And then you can call it in your Get method:
return table.ExecuteQueryAsync(translationsQuery).Cast<Translation>();
As you can see, the method itself is not async (you should change its name) and Parallel.ForEach is not compatible with passing in an async method; this is why I've used ExecuteQuerySegmented instead. But to make it more performant and get all the benefits of an asynchronous method, you can replace the above ForEach loop with the ActionBlock class from TPL Dataflow or the ParallelForEachAsync extension method from the AsyncEnumerator NuGet package; a rough sketch of the ActionBlock variant follows.
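This is a hedged sketch only (the method name ExecuteQueryWithDataflowAsync is made up): the continuation-token loop has to stay sequential, so the parallelism here only pays off when the per-item work is heavy. It assumes the System.Threading.Tasks.Dataflow and System.Collections.Concurrent namespaces.
public static async Task<IEnumerable<T>> ExecuteQueryWithDataflowAsync<T>(
    this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
{
    var items = new ConcurrentBag<T>();

    // Each downloaded segment is handed to an ActionBlock that processes
    // items in parallel while the next segment is being fetched.
    var block = new ActionBlock<TableQuerySegment<T>>(
        seg =>
        {
            foreach (var item in seg)
                items.Add(item); // real per-item work would go here
        },
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

    TableContinuationToken token = null;
    do
    {
        var seg = await table.ExecuteQuerySegmentedAsync(query, token);
        token = seg.ContinuationToken;
        await block.SendAsync(seg);
    } while (token != null);

    block.Complete();
    await block.Completion;
    return items;
}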
2. It's a good choice to execute independent parallel queries and then merge the results, even if the performance improvement is at most 10 percent. This gives you room to find the most performance-friendly query. But never forget to include all your constraints in it, and test both ways to know which one better suits your problem. A rough sketch follows.
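This sketch assumes the question's original async do/while ExecuteQueryAsync extension; the helper name RunSplitQueriesAsync is hypothetical:
// Hypothetical helper: one query per PartitionKey, executed in parallel and
// merged afterwards.
public static async Task<IEnumerable<T>> RunSplitQueriesAsync<T>(
    CloudTable table, string firstPartitionKey, string secondPartitionKey)
    where T : ITableEntity, new()
{
    TableQuery<T> MakeQuery(string pk) => new TableQuery<T>().Where(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, pk));

    // Each query targets exactly one partition (a partition scan, not a
    // table scan); run both concurrently and flatten the results.
    var results = await Task.WhenAll(
        table.ExecuteQueryAsync(MakeQuery(firstPartitionKey)),
        table.ExecuteQueryAsync(MakeQuery(secondPartitionKey)));

    return results.SelectMany(r => r);
}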
3. I'm not sure whether it's a good suggestion or not, but try it and see the results. As described on MSDN:
The Table service enforces server timeouts as follows:
Query operations: During the timeout interval, a query may execute for up to a maximum of five seconds. If the query does not complete within the five-second interval, the response includes continuation tokens for retrieving remaining items on a subsequent request. See Query Timeout and Pagination for more information.
Insert, update, and delete operations: The maximum timeout interval is 30 seconds. Thirty seconds is also the default interval for all insert, update, and delete operations.
If you specify a timeout that is less than the service's default timeout, your timeout interval will be used.
So you can play with the timeout and check whether there are any performance improvements.
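For illustration, a sketch of passing an explicit server timeout per request with the same SDK (the five-second value is just something to experiment with, not a recommendation):
// TableRequestOptions lets you set an explicit per-request server timeout.
var options = new TableRequestOptions
{
    ServerTimeout = TimeSpan.FromSeconds(5)
};

// Overload of ExecuteQuerySegmentedAsync that accepts request options.
TableQuerySegment<T> seg = await table.ExecuteQuerySegmentedAsync(
    query, token, options, operationContext: null);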
UPDATE 06-30-2021
Thanks to @WouterVanRanst for looking closely at the above snippet, I decided to update it to use another overload of the Parallel.ForEach method, keep each worker's state local, and prevent a race condition on the TableContinuationToken. You can find a description of partition-local variables with an example here on MSDN. Here's the new look of the ExecuteQueryAsync<T> method:
public static IEnumerable<T> ExecuteQueryAsync<T>(this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
{
    TableContinuationToken token = null;
    var items = new List<T>();

    Parallel.ForEach(new InfinitePartitioner(),
        // Partition-local initializer: each worker accumulates into its own
        // list, so no locking is needed while segments are being collected.
        () => new List<T>(),
        (ignored, loopState, localItems) =>
        {
            TableQuerySegment<T> segment = table.ExecuteQuerySegmented(query, token);
            token = segment.ContinuationToken;
            localItems.AddRange(segment);

            if (token == null)
                loopState.Stop();

            return localItems;
        },
        // Partition-local finalizer: merge each worker's list into the result.
        (localItems) => { lock (items) items.AddRange(localItems); }
    );

    return items;
}
NOTE: Of course you can polish the code above or find a better approach to prevent the race condition, but it's a simple one you can put together in no time. I'll be glad to hear your thoughts on it.
So the secret is not only in the code but also in setting up your Azure storage tables.
a) One of the prominent options to optimize your queries in Azure is to introduce caching. This will drastically reduce your overall response times and thereby avoid the bottleneck during the peak load you have mentioned; see the sketch below.
b) Also, when querying entities out of Azure, the fastest possible way to do that is with both the PartitionKey and RowKey. These are the only indexed fields in Table Storage, and any query that utilises both of them will be returned in a matter of a few milliseconds. So ensure you use both PartitionKey and RowKey.
See more details here : https://docs.microsoft.com/en-us/azure/storage/tables/table-storage-design-for-query
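To illustrate option (a), here is a minimal in-process caching sketch assuming System.Runtime.Caching; the TranslationCache class, the key scheme, and the five-minute lifetime are all assumptions for illustration:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.Caching;

public static class TranslationCache
{
    // Hypothetical wrapper: consumers arriving within the spike window hit
    // memory instead of Table Storage. The key scheme and 5-minute lifetime
    // are illustrative assumptions.
    public static IEnumerable<Translation> GetCached(
        string partitionKey, Func<IEnumerable<Translation>> fetch)
    {
        var cache = MemoryCache.Default;
        string key = $"translations:{partitionKey}";

        if (cache.Get(key) is IEnumerable<Translation> cached)
            return cached;

        var results = fetch().ToList();
        cache.Set(key, results, DateTimeOffset.UtcNow.AddMinutes(5));
        return results;
    }
}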
Hope this helps.
var over1000Results = table.ExecuteQueryAsync(translationsQuery).Result.Cast<Translation>();
return over1000Results.Where(x => x.expireAt > DateTime.Now)
                      .Where(x => x.effectiveAt < DateTime.Now);
Here is one of the problems: you are running the query and then filtering it in memory using these Where clauses. Move the filters to before the query runs, which should help a lot. Second, you should provide some limit on the number of rows to retrieve from the database, for example with Take.
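A sketch, assuming the same SDK as the question; the 1000 value and the combinedFilter variable are arbitrary placeholders:
// Limit how many entities the query returns; 1000 here is only an example.
// Note: Take caps the total result count; the service may still return
// fewer rows per segment along with a continuation token.
TableQuery<T> limitedQuery = new TableQuery<T>()
    .Where(combinedFilter) // the filter string built earlier
    .Take(1000);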
Unfortunately, the query below introduces a full table scan, because an OR across PartitionKey values cannot be resolved to a single partition:
TableQuery<T> translationsQuery = new TableQuery<T>()
    .Where(
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
            TableOperators.Or,
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)
        )
    );
You should split it into two PartitionKey filters and query them separately; each becomes a partition scan and performs more efficiently, for example:
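A minimal sketch of the split, reusing the question's variable names and its ExecuteQueryAsync extension (the surrounding async method is implied):
// Each query has a single PartitionKey equality filter, so each one is a
// partition scan rather than a full table scan.
var exactQuery = new TableQuery<T>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey));

var wildcardQuery = new TableQuery<T>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey));

// Run them (sequentially here; in parallel works too) and merge the results.
var merged = (await table.ExecuteQueryAsync(exactQuery))
    .Concat(await table.ExecuteQueryAsync(wildcardQuery));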