
How to speed up a query with PartitionKey in Azure Table Storage

How do we increase the speed of this query?

We have approximately 100 consumers within the span of 1-2 minutes executing the following query. Each one of these runs represents 1 run of a consumption function.

        TableQuery<T> translationsQuery = new TableQuery<T>()
            .Where(
                TableQuery.CombineFilters(
                    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
                    TableOperators.Or,
                    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)
                )
            );

This query will yield approximately 5000 results.

Full code:

    public static async Task<IEnumerable<T>> ExecuteQueryAsync<T>(this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
    {
        var items = new List<T>();
        TableContinuationToken token = null;

        do
        {
            TableQuerySegment<T> seg = await table.ExecuteQuerySegmentedAsync(query, token);
            token = seg.ContinuationToken;
            items.AddRange(seg);
        } while (token != null);

        return items;
    }

    public static IEnumerable<Translation> Get<T>(string sourceParty, string destinationParty, string wildcardSourceParty, string tableName) where T : ITableEntity, new()
    {
        var acc = CloudStorageAccount.Parse(Environment.GetEnvironmentVariable("conn"));
        var tableClient = acc.CreateCloudTableClient();
        var table = tableClient.GetTableReference(Environment.GetEnvironmentVariable("TableCache"));
        var sourceDestinationPartitionKey = $"{sourceParty.ToLowerTrim()}-{destinationParty.ToLowerTrim()}";
        var anySourceDestinationPartitionKey = $"{wildcardSourceParty}-{destinationParty.ToLowerTrim()}";

        TableQuery<T> translationsQuery = new TableQuery<T>()
            .Where(
                TableQuery.CombineFilters(
                    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
                    TableOperators.Or,
                    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)
                )
            );

        var over1000Results = table.ExecuteQueryAsync(translationsQuery).Result.Cast<Translation>();
        return over1000Results.Where(x => x.expireAt > DateTime.Now)
                              .Where(x => x.effectiveAt < DateTime.Now);
    }

During these executions, when there are 100 consumers, as you can see the requests will cluster and form spikes:


During these spikes, the requests often take over 1 minute:


How do we increase the speed of this query?

Asked by Alex Gordon, Nov 04 '19

People also ask

What is the fastest way to query entities out of azure?

When querying entities out of Azure, the fastest possible way to do that is with both the PartitionKey and RowKey. These are the only indexed fields in Table Storage, and any query that utilises both of these will be returned in a matter of a few milliseconds.

What is Windows Azure Storage partitioning?

All Windows Azure storage abstractions (Blob, Table, Queue) are built upon the same stack (whitepaper here). While there's much more to tell about it, the reason it scales is its partitioning logic. Whenever you store something on Windows Azure storage, it is located on some partition in the system.

How does partitionkey work in table storage?

In Table Storage, you have to decide on the PartitionKey yourself. In essence, you are responsible for the throughput you’ll get on your system. If you put every entity in the same partition (by using the same partition key), you’ll be limited to the size of the storage machines for the amount of storage you can use.

How does a database know which partition to query?

By doing this, table storage will immediately know which partition to query and can simply do an ID lookup on RowKey within that partition. Less fast but still fast enough is querying by specifying only the PartitionKey: table storage will know which partition to query.


4 Answers

There are three things you can consider:

1. First of all, get rid of the Where clauses that you apply to the query result. It's better to push as many conditions as possible into the query itself (and if you have any indexes on your tables, include those too). For now, you can change your query as below:

var translationsQuery = new TableQuery<T>()
    .Where(TableQuery.CombineFilters(
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
            TableOperators.Or,
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)),
        TableOperators.And,
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterConditionForDate("effectiveAt", QueryComparisons.LessThan, DateTime.Now),
            TableOperators.And,
            TableQuery.GenerateFilterConditionForDate("expireAt", QueryComparisons.GreaterThan, DateTime.Now))
    ));

Because you have a large amount of data to retrieve, it's better to run your queries in parallel. So you could replace the do...while loop inside the ExecuteQueryAsync method with a Parallel.ForEach I wrote based on Stephen Toub's Parallel.While; this way it will reduce the query execution time. This is a good choice because you can remove .Result when you call this method, but it has a small limitation that I'll talk about after this part of the code:

public static IEnumerable<T> ExecuteQueryAsync<T>(this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
{
    var items = new List<T>();
    TableContinuationToken token = null;

    Parallel.ForEach(new InfinitePartitioner(), (ignored, loopState) =>
    {
        TableQuerySegment<T> seg = table.ExecuteQuerySegmented(query, token);
        token = seg.ContinuationToken;
        items.AddRange(seg);

        if (token == null) // It's better to change this constraint by looking at https://www.vivien-chevallier.com/Articles/executing-an-async-query-with-azure-table-storage-and-retrieve-all-the-results-in-a-single-operation
            loopState.Stop();
    });

    return items;
}

And then you can call it in your Get method:

return table.ExecuteQueryAsync(translationsQuery).Cast<Translation>();

As you can see, the method itself is not async (you should change its name), and Parallel.ForEach is not compatible with passing in an async method. This is why I've used ExecuteQuerySegmented instead. But to make it more performant and get all the benefits of an asynchronous method, you can replace the above ForEach loop with an ActionBlock from TPL Dataflow or the ParallelForEachAsync extension method from the AsyncEnumerator NuGet package.

2. It's a good choice to execute independent parallel queries and then merge the results, even if the performance improvement is at most 10 percent. This gives you time to find the best performance-friendly query. But never forget to include all your constraints in it, and test both ways to learn which one better suits your problem.

3. I'm not sure whether it's a good suggestion or not, but try it and see the results. As described on MSDN:

The Table service enforces server timeouts as follows:

  • Query operations: During the timeout interval, a query may execute for up to a maximum of five seconds. If the query does not complete within the five-second interval, the response includes continuation tokens for retrieving remaining items on a subsequent request. See Query Timeout and Pagination for more information.

  • Insert, update, and delete operations: The maximum timeout interval is 30 seconds. Thirty seconds is also the default interval for all insert, update, and delete operations.

If you specify a timeout that is less than the service's default timeout, your timeout interval will be used.

So you can play with the timeout and check whether there is any performance improvement.
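As a sketch of what "playing with the timeout" could look like, both the server-side timeout and the client-side execution cap can be set per request through `TableRequestOptions` (the specific durations below are arbitrary choices for illustration, not recommendations):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class TimeoutExample
{
    public static async Task<TableQuerySegment<T>> QueryWithTimeoutAsync<T>(
        CloudTable table, TableQuery<T> query, TableContinuationToken token)
        where T : ITableEntity, new()
    {
        var options = new TableRequestOptions
        {
            // Server-side timeout for a single request.
            ServerTimeout = TimeSpan.FromSeconds(5),
            // Client-side cap across all retries of this request.
            MaximumExecutionTime = TimeSpan.FromSeconds(15)
        };

        return await table.ExecuteQuerySegmentedAsync(query, token, options, operationContext: null);
    }
}
```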

UPDATE 06-30-2021

Thanks to @WouterVanRanst for looking closely into the snippet above, I decided to update it to use another overload of the Parallel.ForEach method, make the loop single threaded, and prevent the race condition on TableContinuationToken. You can find a description of partition-local variables, with an example, here on MSDN. Here's the new look of the ExecuteQueryAsync<T> method:

public static IEnumerable<T> ExecuteQueryAsync<T>(this CloudTable table, TableQuery<T> query) where T : ITableEntity, new()
{
    TableContinuationToken token = null;
    var items = new List<T>();

    Parallel.ForEach(new InfinitePartitioner(), () =>
    {
        return null as TableQuerySegment<T>;
    }, (ignored, loopState, segment) =>
    {
        segment = table.ExecuteQuerySegmented(query, token) as TableQuerySegment<T>;
        
        token = segment.ContinuationToken;

        if (token == null)
            loopState.Stop();

        return segment;
    },
    (seg) => items.AddRange(seg)
    );

    return items;
}

NOTE: Of course you can polish the code above or find a better approach to prevent the race condition, but this is a simple one. I'd be glad to hear your thoughts on it.

Answered by MasLoo, Oct 12 '22


So the secret is not only in the code but also in how you set up your Azure storage tables.

a) One of the prominent options to optimize your queries in Azure is to introduce caching. This will drastically reduce your overall response times and thereby avoid the bottleneck during the peak period you mentioned.
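As a minimal sketch of option (a), the result for a given query could be held in an in-process `MemoryCache` with a short TTL, so most of the ~100 concurrent consumers are served from memory instead of hitting storage (the cache key, TTL, and helper name here are illustrative assumptions, not from the question):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.Caching;
using System.Threading.Tasks;

public static class TranslationCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static async Task<IEnumerable<T>> GetOrFetchAsync<T>(
        string cacheKey,
        Func<Task<IEnumerable<T>>> fetch,
        TimeSpan ttl)
    {
        if (Cache.Get(cacheKey) is IEnumerable<T> cached)
            return cached;

        // Consumers arriving after the entry is populated are served
        // from memory until it expires; only cache misses pay the
        // storage round-trip.
        var items = await fetch();
        Cache.Set(cacheKey, items, DateTimeOffset.UtcNow.Add(ttl));
        return items;
    }
}
```

For example: `TranslationCache.GetOrFetchAsync(sourceDestinationPartitionKey, () => table.ExecuteQueryAsync(query), TimeSpan.FromSeconds(30))`. Note that concurrent misses on a cold cache will still all query storage; a per-key lock would be needed to prevent that stampede.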

b) Also, when querying entities out of Azure, the fastest possible way to do that is with both the PartitionKey and RowKey. These are the only indexed fields in Table Storage, and any query that utilises both of these will be returned in a matter of a few milliseconds. So ensure you use both PartitionKey and RowKey wherever you can.
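When both keys are known, this becomes a point lookup rather than a filtered query; a minimal sketch using `TableOperation.Retrieve`, assuming the `Translation` entity class from the question:

```csharp
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class PointQueryExample
{
    public static async Task<Translation> GetSingleAsync(
        CloudTable table, string partitionKey, string rowKey)
    {
        // Point query: both indexed keys are specified, so the service
        // goes straight to one entity instead of scanning a partition.
        TableOperation retrieve = TableOperation.Retrieve<Translation>(partitionKey, rowKey);
        TableResult result = await table.ExecuteAsync(retrieve);
        return (Translation)result.Result;
    }
}
```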

See more details here: https://docs.microsoft.com/en-us/azure/storage/tables/table-storage-design-for-query

Hope this helps.

Answered by S.S.Prabhu, Oct 06 '22


    var over1000Results = table.ExecuteQueryAsync(translationsQuery).Result.Cast<Translation>();
    return over1000Results.Where(x => x.expireAt > DateTime.Now)
                          .Where(x => x.effectiveAt < DateTime.Now);

Here is one of the problems: you are running the query and then filtering the results in memory with these Where clauses. Move the filters into the query itself, before it runs, which should help a lot.

Second, you should provide some limit on the number of rows retrieved from the table.
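One way to bound the result size is `TableQuery<T>.Take`, which caps how many entities the service returns; a sketch, again assuming the `Translation` entity from the question (the limit of 100 is an arbitrary example):

```csharp
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class LimitedQueryExample
{
    public static async Task<TableQuerySegment<Translation>> GetTopAsync(
        CloudTable table, string partitionKey, int limit)
    {
        // Take caps how many entities come back, instead of streaming
        // the full ~5000-row result set to every consumer.
        TableQuery<Translation> query = new TableQuery<Translation>()
            .Where(TableQuery.GenerateFilterCondition(
                "PartitionKey", QueryComparisons.Equal, partitionKey))
            .Take(limit);

        return await table.ExecuteQuerySegmentedAsync(query, null);
    }
}
```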

Answered by Miranda, Oct 12 '22


Unfortunately, the query below introduces a full table scan:

    TableQuery<T> translationsQuery = new TableQuery<T>()
        .Where(
            TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sourceDestinationPartitionKey),
                TableOperators.Or,
                TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, anySourceDestinationPartitionKey)
            )
        );

You should split it into two Partition Key filters and query them separately, which will become two partition scans and perform more efficiently.
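One way to apply this, reusing the `ExecuteQueryAsync` extension method from the question, is to run the two single-partition queries concurrently and concatenate the results:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class SplitQueryExample
{
    public static async Task<IEnumerable<T>> QueryTwoPartitionsAsync<T>(
        this CloudTable table, string firstKey, string secondKey)
        where T : ITableEntity, new()
    {
        // Each query targets exactly one partition, so the service does
        // two partition scans instead of one full table scan.
        TableQuery<T> first = new TableQuery<T>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, firstKey));
        TableQuery<T> second = new TableQuery<T>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, secondKey));

        var results = await Task.WhenAll(
            table.ExecuteQueryAsync(first),
            table.ExecuteQueryAsync(second));

        return results.SelectMany(r => r);
    }
}
```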

Answered by Zhaoxing Lu, Oct 12 '22