Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are queries against Azure Table Storage indexed when using a partial RowKey?

I understand from the MS PDC presentations that the PartitionKey is used to load balance the table across multiple servers, but nobody seems to give any advice on whether the PartitionKey is used as an index WITHIN a single server.

Likewise, everyone will tell you that specifying the PartitionKey AND the RowKey gets you great performance, but nobody seems to tell you if the RowKey is being used to improve performance WITHIN a PartitionKey.

Here are some sample queries to help me frame the questions. Assume the entire table contains 100,000,000 rows.

  1. PartionKey="123" and OtherField="def"
  2. PartitionKey="123" and RowKey >= "aaa" and RowKey < "aac"

Here are my questions:

  • If I have only 10 rows in each Partition, would Query 1 be fast?
  • If I have 1,000,000 rows in each Partition, would Query 2 be fast?
like image 675
Doug Clutter Avatar asked Feb 09 '11 14:02

Doug Clutter


1 Answers

In ATS, PartitionKey is used as a distribution lookup, not an index. From the level of working with ATS, just consider PartitionKey and "server"/node to share a 1:1 relationship. (Behind the scenes this isn't true, but concepts such as optimizing PartitionKeys that happen to reside on the same physical/virtual node are abstracted several levels from what a consumer of Azure has to deal with. Those details are purely internal to the overall Azure infrastructure and in the case of ATS, its best to assume that is an optimal as it can be ... aka "dont worry about it")

In the context of a DBMS vs ATS, RowKey is the closest thing to an "index" in that it assists in finding data across a similar node. To directly answer one of your question, RowKey is the index within the PartitionKey.

Stepping outside the box a bit, however, PartitionKey can give you perf gains closer to how you think of a traditional index, but only because of the distributed nature of how your data is spread across ATS nodes. You should optimize layout 1st to the PartitionKey, then to the RowKey. (aka, if you only have one keyable value, make it the PartKey)

In general rule, queries are going to perform in this order, from most efficient to least efficient

1. PartitionKey=x and RowKey=y (and OtherProp = z)

because the lookup gets to the right node and then to an indexed prop on the partition

2. PartitionKey=x (and OtherProp =z)

because you get to the proper node, but then to the ATS equvi. of a full table scan

3. OtherProp = z

because you have to a partition scan, then a table scan


With that, to your direct questions

  1. I don't feel this can be answered. Its subjective (ie "what is fast?"). It will always be slower than Query2, but with 10 rows that "slowness" is likely milliseconds if even

  2. (similar theme) It will be faster than Query 1. Anytime you can do Query2, you should

So with that explaination and your questions, the real answer comes down to how your architect your usage of ATS.

Based on your data set (both current and expected growth) you need to determine a proper scheme so that you can get to your Partition AND to your Row is the fastest way possible. Knowing how the lookup occurs, you can make logical decisions as to what path is going to get you there fast enough, more parts, less rows -vs- less parts, more rows, etc

like image 138
Taylor Bird Avatar answered Sep 30 '22 04:09

Taylor Bird