Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

DocumentDB PartitionKey and performance

I have a scenario where I store large amounts of third party data for ad-hoc analysis by business users. Most queries against the data will be complicated, using multiple self-joins, projections, and ranges.

When it comes to picking a PartitionKey for use in Azure DocumentDB, I see people advising to use a logical separator such as TenantId, DeviceId, etc.

Given the parallel nature of DocumentDB, however, I was curious how it would handle a PartitionKey based either on some sort of GUID or large integer so that, during large reads, it would be highly parellelized.

With that in mind, I devised a test with two collections:

  1. test-col-1
    • PartitionKey is TenantId with roughly 100 possible values
  2. test-col-2
    • PartitionKey is unique value assigned by third party that follows the pattern "AB1234568". Guaranteed to be globally unique by the third party.

Both collections are set to 100,000 RUs.

In my experiment I loaded both collections with roughly 2,000 documents. Each document is roughly 20 KB in size and is highly denormalized. Each document is an order, which contains multiple jobs, each of which contain users, prices, etc.

Example query:

SELECT
orders.Attributes.OrderNumber,
orders.Attributes.OpenedStamp,
jobs.SubOrderNumber,
jobs.LaborTotal.Amount As LaborTotal,
jobs.LaborActualHours As LaborHours,
jobs.PartsTotal.Amount As PartsTotal,
jobs.JobNumber,
jobs.Tech.Number As TechNumber,
orders.Attributes.OrderPerson.Number As OrderPersonNumber,
jobs.Status
FROM orders
JOIN jobs IN orders.Attributes.Jobs
JOIN tech IN jobs.Techs
WHERE   orders.TenantId = @TentantId
    AND orders.Attributes.Type = 1
    AND orders.Attributes.Status IN (4, 5)";

In my testing I adjusted the following settings:

  1. Default ConnectionPolicy
  2. Best practices ConnectionPolicy
    • ConnectionMode.Direct, Protocol.Tcp
  3. Various MaxDegreeOfParallelism values
  4. Various MaxBufferedItemCount

The collection with the GUID PartitionKey was queried with EnableCrossPartitionQuery = true. I am using C# and the .NET SDK v1.14.0.

In my initial tests with default settings, I found that querying the collection with TentantId as the PartitionKey was faster, with it taking on average 3,765 ms compared to 4,680 ms on the GUID-keyed collection.

When I set the ConnectionPolicy to Direct with TCP, I discovered TenantID collection query times decreased by nearly 1000 ms to an average of 2,865 ms while the GUID collection increased by about 800 ms to an average of 5,492 ms.

Things started getting interesting when I started playing around with MaxDegreeOfParellelism and MaxBufferedItemCount. The TentantID collection query times were generally unaffected because the query wasn't cross-collection, however the GUID collection sped up considerably, reaching values as fast as 450 ms (MaxDegreeOfParellelism = 2000, MaxBufferedItemCount = 2000).


Given these observations, why would you not want to make the PartitionKey as broad a value as possible?

like image 236
Chad Levy Avatar asked Oct 17 '22 12:10

Chad Levy


1 Answers

Things started getting interesting when I started playing around with MaxDegreeOfParellelism and MaxBufferedItemCount. The TentantID collection query times were generally unaffected because the query wasn't cross-collection, however the GUID collection sped up considerably, reaching values as fast as 450 ms (MaxDegreeOfParellelism = 2000, MaxBufferedItemCount = 2000).

MaxDegreeOfParallelism could set the maximum number of concurrent tasks enabled by ParallelOptions instance. As I known, this is a client side parallelism and it would cost your CPU / Memory resources that you have on your site.

Given these observations, why would you not want to make the PartitionKey as broad a value as possible?

For write operations, we could scale across partition keys, in order to use the throughout that you have provisioned. While for read operations, we need to minimize cross-partition lookups for a lower latency.

Also, as this official document mentioned:

The choice of the partition key is an important decision that you have to make at design time. You must pick a property name that has a wide range of values and has even access patterns.

It is a best practice to have a partition key with many distinct values (100s-1000s at a minimum).

To achieve the full throughput of the container, you must choose a partition key that allows you to evenly distribute requests among some distinct partition key values.

For more details, you could refer to How to partition and scale in Azure Cosmos DB and this channel 9 tutorial about Azure DocumentDB Elastic Scale - Partitioning.

like image 145
Bruce Chen Avatar answered Oct 20 '22 22:10

Bruce Chen