
CosmosDB/DocumentDB partitioning with multiple types in same collection

The official recommendation from the team is, to my knowledge, to put all data types into a single collection, with a field such as type=someType on each document to distinguish the types.

Now, assume a large, partitioned database where the different object types can be:

  1. Completely different fields (so no common field for partitioning)
  2. Related (through reference)

How do you organize things so that documents that belong together end up in the same partition?

For example, let's say we have:

User

BlogPost

BlogPostComment

If we store them as separate types with type=user|blogPost|blogPostComment in the same collection, how do we ensure that a user, their blog posts, and all the corresponding comments end up in the same partition? Is there a best practice for this?

[UPDATE] Can you ever avoid cross-partition queries completely? Should that be a goal, or do you just try to minimize them? For example, you can partition your data perfectly for 99% of cases/queries, but then you need a dashboard that shows aggregates over all the data. Is that something you accept as inevitable and try to minimize, or is it possible to avoid it completely?

dee zg asked Mar 07 '23 17:03



1 Answer

I've written about this somewhat extensively in other similar questions regarding Cosmos.

Basically, when dealing with many different logical entity types in a single Cosmos collection, the easiest option is to put a generic (or abstract, as you refer to it) partition key on all your documents. It then becomes the application's concern to choose the appropriate value at runtime. I usually name this document property partitionKey, routingKey, or something similar.

This is extremely important when designing for query efficiency, as your choice of partition keys can have a huge impact on query and throughput performance. A generic key like this lets you lay out your data in whatever way best suits the application you're building.
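As a minimal sketch of the generic-key idea applied to the User / BlogPost / BlogPostComment example from the question: every document carries the same partitionKey property, and the application decides what goes in it. Here the assumption is that the owning user's id is the value, so a user, their posts, and the comments on those posts share a partition. The helper names (make_user, make_blog_post, make_comment) and fields such as authorId and postId are hypothetical, and no Cosmos SDK calls are involved; these are just the dictionaries you would hand to whatever client you use.

```python
# Hypothetical document builders; only "type" and "partitionKey" come from the discussion above.

def make_user(user_id: str, name: str) -> dict:
    # A user lives in the partition named after their own id.
    return {"id": user_id, "type": "user", "partitionKey": user_id, "name": name}


def make_blog_post(post_id: str, author_id: str, title: str) -> dict:
    # A post is routed to its author's partition.
    return {
        "id": post_id,
        "type": "blogPost",
        "partitionKey": author_id,
        "authorId": author_id,
        "title": title,
    }


def make_comment(comment_id: str, post: dict, commenter_id: str, text: str) -> dict:
    # A comment follows the post it belongs to, not the person who wrote it,
    # so reading a post together with its comments stays in one partition.
    return {
        "id": comment_id,
        "type": "blogPostComment",
        "partitionKey": post["partitionKey"],
        "postId": post["id"],
        "commenterId": commenter_id,
        "text": text,
    }


user = make_user("u1", "Alice")
post = make_blog_post("p1", user["id"], "Hello Cosmos")
comment = make_comment("c1", post, "u2", "Nice post")
```

Whether comments should follow the post's author or the commenter depends on which reads matter most in your application; this sketch optimizes for "show a user's blog with its comments".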

Even something like a tenant ID does not make sense as a partition key on its own, as different tenants might have wildly different data sizes and access patterns. Instead, you could include the tenantId at runtime as part of your partition key, as a kind of composite.
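One way to build such a composite at runtime might look like the sketch below. The function name composite_partition_key, the ":" separator, and the choice of extra segments are all assumptions made for illustration; pick whatever segments match your own access patterns.

```python
def composite_partition_key(tenant_id: str, *parts: str) -> str:
    """Join the tenant id with further segments into one partition key string.

    Hypothetical helper: the ':' separator is an arbitrary choice and assumes
    the individual segments do not themselves contain ':'.
    """
    segments = [tenant_id, *parts]
    if any(not segment for segment in segments):
        raise ValueError("partition key segments must be non-empty")
    return ":".join(segments)


# A large tenant's data is spread over many partitions (one per user here),
# instead of all landing in a single tenant-wide partition.
key_for_user = composite_partition_key("tenantA", "user", "u1")
key_for_other_user = composite_partition_key("tenantA", "user", "u2")
```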

UPDATE: For certain query patterns it might be possible to serve them entirely out of a single partition. It's definitely not the end of the world if a query ends up going cross-partition, though; the system is still quick. If possible, limiting the number of partitions that need to be touched for a given query is ideal, but you're never going to get away from cross-partition queries 100% of the time.
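To make the difference concrete, here is a small in-memory illustration (not how Cosmos routes queries internally, just a way to count which partition key values a query would need to visit). The sample documents and the partitions_touched helper are hypothetical.

```python
# Sample documents keyed by owning user, as in the question's example.
docs = [
    {"id": "u1", "type": "user", "partitionKey": "u1"},
    {"id": "p1", "type": "blogPost", "partitionKey": "u1"},
    {"id": "c1", "type": "blogPostComment", "partitionKey": "u1"},
    {"id": "u2", "type": "user", "partitionKey": "u2"},
    {"id": "p2", "type": "blogPost", "partitionKey": "u2"},
]


def partitions_touched(documents, predicate):
    # Collect the distinct partition key values of all matching documents.
    return {d["partitionKey"] for d in documents if predicate(d)}


# "Everything belonging to user u1": served from a single partition.
single = partitions_touched(docs, lambda d: d["partitionKey"] == "u1")

# Dashboard-style query over all blog posts: has to fan out across partitions.
dashboard = partitions_touched(docs, lambda d: d["type"] == "blogPost")
```

The first query filters on the partition key itself, so it stays in one partition; the dashboard query filters on type only, so every partition holding a blog post is involved.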

Jesse Carter answered Apr 27 '23 00:04
