Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cosmos DB Graph Edge partitioning

Cosmos DB has pre-announced general availability of Gremlin (Graph API). Probably by the end of 2017 it will get out of preview, so we might consider it stable enough for production. That brings me to the following:

We are designing a system with an estimated user-base up to 100 million users. Each user will have some documents in Cosmos to store user-related data, those documents are partitioned on the id of the user (a Guid). So when estimations come true we will end up with at least 100 million partitions, each containing a bunch of documents.

Not only will we store user-related data but also interrelated data (relationships) between users. On paper Cosmos should be very well suited for these kinds of scenarios, utilizing it cross-api with Document API for normal data and Graph API purely for the relationships.

An example of one of these relationships is a Follow. For instance UserX can Follow UserY. To realize this relationship, we created a Gremlin query that creates an Edge:

    g.V().hasId('{userX.Id}').has('pkey','{userX.Partition}')
    .addE('follow').to(g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}'))

The resulting Edge automatically gets assigned to the partition of UserX, because UserX is the out-vertex.

When querying on outgoing edges (all the users that UserX is following), all is fine and well because the query is limited to the partition for UserX.

    g.V().hasId('{userX.Id}').has('pkey','{userX.Partition}').outE('follow').inV()

However when inverting the query (find all followers of UserY), looking for incoming edges, the situation changes - to my knowledge this will result in a full cross-partition query:

    g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}').inE('follow').outV()

In my opinion a full cross-partition query with 100 million partitions is unacceptable.

I have tried putting the Edge between UserX and UserY inside its own partition, but the Graph API does not let me do this. (Edit: Changed Cosmos to Graph API)

Now I have come to the point of implementing a pair of edges between UserX and UserY, one outgoing Edge for UserX and one outgoing Edge for UserY, trying to keep them in-sync. All this in order to optimize the speed of my queries, but also introducing more work to achieve eventual consistency.

Then again I am wondering if the Graph API is really up to these kinds of scenario's - or I am really missing on something here?

like image 964
cldons Avatar asked Nov 21 '17 13:11

cldons


People also ask

How does Cosmos DB enable horizontal partitioning?

In Azure Cosmos DB, partitioning is what allows you to massively scale your database, not just in terms of storage but also throughput. You simply create a container in your database, and let Cosmos DB partition the data you store in that container, to manage its growth.

Does Cosmos DB support graph database?

It is a multi-model database and supports document, key-value, graph, and column-family data models. Azure Cosmos DB provides a graph database service via the Gremlin API on a fully managed database service designed for any scale.

How many partitions does Cosmos DB have?

There is no limit to the number of logical partitions in your container. Each logical partition can store up to 20GB of data. Good partition key choices have a wide range of possible values.

How do you graph a Cosmos DB?

In your Azure Cosmos DB account in the Azure portal, select Data Explorer, expand sample-graph, select Graph, and then select Apply Filter. In the Results list, notice the new users added to the graph.


2 Answers

I will start by clearing a slight misconception you have regarding CosmosDB partitioning. 100 Million users doesn’t mean 100 million partitions. They simply mean 100 million partition keys. When you create a cosmos dB graph it starts with 10 physical partitions ( this is starting default which can be changed upon request), and then scales automatically as data grows.

In this case 100 million users will be distributed among 10 physical partitions. Hence the full cross partition query will hit on 10 physical partition. Also note that these partitions will be hit in parallel, so the expected latency would be similar to hitting one partition, unless operation is similar to aggregates in nature.

like image 188
Jayanta Mondal Avatar answered Oct 13 '22 14:10

Jayanta Mondal


This is a classic partitioning dilemma, not unique to Cosmos/Graph.

If your usage pattern is lots of queries with small scope then cross-partition is bad. If it is returning large data sets then cross-partition overhead is probably insignificant against the benefits of parallelism. Unless you have a constant high volume of queries then I think the cross-partition overhead is overstated (MS seem to think everyone is building the next Facebook on Cosmos).

In the OP case you can optimise for x follows y, or x is followed by y, or both by having an edge each way. Note that RUs are reserved on a per partition basis (i.e. total RU / number of partitions) so to use them efficiently you need either high volume, evenly distributed, single partition queries or queries that span multiple partitions.

like image 41
Ian Bennett Avatar answered Oct 13 '22 15:10

Ian Bennett