Cosmos DB Graph Edge partitioning

Tags: graph, azure, gremlin, azure-cosmosdb

Cosmos DB has pre-announced general availability of Gremlin (Graph API). It will probably come out of preview by the end of 2017, so we might consider it stable enough for production. That brings me to the following:

We are designing a system with an estimated user base of up to 100 million users. Each user will have some documents in Cosmos to store user-related data; those documents are partitioned on the id of the user (a Guid). So if the estimates come true, we will end up with at least 100 million partitions, each containing a bunch of documents.

Not only will we store user-related data but also interrelated data (relationships) between users. On paper Cosmos should be very well suited for these kinds of scenarios, utilizing it cross-api with Document API for normal data and Graph API purely for the relationships.

An example of one of these relationships is a Follow. For instance UserX can Follow UserY. To realize this relationship, we created a Gremlin query that creates an Edge:

    g.V().hasId('{userX.Id}').has('pkey','{userX.Partition}')
    .addE('follow').to(g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}'))

The resulting Edge automatically gets assigned to the partition of UserX, because UserX is the out-vertex.

When querying on outgoing edges (all the users that UserX is following), all is fine and well because the query is limited to the partition for UserX.

    g.V().hasId('{userX.Id}').has('pkey','{userX.Partition}').outE('follow').inV()

However, when inverting the query (find all followers of UserY) and looking for incoming edges, the situation changes. To my knowledge, this will result in a full cross-partition query:

    g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}').inE('follow').outV()

In my opinion a full cross-partition query with 100 million partitions is unacceptable.

I have tried putting the Edge between UserX and UserY inside its own partition, but the Graph API does not let me do this. (Edit: Changed Cosmos to Graph API)

Now I have come to the point of implementing a pair of edges between UserX and UserY, one outgoing Edge for UserX and one outgoing Edge for UserY, and trying to keep them in sync. All this is to optimize the speed of my queries, but it also introduces more work to achieve eventual consistency.
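For illustration, the paired-edge idea could be sketched roughly as below. This is only a sketch under assumptions: the reverse edge label `followedBy` and the helper names are hypothetical, the `pkey` property is taken from the queries above, and the two writes are separate requests, so nothing here keeps them atomic.

```python
def vertex(user_id: str, partition: str) -> str:
    """Gremlin fragment selecting one user vertex inside its own partition."""
    return f"g.V().hasId('{user_id}').has('pkey','{partition}')"


def follow_edge_queries(x_id: str, x_pk: str, y_id: str, y_pk: str) -> tuple[str, str]:
    """Build the two edge-creation queries for 'X follows Y'.

    The forward edge is stored with X (its out-vertex), the reverse edge
    with Y. 'followedBy' is a hypothetical label chosen for this sketch.
    NOTE: plain string interpolation mirrors the question's templates;
    real code should prefer parameter bindings where the driver offers them.
    """
    forward = f"{vertex(x_id, x_pk)}.addE('follow').to({vertex(y_id, y_pk)})"
    reverse = f"{vertex(y_id, y_pk)}.addE('followedBy').to({vertex(x_id, x_pk)})"
    return forward, reverse


def followers_query(y_id: str, y_pk: str) -> str:
    """Followers of Y via Y's own outgoing edges, so only Y's partition is read."""
    return f"{vertex(y_id, y_pk)}.outE('followedBy').inV()"


forward, reverse = follow_edge_queries("userX", "userX", "userY", "userY")
print(forward)
print(reverse)
print(followers_query("userY", "userY"))
```

The follower lookup then starts from UserY and walks outgoing edges only, at the cost of having to write (and repair) two edges per Follow.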

Then again, I am wondering whether the Graph API is really up to these kinds of scenarios, or am I missing something here?

asked Nov 21 '17 13:11

cldons

2 Answers

I will start by clearing up a slight misconception you have regarding Cosmos DB partitioning. 100 million users doesn't mean 100 million partitions; it simply means 100 million partition keys. When you create a Cosmos DB graph, it starts with 10 physical partitions (this is the starting default, which can be changed upon request) and then scales automatically as data grows.
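To make the difference between partition keys and physical partitions concrete, here is a small sketch. It assumes a plain hash-modulo placement purely for illustration; Cosmos DB's real hashing and range assignment are internal and will differ, and the 100,000 sample ids merely stand in for the 100 million users.

```python
import hashlib
import uuid
from collections import Counter

PHYSICAL_PARTITIONS = 10  # the starting figure quoted above


def physical_partition(partition_key: str, count: int = PHYSICAL_PARTITIONS) -> int:
    """Map a logical partition key to a physical partition index.

    Illustration only: SHA-256 modulo N is an assumption of this sketch,
    not the placement scheme Cosmos DB actually uses.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % count


# Distinct Guid-shaped ids, one logical partition key per user.
user_ids = [str(uuid.UUID(int=i)) for i in range(100_000)]
load = Counter(physical_partition(uid) for uid in user_ids)

print(len(set(user_ids)), "logical partition keys")
print(len(load), "physical partitions in use")
```

Every user still has their own partition key, but the keys share a handful of physical partitions.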

In this case, 100 million users will be distributed among 10 physical partitions. Hence the full cross-partition query will hit 10 physical partitions. Also note that these partitions will be hit in parallel, so the expected latency would be similar to hitting one partition, unless the operation is aggregate-like in nature.
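The parallel fan-out can be simulated as follows. This is a toy model, not Cosmos DB client code: the per-partition latencies are made-up numbers and `query_partition` is a hypothetical stand-in for one partition answering its share of the query.

```python
import asyncio
import time

# Hypothetical per-partition latencies in seconds; not measured values.
PARTITION_LATENCIES = [0.02, 0.03, 0.05, 0.02, 0.04, 0.03, 0.02, 0.05, 0.03, 0.04]


async def query_partition(index: int, latency: float) -> list[str]:
    """Stand-in for one physical partition returning its partial result."""
    await asyncio.sleep(latency)
    return [f"follower-from-partition-{index}"]


async def fan_out() -> tuple[list[str], float]:
    """Query every partition concurrently and merge the partial results."""
    started = time.perf_counter()
    partials = await asyncio.gather(
        *(query_partition(i, lat) for i, lat in enumerate(PARTITION_LATENCIES))
    )
    elapsed = time.perf_counter() - started
    merged = [row for partial in partials for row in partial]
    return merged, elapsed


results, elapsed = asyncio.run(fan_out())
print(f"{len(results)} partial results in {elapsed:.3f}s")
print(f"slowest partition: {max(PARTITION_LATENCIES):.3f}s, "
      f"sequential total: {sum(PARTITION_LATENCIES):.3f}s")
```

In this model the elapsed time tracks the slowest partition rather than the sum, which is the point being made about latency. An aggregate-style operation would still need all partial results merged before it can answer.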

answered Oct 13 '22 14:10

Jayanta Mondal

This is a classic partitioning dilemma, not unique to Cosmos/Graph.

If your usage pattern is lots of queries with small scope, then cross-partition queries are bad. If queries return large data sets, then the cross-partition overhead is probably insignificant against the benefits of parallelism. Unless you have a constant high volume of queries, I think the cross-partition overhead is overstated (MS seem to think everyone is building the next Facebook on Cosmos).

In the OP's case you can optimise for x follows y, or x is followed by y, or both by having an edge each way. Note that RUs are reserved on a per-partition basis (i.e. total RU / number of partitions), so to use them efficiently you need either high-volume, evenly distributed single-partition queries or queries that span multiple partitions.
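As a rough worked example of the RU point, assuming an even split and made-up figures (10,000 RU/s over 10 physical partitions; both numbers and the helper names are hypothetical):

```python
def ru_per_partition(total_ru: float, physical_partitions: int) -> float:
    """Even split of provisioned throughput: total RU / number of partitions."""
    return total_ru / physical_partitions


def usable_ru(total_ru: float, physical_partitions: int, partitions_touched: int) -> float:
    """Throughput a workload can draw on if it only touches some partitions."""
    touched = min(partitions_touched, physical_partitions)
    return ru_per_partition(total_ru, physical_partitions) * touched


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
TOTAL_RU = 10_000
PARTITIONS = 10

print(ru_per_partition(TOTAL_RU, PARTITIONS))  # 1000.0
print(usable_ru(TOTAL_RU, PARTITIONS, 1))      # 1000.0
print(usable_ru(TOTAL_RU, PARTITIONS, 10))     # 10000.0
```

A workload concentrated on a single partition can only reach a tenth of what was provisioned in this example, whereas evenly spread or multi-partition queries can draw on all of it.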

answered Oct 13 '22 15:10

Ian Bennett
