Cosmos DB Graph - Performance and throughput of Gremlin.Net vs Microsoft.Azure.Graphs

As I'm learning how to use graph with Cosmos DB, I found two Microsoft tutorials:

  • One using Gremlin.Net
  • The other using Microsoft.Azure.Graphs (pre-release)

Although I use the same query with both, it executes differently.

Using Gremlin.Net, it executes all at once. I very often (I'd say 70% of the time) get a RequestRateTooLargeException. If I understand correctly, it means that I keep reaching the 400 RU/s limit that I chose to start with. However, when the query goes through, it is twice as fast as the solution with Microsoft.Azure.Graphs.

Indeed, with Microsoft.Azure.Graphs, I have to call ExecuteNextAsync in a loop, which returns one result at a time.

So the questions are:
1°) Which method should I use for better performance?
2°) How can I know the RU of my query so I can fine tune it?
3°) Is it possible to increase the throughput of an existing collection?

Update

Re question 3, I found that in the "Data Explorer" blade of my database, there is a "Scale & Settings" for my graph where I can update the throughput.

Update2

Re question 2, we can't get the RU charge when using the first method (Gremlin.Net), but with Microsoft.Azure.Graphs the ExecuteNextAsync method returns a FeedResponse with a RequestCharge field.

François asked Feb 26 '18 17:02


1 Answer

The reason you are hitting RequestRateTooLarge exceptions (429 status code) with Gremlin.NET but not with Microsoft.Azure.Graphs is likely the difference between the retry policy on the CosmosDB Gremlin server and the default retry policy for DocumentClient.

The default retry behavior for DocumentClient with regards to these errors is described here:

By default, the DocumentClientException with status code 429 is returned after a cumulative wait time of 30 seconds if the request continues to operate above the request rate.

Therefore, Microsoft.Azure.Graphs may be internally handling and retrying these errors from the server and eventually succeeding. This improves request reliability, but it hides the request rate errors and lengthens execution time.

On CosmosDB Gremlin server, this retry policy allowance is reduced significantly, so RequestRateTooLargeException errors will be surfaced sooner.

To answer your questions:

1. Which method should I use for better performance?

Using the CosmosDB Gremlin server via Gremlin.NET is expected to perform better. Microsoft.Azure.Graphs uses a different request processing approach that involves more round-trips to the server, so it has more overhead, and it is also a number of releases behind what is deployed to the server.

2. How can I know the RU of my query so I can fine tune it?

RU charges will be included in the Gremlin server responses as attributes. Currently Gremlin.NET doesn't have a way of exposing attributes on the response; however, changes to the client driver are being discussed here.

In the interim, you can observe how frequently your requests hit 429 errors through the Metrics blade on your Azure CosmosDB Account portal. This presents aggregated views of the number of requests, requests that exceeded capacity, storage quota, etc. for a given collection.
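
Once a request charge is known (for instance from the RequestCharge field mentioned in the question's second update), a rough capacity estimate is simple arithmetic: provisioned RU/s divided by RU per query. The sketch below is written in Python for brevity; the function names and the per-page charges are hypothetical and only illustrate the calculation.

```python
def total_request_charge(page_charges):
    """Sum the RU charge reported for each page of a single query."""
    return sum(page_charges)


def max_queries_per_second(provisioned_ru_per_sec, charge_per_query):
    """Rough upper bound on the query rate that fits in the provisioned RU/s."""
    if charge_per_query <= 0:
        raise ValueError("charge_per_query must be positive")
    return provisioned_ru_per_sec / charge_per_query


# Hypothetical per-page charges, as a paged read loop might report them.
page_charges = [12.5, 12.5, 15.0]
charge = total_request_charge(page_charges)   # 40.0 RU for the whole query
rate = max_queries_per_second(400, charge)    # 10.0 queries per second at 400 RU/s
```

If the observed request rate is above this bound, 429 errors are to be expected until either the query gets cheaper or the throughput is raised.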

3. Is it possible to increase the throughput of an existing collection?

As you found, you can increase throughput for an existing collection via the portal. Alternatively, this can be done programmatically via the Microsoft.Azure.Documents SDK.


In closing, my recommendation would be to add a retry policy around Gremlin.NET requests to handle these exceptions, matching on the RequestRateTooLargeException message.
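
A minimal sketch of such a retry wrapper follows, written in Python for brevity. ResponseError and the submit callable are hypothetical stand-ins for the driver's exception type and its query call, and the backoff values are assumptions to tune against your own workload; the same shape carries over to C#.

```python
import random
import time


class ResponseError(Exception):
    """Hypothetical stand-in for the driver's server-error exception."""


def submit_with_retry(submit, query, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call submit(query), retrying only when the error message names a rate-limit failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return submit(query)
        except ResponseError as err:
            throttled = "RequestRateTooLarge" in str(err)
            if not throttled or attempt == max_attempts:
                raise  # not a throttling error, or retries exhausted
            # Exponential backoff with a little jitter so retries do not align.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))
```

Matching on the message text is brittle, so it is worth treating this as a stopgap until the driver exposes a status code that can be checked directly.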

When response status attributes are exposed on Gremlin.NET, they will include:

  • Request charge,
  • CosmosDB-specific status code (e.g. 429), and
  • Retry-after value, which indicates the time to wait in order to avoid hitting 429 errors.

Oliver Towers answered Oct 03 '22 04:10