As I'm learning how to use graph with Cosmos DB, I found two Microsoft tutorials:
While I use the same query, its execution differs.
Using Gremlin.Net, it executes at once. Very often (I'd say 70% of the time) I get a RequestRateTooLargeException. If I understand correctly, it means that I keep reaching the 400 RU/s limit that I chose to start with. However, when the query goes through, it is twice as fast as the solution with Microsoft.Azure.Graphs.
Indeed, with Microsoft.Azure.Graphs, I have to call ExecuteNextAsync in a loop, which returns one result at a time.
So the questions are:
1°) Which method should I use for better performance?
2°) How can I know the RU of my query so I can fine tune it?
3°) Is it possible to increase the throughput of an existing collection?
Update
Re question 3, I found that in the "Data Explorer" blade of my database, there is a "Scale & Settings" for my graph where I can update the throughput.
Update 2
Re question 2, we can't get the RU charged when using the first method (Gremlin.Net), but with Microsoft.Azure.Graphs the method ExecuteNextAsync returns a FeedResponse with a RequestCharge field.
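To make the RU arithmetic concrete, here is a minimal sketch in Python (used only as a neutral illustration, not the .NET SDK). It assumes you have already collected the RequestCharge value of each page returned by the ExecuteNextAsync loop; the function names and the sample charges are hypothetical.

```python
def total_request_charge(page_charges):
    """Sum the per-page RequestCharge values collected from the paging loop."""
    return sum(page_charges)


def max_queries_per_second(query_ru, provisioned_ru_per_sec=400):
    """Rough upper bound on how often a query can run before throttling.

    Assumes the query is the only workload consuming the collection's RU/s.
    """
    if query_ru <= 0:
        raise ValueError("query_ru must be positive")
    return provisioned_ru_per_sec / query_ru


# Hypothetical charges, one per page of results
charges = [12.5, 8.0, 4.5]
query_cost = total_request_charge(charges)
print(query_cost, max_queries_per_second(query_cost))
```

If a single query costs more RU than the collection has provisioned per second, the bound falls below 1, which would be consistent with seeing frequent throttling at 400 RU/s.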
You specify the exact amount of throughput you need and Azure Cosmos DB guarantees the configured throughput, backed by SLA. You can start with a minimum throughput of 400 RU/sec and scale up to tens of millions of requests per second or even more.
Azure Cosmos DB is a fast and flexible distributed database that scales seamlessly with guaranteed latency and throughput. You don't have to make major architecture changes or write complex code to scale your database with Azure Cosmos DB.
Monitor from Azure Cosmos DB portal: You can monitor with the metrics available within the Metrics tab of the Azure Cosmos DB account. The metrics on this tab include throughput, storage, availability, latency, consistency, and system level metrics. By default, these metrics have a retention period of seven days.
Provisioned throughput mode: In this mode, you provision the number of RUs for your application on a per-second basis in increments of 100 RUs per second.
The reason you are hitting RequestRateTooLarge exceptions (429 status code) via Gremlin.NET but not via Microsoft.Azure.Graphs is likely the difference between the retry policy on the CosmosDB Gremlin server and the default retry policy for DocumentClient.
The default retry behavior for DocumentClient with regards to these errors is described here:
By default, the DocumentClientException with status code 429 is returned after a cumulative wait time of 30 seconds if the request continues to operate above the request rate.
Therefore, Microsoft.Azure.Graphs may be internally handling and retrying these errors from the server and eventually succeeding. This has the benefit of improving request reliability but obfuscates the request rate errors, and will impact execution duration.
On the CosmosDB Gremlin server, this retry policy allowance is reduced significantly, so RequestRateTooLargeException errors will be surfaced sooner.
To answer your questions:
1. Which method should I use for better performance?
Using the CosmosDB Gremlin server via Gremlin.NET is expected to give better performance. Microsoft.Azure.Graphs uses a different request processing approach that involves more round-trips to the server, so it has overhead, in addition to being a number of releases behind what is deployed to the server.
2. How can I know the RU of my query so I can fine tune it?
RU charges will be included in the Gremlin server responses as attributes. Currently Gremlin.NET doesn't have a way of exposing attributes on the response; however, changes to the client driver are being discussed here.
In the interim, you can observe how frequently your requests hit 429 errors through the Metrics blade on your Azure CosmosDB Account portal. This presents aggregated views of the number of requests, requests that exceeded capacity, storage quota, etc. for a given collection.
3. Is it possible to increase the throughput of an existing collection?
As you found, you can increase throughput for an existing collection via the portal. Alternatively, this can be done programmatically via the Microsoft.Azure.Documents SDK.
In closing, my recommendation would be to add a retry policy around Gremlin.NET requests to handle these exceptions, matching on RequestRateTooLargeException in the exception message.
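As a rough illustration of that recommendation, here is a minimal retry-with-backoff sketch. It is written in Python to stay driver-neutral and is not Gremlin.NET code: `submit` stands for whatever function sends the query in your client, and the attempt count and delays are assumptions to tune, not recommended values.

```python
import random
import time

# Substring to look for in the exception message, as suggested above
THROTTLE_MARKER = "RequestRateTooLargeException"


def submit_with_retry(submit, query, max_attempts=5, base_delay=0.5,
                      sleep=time.sleep):
    """Call submit(query), retrying with exponential backoff when throttled.

    `submit` is a hypothetical callable wrapping your Gremlin client.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit(query)
        except Exception as exc:
            throttled = THROTTLE_MARKER in str(exc)
            # Re-raise anything that is not throttling, or the final failure
            if not throttled or attempt == max_attempts:
                raise
            # 0.5s, 1s, 2s, ... plus jitter so clients do not retry in lockstep
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

The same shape carries over to C#: wrap the call that submits the query in a try/catch, check the exception message for RequestRateTooLargeException, and wait before retrying. Retrying trades latency for reliability, much like the DocumentClient behaviour described earlier.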
When response status attributes are exposed on Gremlin.NET, they will include: