
Auto scale up/down Cosmos DB RUs

We experience throttling (429) due to bursts of high traffic for a period of time. To mitigate this, we currently increase the RUs in the Azure portal and decrease them later.

I want to scale up/down based on the metrics, but the portal does not expose the number of physical partitions created for the DocumentDB container.

  • How can I get the number of physical partitions for a DocumentDB container?
  • If someone here has solved the auto-scaling problem, I'm eager to know how.
asked Feb 06 '18 by Saravanan

People also ask

How do you Autoscale a Cosmos DB?

Navigate to your Azure Cosmos DB account and open the Data Explorer tab. Select New Container. Enter a name for your database, container, and a partition key. Under database or container throughput, select the Autoscale option, and set the maximum throughput (RU/s) that you want the database or container to scale to.
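
For reference, the same can be done from code. A minimal sketch with the newer .NET SDK (Microsoft.Azure.Cosmos v3); the endpoint, key, and database/container names below are placeholders:

using Microsoft.Azure.Cosmos;

// Sketch: create a container with autoscale throughput (v3 .NET SDK).
// Endpoint, key, and all names are placeholders.
var client = new CosmosClient("https://myaccount.documents.azure.com:443/", "myAuthKey");
Database database = await client.CreateDatabaseIfNotExistsAsync("myDb");

// Cosmos DB will scale between 10% of the max and the max (here 400-4000 RU/s).
Container container = await database.CreateContainerIfNotExistsAsync(
    new ContainerProperties("myContainer", "/myPartitionKey"),
    ThroughputProperties.CreateAutoscaleThroughput(4000));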

Is Cosmos DB scalable?

Azure Cosmos DB scales the throughput T such that 0.1 * Tmax <= T <= Tmax. For example, if you set the maximum throughput to 20,000 RU/s, the throughput will scale between 2,000 and 20,000 RU/s. Because scaling is automatic and instantaneous, at any point in time you can consume up to the provisioned Tmax with no delay.

What is horizontal scaling in Cosmos DB?

Azure Cosmos DB distributes your data across logical and physical partitions based on your partition key to enable horizontal scaling. As data gets written, Azure Cosmos DB uses the hash of the partition key value to determine which logical and physical partition the data lives on.

What is _etag in Cosmos DB?

Every item stored in an Azure Cosmos DB container has a system-defined _etag property. The value of the _etag is automatically generated and updated by the server every time the item is updated.
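
A common use of _etag is optimistic concurrency: pass it as an If-Match condition so a replace only succeeds if the item has not changed in between. A hedged sketch with the classic DocumentDB SDK (database, collection, id, and partition key values are placeholders):

using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

// Sketch: only replace the document if nobody updated it since we read it.
Document doc = await client.ReadDocumentAsync(
    UriFactory.CreateDocumentUri("myDb", "myColl", "myId"),
    new RequestOptions { PartitionKey = new PartitionKey("myPkValue") });

doc.SetPropertyValue("status", "processed");

// If-Match on _etag: a concurrent update makes this fail with 412 Precondition Failed.
await client.ReplaceDocumentAsync(doc.SelfLink, doc, new RequestOptions
{
    AccessCondition = new AccessCondition
    {
        Type = AccessConditionType.IfMatch,
        Condition = doc.ETag
    }
});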


2 Answers

What to base the desired RU limit on

I would not go down to the physical-partition level at all, as the load probably does not distribute evenly across partitions anyway. You presumably don't care about the average partition's throughput but need to take care of the worst one.

So, if you need full auto-scale, I would concentrate on tracking throttling events (which occur after the fact) or monitoring total RU usage (which glosses over the partitioning magic). Both paths can get really complex if you want true auto-scale, and a combination of the two would probably be needed. While upscaling seems achievable, deciding when to come back down, and to what level, is trickier.
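
For the throttling-tracking path, a hedged sketch of what catching 429s could look like with the classic DocumentDB SDK; RecordThrottle is a hypothetical hook into whatever metric sink feeds your scaler:

using System.Net;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

// Sketch: detect throttling (429) at the call site and record it for the scaler.
try
{
    await client.CreateDocumentAsync(collection.SelfLink, myDocument);
}
catch (DocumentClientException ex) when (ex.StatusCode == (HttpStatusCode)429)
{
    // RetryAfter is the back-off the service suggests before retrying.
    RecordThrottle(ex.RetryAfter); // hypothetical metrics hook
    throw;
}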

It is hard to expect the unexpected and reliably react to things before they happen. Definitely consider if it's worth it in your scenario compared to simpler solutions.

Calendar-based RU limit baseline

An even simpler solution would be to just set the RU limit from a prepared schedule (e.g. weekday + time of day) that follows the average peak-load trends.

This will not autoscale for unexpected peaks or fall-offs and would require some monitoring to adjust to the unexpected, but you have that anyway, right? What such a simple solution would give you is a flexible throughput limit and a predictable cost for the average day, with minimal effort.
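
As an illustration, such a schedule can be as simple as a lookup function; all the numbers below are made up and should follow your own peak trends:

// Sketch: pick the RU baseline from a prepared weekday/time-of-day schedule.
// All numbers are illustrative, not recommendations.
static int GetScheduledThroughput(DateTime utcNow)
{
    bool weekend = utcNow.DayOfWeek == DayOfWeek.Saturday
                || utcNow.DayOfWeek == DayOfWeek.Sunday;

    if (weekend) return 1000;                               // quiet weekends
    if (utcNow.Hour >= 8 && utcNow.Hour < 18) return 5000;  // business-hours peak
    return 2000;                                            // weekday off-hours
}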

Changing RU limit

Once you know WHAT RU limit you want at any given time, executing the change is easy. Increasing or decreasing the RU limit can be programmed and, for example, run through Azure Functions. A C# example for actually changing the limit would be along the lines of:

// Find the current offer (throughput setting) for the collection.
var offer = client.CreateOfferQuery()
    .Where(o => o.ResourceLink == collection.SelfLink)
    .AsEnumerable().Single();

// Replace it with a new offer at the desired RU/s.
offer = new OfferV2(offer, newThroughput);
await client.ReplaceOfferAsync(offer);

Your Azure Function could tick periodically and, depending on your configured schedule or gathered events, adjust newThroughput accordingly.
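
A hedged sketch of such a function (timer-triggered Azure Function; GetScheduledThroughput is the schedule lookup sketched earlier, and client/collection are assumed to be initialized elsewhere as in the snippet above):

using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class CosmosRescaler
{
    // Sketch: runs every 15 minutes and applies the scheduled RU limit.
    // client (DocumentClient) and collection are assumed wired up elsewhere.
    [FunctionName("CosmosRescaler")]
    public static async Task Run(
        [TimerTrigger("0 */15 * * * *")] TimerInfo timer, ILogger log)
    {
        int newThroughput = GetScheduledThroughput(DateTime.UtcNow);

        var offer = client.CreateOfferQuery()
            .Where(o => o.ResourceLink == collection.SelfLink)
            .AsEnumerable().Single();
        await client.ReplaceOfferAsync(new OfferV2(offer, newThroughput));

        log.LogInformation($"Cosmos DB throughput set to {newThroughput} RU/s.");
    }
}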

A note of caution

Whatever autoscale solution you implement, do think about setting reasonable hard limits for how high you are willing to go. Otherwise you could get unexpected bills from Azure in case of mishaps or malicious activity (DDoS). At some point, being throttled is the better outcome.
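
For example, a small clamp in the scaling code enforces such a ceiling; MaxRuHardLimit below is a placeholder for your own cost ceiling:

// Sketch: clamp whatever the autoscale logic computed to hard bounds.
const int MinRu = 400;            // typical Cosmos DB container minimum
const int MaxRuHardLimit = 20000; // placeholder: your cost ceiling
int safeThroughput = Math.Max(MinRu, Math.Min(newThroughput, MaxRuHardLimit));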

answered Sep 27 '22 by Imre Pühvel


https://github.com/giorgited/CosmosScale

I wrote this library to help with autoscaling. We were using Azure Functions to scale up in the morning and back down at night, but realized that it was not very efficient.

The above-mentioned library will scale up to the maximum desired RUs provided by the user and scale back down when there is no activity. It handles bulk operations differently than single operations; see the GitHub page for full info, including benchmark stats.

Disclaimer: I am an author of this library.

answered Sep 27 '22 by scorpion5211