 

CosmosDB - DocumentDB - Bulk insert without saturating collection RU

I am investigating using Azure CosmosDB for an application that would require high read throughput and the ability to scale. 99% of the activity would be reads, but occasionally we would need to insert anywhere from just a few documents up to a batch of a few million.

I have created a collection to test with and provisioned 2500 RU/sec. However, I am running into issues inserting even just 120 small (500-byte) documents (I get a "request rate is large" error).

How can I possibly use DocumentDB in any useful way if, any time I want to insert some documents, the inserts use all my RUs and prevent anyone from reading?

Yes, I can increase the RUs provisioned, but if I only need 2500 for reads, I don't want to have to pay for 10000 just for the occasional insert.

Reads need to be as fast as possible, ideally in the "single-digit-millisecond" range that Microsoft advertises. The inserts do not need to be as fast as possible, but faster is better.

I have tried using a stored procedure, which I have seen suggested, but that also fails to insert everything reliably. I have also tried creating my own bulk insert method using multiple threads, as suggested in the answer here, but it is very slow, often errors on at least some documents, and seems to average an RU rate well below what I've provisioned.
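
Simplified (the identifiers here are placeholders, not my actual code), the multi-threaded attempt looks roughly like this:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

public static class NaiveBulkInsert
{
    public static async Task InsertAsync(DocumentClient client, Uri collectionUri,
        IEnumerable<object> documents, int maxParallelism = 10)
    {
        // Cap concurrency with a semaphore; too low and throughput crawls,
        // too high and the collection starts returning 429 "request rate is large".
        var throttle = new SemaphoreSlim(maxParallelism);
        var tasks = new List<Task>();

        foreach (var document in documents)
        {
            await throttle.WaitAsync();
            tasks.Add(Task.Run(async () =>
            {
                try { await client.CreateDocumentAsync(collectionUri, document); }
                finally { throttle.Release(); }
            }));
        }

        await Task.WhenAll(tasks);
    }
}
```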

I feel like I must be missing something. Do I have to massively over-provision RUs just for writes? Is there some kind of built-in functionality to limit the RU usage for inserts? How is it possible to insert hundreds of thousands of documents in a reasonable amount of time, and without making the collection unusable?

asked Aug 11 '17 by QTom


1 Answer

Performing bulk inserts of millions of documents is possible under certain circumstances. We just went through an exercise at my company of moving 100M records from various tables in an Azure SQL DB to CosmosDb.

  • It's very important to understand CosmosDB partitions. Choosing a good partition key that spreads your data out among partitions is critical to getting the kind of throughput you're looking for. Each physical partition has a maximum throughput of 10k RU/s. If you're trying to shove all of your data into a single partition, it doesn't matter how many RU/s you provision, because anything above 10k is wasted (assuming nothing else is going on in your container).
  • Also, each logical partition has a max size of 20GB. Once you hit 20GB, you'll get errors if you attempt to add more records. Yet another reason to choose your partition key wisely.
  • Use Bulk Insert. Here's a great video that offers a walkthrough. With the latest NuGet package, it's surprisingly easy to use (see the sketch after this list). I found the video to be a much better explanation than what's on learn.microsoft.com.
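
As a rough sketch (assuming the v3 Microsoft.Azure.Cosmos NuGet package; the database, container, and Order type here are just examples), bulk mode combined with a partition key that spreads writes looks something like this:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class Order
{
    public string id { get; set; }          // Cosmos documents require a lowercase "id" property
    public string customerId { get; set; }  // partition key value in this example
    public decimal total { get; set; }
}

public static class BulkLoader
{
    public static async Task LoadAsync(string connectionString, IEnumerable<Order> orders)
    {
        // Bulk mode tells the SDK to group concurrent point operations into batches.
        var client = new CosmosClient(connectionString,
            new CosmosClientOptions { AllowBulkExecution = true });

        // Container partitioned on /customerId so writes spread across partitions.
        Database db = await client.CreateDatabaseIfNotExistsAsync("store");
        Container container = await db.CreateContainerIfNotExistsAsync(
            new ContainerProperties("orders", "/customerId"));

        // Queue up all the inserts and let the SDK batch them per partition.
        var tasks = new List<Task>();
        foreach (var order in orders)
        {
            tasks.Add(container.CreateItemAsync(order, new PartitionKey(order.customerId)));
        }
        await Task.WhenAll(tasks);
    }
}
```

With AllowBulkExecution turned on, the SDK handles grouping those concurrent CreateItemAsync calls behind the scenes, which is why it needs so little code compared to hand-rolled threading.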

Edit: CosmosDB now has Autoscale. With Autoscale enabled, your collection will remain at a lower provisioned RU/s and will automatically scale up to a max threshold when under load. This will save you a ton of money with your specified use case. We've been using this feature since it went GA.
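
For illustration (again assuming the v3 .NET SDK; the 20,000 RU/s ceiling is just an example value), autoscale throughput can be set when a container is created, or the max can be adjusted later:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class AutoscaleSetup
{
    public static async Task EnableAsync(Database db)
    {
        // Autoscale floor is 10% of the max: a 20,000 RU/s ceiling idles at 2,000 RU/s.
        ThroughputProperties autoscale = ThroughputProperties.CreateAutoscaleThroughput(20000);

        // Create a new container with autoscale throughput...
        Container container = await db.CreateContainerIfNotExistsAsync(
            new ContainerProperties("orders", "/customerId"), autoscale);

        // ...or adjust the autoscale max on a container that already uses autoscale.
        await container.ReplaceThroughputAsync(autoscale);
    }
}
```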

If the majority of your ops are reads, look into Integrated Cache. As of right now, it's in public preview. I haven't played with this, but it can save you money if your traffic is read-heavy.
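
Going by the docs (untested on my end; this assumes a dedicated gateway connection string and the SDK's DedicatedGatewayRequestOptions), reads opt into the cache roughly like this:

```csharp
using System;
using Microsoft.Azure.Cosmos;

public static class CachedReads
{
    public static CosmosClient CreateGatewayClient(string dedicatedGatewayConnectionString)
    {
        // The integrated cache only applies to requests routed through the dedicated gateway.
        return new CosmosClient(dedicatedGatewayConnectionString,
            new CosmosClientOptions { ConnectionMode = ConnectionMode.Gateway });
    }

    public static ItemRequestOptions CacheFriendlyOptions()
    {
        // Accept cached results up to 5 minutes old; the cache requires
        // session or eventual consistency on the request.
        return new ItemRequestOptions
        {
            ConsistencyLevel = ConsistencyLevel.Eventual,
            DedicatedGatewayRequestOptions = new DedicatedGatewayRequestOptions
            {
                MaxIntegratedCacheStaleness = TimeSpan.FromMinutes(5)
            }
        };
    }
}
```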

answered Nov 01 '22 by Rob Reagan