
How can I choose the right key-value store for my use case?

I will describe the data and the use case.

record {
    customerId: "id", <---- indexed
    binaryData: "data" <---- not indexed
}

Expectations:

  • customerId is a random 10-digit number
  • Average size of a record's binary data: 1-2 kilobytes
  • There may be up to 100 records per customerId
  • Overall number of records: 500M
  • Write pattern #1: insert one record at a time
  • Write pattern #2: batch, possibly in parallel, at a speed of at least 20M records per hour
  • Search pattern #1: find all records by customerId
  • Search pattern #2: find all records for a batch of customerIds, in parallel, at a rate of at least 10M customerIds per hour
  • The data is not critical; we can trade some aspects of reliability for speed
  • We are expected to run in AWS / GCP - ideally the key-value store is managed by the cloud provider
  • We want to spend no more than 1K USD per month on cloud costs for this solution

What we have tried:

We have this approach implemented in a relational database, AWS RDS MariaDB. The server has 32GB RAM, a 2TB GP2 SSD, and 8 CPUs. I found that IOPS usage was high and insert speed was not satisfactory. After investigation I concluded that, due to the random nature of customerId, there is a high rate of scattered writes to the index. After this I did the following:

  • input data is sorted by customerId ASC
  • An additional trade-off was made to reduce index size at the cost of a small degradation in single-record read speed: I introduced buckets, where records 1111111185 and 1111111186 go to the same "bucket" 11111111. This way a bucket can't contain more than 100 customerIds, so read speed stays acceptable while write speed improves (see the sketch after this list).
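
A minimal sketch of that bucketing scheme, assuming the bucket is simply the 10-digit customerId with its last two digits dropped (the function names here are illustrative, not the actual schema):

    # Bucketing sketch: 1111111185 and 1111111186 both map to bucket 11111111,
    # so a bucket holds at most 100 customerIds. Names are illustrative only.

    def bucket_of(customer_id: int) -> int:
        """Drop the last two digits of the 10-digit customerId."""
        return customer_id // 100

    def prepare_batch(records):
        """Sort (customer_id, binary_data) pairs by customerId ASC so index
        writes land on mostly-sequential pages."""
        return sorted(records, key=lambda r: r[0])

    rows = prepare_batch([(1111111186, b"data-b"), (1111111185, b"data-a")])
    for customer_id, data in rows:
        print(bucket_of(customer_id), customer_id, len(data))
    # 11111111 1111111185 6
    # 11111111 1111111186 6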

Even with these changes, I could not achieve more than 1-3M record writes per hour. Different write concurrencies were tested; the current value is 4 concurrent writers. After all modifications it is not clear what else we can improve:

  • IOPS is not maxed out (~4K per second),
  • CPU usage is not high,
  • Network is not fully utilized,
  • Write and read throughputs are not capped.

Apparently, ACID guarantees are holding us back. I am looking for a horizontally scalable key-value store and would be glad to hear any ideas and rough estimates.

asked Aug 26 '20 by snowindy


1 Answer

So if I understand you...

  • 2 KB * 500M records ≈ 1 TB of data
  • 20M writes/hr ≈ 5.5K writes/sec

That's quite doable in NoSQL.
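
A quick back-of-the-envelope check of those numbers (a sketch only; it assumes the 2 KB worst-case record size from the question):

    # Back-of-the-envelope sizing from the question's numbers.
    records = 500_000_000                 # total records
    record_size = 2 * 1024                # bytes, worst case per the question

    raw_data_tb = records * record_size / 1e12
    print(f"raw data: ~{raw_data_tb:.1f} TB")                     # ~1.0 TB

    writes_per_hour = 20_000_000
    print(f"write rate: ~{writes_per_hour / 3600:,.0f}/sec")      # ~5,556/sec

    lookups_per_hour = 10_000_000         # customerIds queried per hour
    print(f"lookup rate: ~{lookups_per_hour / 3600:,.0f}/sec")    # ~2,778/sec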

The scale is not the issue. It's your cost.

$1k a month for 1 TB of data sounds like a reasonable goal. I just don't think that the public clouds are quite there yet.

Let me give an example with my recommendation: Scylla Cloud and Scylla Open Source. (Disclosure: I work for ScyllaDB.)

I will caution you that your $1k/month cap on costs will likely force you to consider some tradeoffs.

As is typical in high availability deployments, to ensure data redundancy in case of node failure, you could use 3x i3.2xlarge instances on AWS (can store 1.9 TB per instance).

You want the extra capacity to run compactions. Scylla uses incremental compaction, which saves on space amplification, but you don't want to go with i3.xlarge instances (0.9 TB each), which is under your 1 TB requirement, unless you are really pressed on costs. In that case you'll have to do some sort of data eviction (like a TTL) to keep your data set under roughly 600 GB.
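
To make the sizing concrete, here is a rough storage calculation (a sketch under the assumptions above: ~1 TB of raw data, replication factor 3, and headroom left for compaction):

    # Rough cluster storage math for the 3-node layout described above.
    raw_data_tb = 1.0                     # from the sizing estimate
    replication_factor = 3                # keep 3 copies for node-failure redundancy
    total_stored_tb = raw_data_tb * replication_factor   # ~3 TB across the cluster

    nodes = 3
    per_node_tb = total_stored_tb / nodes                 # ~1 TB per node

    i3_2xlarge_disk_tb = 1.9              # NVMe per i3.2xlarge
    i3_xlarge_disk_tb = 0.9               # NVMe per i3.xlarge (too small here)

    print(f"per-node data: ~{per_node_tb:.1f} TB")
    print(f"i3.2xlarge headroom: ~{i3_2xlarge_disk_tb - per_node_tb:.1f} TB for compaction")
    print(f"i3.xlarge headroom: ~{i3_xlarge_disk_tb - per_node_tb:.1f} TB (not enough)")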

Even with annual reserved pricing for Scylla Cloud (see here: https://www.scylladb.com/product/scylla-cloud/#pricing) of $764.60/server, running the three i3.2xlarge instances would be $2,293.80/month - more than twice your budget.

Now, if you eschew managed services and want to run self-service, you could go with Scylla Open Source and just look at the on-demand instance pricing (see here: https://aws.amazon.com/ec2/pricing/on-demand/). For 3x i3.2xlarge, you are running each at $0.624/hour. That's a raw on-demand cost of $449.28 each per month, which doesn't include incidentals like backups, data transfer, etc. But you could get three instances for $1,347.84/month. Open Source. Not managed.

Still over your budget, but closer. If you could get reserved pricing, that might just make it.

Edit: Found the reserved pricing:

3x i3.2xlarge is going to cost you

  • At monthly pricing: $312.44 x 3 = $937.32/month, or
  • At 1-year up-front pricing: $3,482/year / 12 = $290.17/month/server x 3 = $870.50/month.

So, again, backups, monitoring, and other costs come on top of that. But you should be able to bring the raw server cost under $1,000 to meet your needs using Scylla Open Source.
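
Pulling the pricing options above together (a sketch; the per-hour and per-month figures are the ones quoted in this answer and will drift over time):

    # Raw EC2 cost comparison for 3x i3.2xlarge, using the prices quoted above.
    nodes = 3
    hours_per_month = 24 * 30

    on_demand = 0.624 * hours_per_month * nodes            # ~$1,347.84
    reserved_monthly = 312.44 * nodes                      # ~$937.32
    reserved_upfront = 3482 / 12 * nodes                   # ~$870.50
    scylla_cloud = 764.60 * nodes                          # ~$2,293.80 (managed)

    for name, cost in [("on-demand", on_demand),
                       ("reserved, monthly", reserved_monthly),
                       ("reserved, 1yr upfront", reserved_upfront),
                       ("Scylla Cloud (managed)", scylla_cloud)]:
        print(f"{name:24s} ${cost:,.2f}/month")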

But the admin burden is on your team (and their time isn't exactly zero cost).

For example, if you want monitoring on your system, you'll need to set up something like Prometheus, Grafana or Datadog. That will be other servers or services, and they aren't free. (The cost of backups and monitoring by our team are covered with Scylla Cloud. Part of the premium for the service.)

Another way to save money is to do only 2x replication, which puts your data at real risk if you lose a server. It is not recommended.

All of this was based on maximal assumptions about your data: that your records are all around 2 KB (not 1 KB), and that you're not getting much benefit from data compression, which ScyllaDB has built in - see part one (https://www.scylladb.com/2019/10/04/compression-in-scylla-part-one/) and part two (https://www.scylladb.com/2019/10/07/compression-in-scylla-part-two/).

To my mind, you should be able to squeak in under your $1k/month budget if you go with reserved pricing and open source, though adding monitoring, backups, and other incidental costs (which I haven't calculated here) may push you back over that number.

Otherwise, it's $2.3k/month for a fully managed cloud enterprise package, and you can sleep easy at night.

answered Dec 02 '22 by Peter Corless