
How can I choose the right key-value store for my use case?

I will describe the data and the use case.

record {
    customerId: "id", <---- indexed
    binaryData: "data" <---- not indexed
}

Expectations:

  • customerId is a random 10-digit number
  • Average size of a record's binary data: 1-2 kilobytes
  • There may be up to 100 records per customerId
  • Overall number of records: 500M
  • Write pattern #1: insert one record at a time
  • Write pattern #2: batch, possibly in parallel, at a speed of at least 20M records per hour
  • Search pattern #1: find all records by customerId
  • Search pattern #2: find all records for a batch of customerIds, in parallel, at a rate of at least 10M customerIds per hour
  • The data is not critical; we can trade some aspects of reliability for speed
  • We are expected to run in AWS / GCP - ideally the key-value store is managed by the cloud provider
  • We want to spend no more than 1K USD per month on cloud costs for this solution

What we have tried:

We have this approach implemented in a relational database, AWS RDS MariaDB. The server has 32GB RAM, a 2TB GP2 SSD, and 8 CPUs. I found that IOPS usage was high and insert speed was not satisfactory. After investigation I concluded that, due to the random nature of customerId, there is a high rate of scattered writes to the index. After this I did the following:

  • input data is sorted by customerId ASC
  • An additional trade-off was made to reduce index size at the cost of a small degradation in single-record read speed: I introduced buckets, where records 1111111185 and 1111111186 go to the same "bucket" 11111111. This way a bucket can't contain more than 100 customerIds, so read speed stays acceptable while write speed improves (see the sketch after this list).
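
A minimal sketch of that bucketing scheme, assuming the bucket is simply the 10-digit customerId with its last two digits dropped (the function names here are illustrative, not the actual schema):

    # Bucketing sketch: 1111111185 and 1111111186 both map to bucket 11111111,
    # so a bucket holds at most 100 customerIds. Names are illustrative only.

    def bucket_of(customer_id: int) -> int:
        """Drop the last two digits of the 10-digit customerId."""
        return customer_id // 100

    def prepare_batch(records):
        """Sort (customer_id, binary_data) pairs by customerId ASC so index
        writes land on mostly-sequential pages."""
        return sorted(records, key=lambda r: r[0])

    rows = prepare_batch([(1111111186, b"data-b"), (1111111185, b"data-a")])
    for customer_id, data in rows:
        print(bucket_of(customer_id), customer_id, len(data))
    # 11111111 1111111185 6
    # 11111111 1111111186 6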

Even with these changes, I could not achieve more than 1-3M record writes per hour. Different write concurrencies were tested; the current value is 4 concurrent writers. After all modifications it is not clear what else we can improve:

  • IOPS is not maxed out (~4K per second),
  • CPU usage is not high,
  • Network is not fully utilized,
  • Write and read throughputs are not capped.

Apparently, ACID guarantees are holding us back. I am looking for a horizontally scalable key-value store and would be glad to hear any ideas and rough estimates.

asked Aug 26 '20 by snowindy


1 Answer

So if I understand you...

  • 2 KB * 500M records ≈ 1 TB of data
  • 20M writes/hr ≈ 5.5K writes/sec

That's quite doable in NoSQL.
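
A quick back-of-the-envelope check of those numbers (a sketch only; it assumes the 2 KB worst-case record size from the question):

    # Back-of-the-envelope sizing from the question's numbers.
    records = 500_000_000                 # total records
    record_size = 2 * 1024                # bytes, worst case per the question

    raw_data_tb = records * record_size / 1e12
    print(f"raw data: ~{raw_data_tb:.1f} TB")                     # ~1.0 TB

    writes_per_hour = 20_000_000
    print(f"write rate: ~{writes_per_hour / 3600:,.0f}/sec")      # ~5,556/sec

    lookups_per_hour = 10_000_000         # customerIds queried per hour
    print(f"lookup rate: ~{lookups_per_hour / 3600:,.0f}/sec")    # ~2,778/sec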

The scale is not the issue. It's your cost.

$1k a month for 1 TB of data sounds like a reasonable goal. I just don't think that the public clouds are quite there yet.

Let me give an example with my recommendation: Scylla Cloud and Scylla Open Source. (Disclosure: I work for ScyllaDB.)

I will caution you that your $1k/month cap on costs will likely force you to consider some tradeoffs.

As is typical in high availability deployments, to ensure data redundancy in case of node failure, you could use 3x i3.2xlarge instances on AWS (can store 1.9 TB per instance).

You want the extra capacity to run compactions. Scylla uses incremental compaction, which saves on space amplification, but you don't want to go with i3.xlarge instances (0.9 TB each), which is under your 1 TB requirement, unless you are really pressed on costs. In that case you'll have to do some sort of data eviction (like a TTL) to keep your data set under roughly 600 GB.
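
To make the sizing concrete, here is a rough storage calculation (a sketch under the assumptions above: ~1 TB of raw data, replication factor 3, and headroom left for compaction):

    # Rough cluster storage math for the 3-node layout described above.
    raw_data_tb = 1.0                     # from the sizing estimate
    replication_factor = 3                # keep 3 copies for node-failure redundancy
    total_stored_tb = raw_data_tb * replication_factor   # ~3 TB across the cluster

    nodes = 3
    per_node_tb = total_stored_tb / nodes                 # ~1 TB per node

    i3_2xlarge_disk_tb = 1.9              # NVMe per i3.2xlarge
    i3_xlarge_disk_tb = 0.9               # NVMe per i3.xlarge (too small here)

    print(f"per-node data: ~{per_node_tb:.1f} TB")
    print(f"i3.2xlarge headroom: ~{i3_2xlarge_disk_tb - per_node_tb:.1f} TB for compaction")
    print(f"i3.xlarge headroom: ~{i3_xlarge_disk_tb - per_node_tb:.1f} TB (not enough)")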

Even with annual reserved pricing for Scylla Cloud (see here: https://www.scylladb.com/product/scylla-cloud/#pricing) of $764.60/server, running the three i3.2xlarge instances would be $2,293.80/month - more than twice your budget.

Now, if you eschew managed services and want to run self-service, you could go with Scylla Open Source and just look at the on-demand instance pricing (see here: https://aws.amazon.com/ec2/pricing/on-demand/). For 3x i3.2xlarge, you are running each at $0.624/hour. That's a raw on-demand cost of $449.28 each per month, which doesn't include incidentals like backups, data transfer, etc. But you could get three instances for $1,347.84/month. Open Source. Not managed.

Still over your budget, but closer. If you could get reserved pricing, that might just make it.

Edit: Found the reserved pricing:

3x i3.2xlarge is going to cost you

  • At monthly pricing: $312.44 x 3 = $937.32/month, or
  • At 1-year up-front pricing: $3,482/year / 12 = $290.17/month/server x 3 = $870.50/month.

So, again, backups, monitoring, and other costs come on top of that. But you should be able to bring the raw server cost under $1,000 to meet your needs using Scylla Open Source.
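
Pulling the pricing options above together (a sketch; the per-hour and per-month figures are the ones quoted in this answer and will drift over time):

    # Raw EC2 cost comparison for 3x i3.2xlarge, using the prices quoted above.
    nodes = 3
    hours_per_month = 24 * 30

    on_demand = 0.624 * hours_per_month * nodes            # ~$1,347.84
    reserved_monthly = 312.44 * nodes                      # ~$937.32
    reserved_upfront = 3482 / 12 * nodes                   # ~$870.50
    scylla_cloud = 764.60 * nodes                          # ~$2,293.80 (managed)

    for name, cost in [("on-demand", on_demand),
                       ("reserved, monthly", reserved_monthly),
                       ("reserved, 1yr upfront", reserved_upfront),
                       ("Scylla Cloud (managed)", scylla_cloud)]:
        print(f"{name:24s} ${cost:,.2f}/month")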

But the admin burden is on your team (and their time isn't exactly zero cost).

For example, if you want monitoring on your system, you'll need to set up something like Prometheus, Grafana or Datadog. That will be other servers or services, and they aren't free. (The cost of backups and monitoring by our team are covered with Scylla Cloud. Part of the premium for the service.)

Another way to save money is to do only 2x replication, which puts your data at real risk if you lose a server. It is not recommended.

All of this was based on maximal assumptions about your data: that your records are all around 2 KB (not 1 KB), and that you're not getting much benefit from data compression, which ScyllaDB has built in - see part one (https://www.scylladb.com/2019/10/04/compression-in-scylla-part-one/) and part two (https://www.scylladb.com/2019/10/07/compression-in-scylla-part-two/).

To my mind, you should be able to squeak in under your $1k/month budget if you go with reserved pricing and open source, though adding monitoring, backups, and other incidental costs (which I haven't calculated here) may push you back over that number.

Otherwise, it's $2.3k/month for a fully managed cloud enterprise package, and you can sleep easy at night.

answered Dec 02 '22 by Peter Corless