Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Key-value stores for medium to large values

We have a system that stores (single-digit) millions of images, varying in size from 8KB to 500KB, median around 15KB, average 30KB. The total data set is currently around 100GB. We want to access the image based upon a hash of the image (this can be changed, but it needs to be computable from the image for the purpose of checking whether an image is already in the data store efficiently — images are processed such that two images are pixel-for-pixel identical iff they are byte-for-byte identical). Persistence is (obviously) important.

At the moment we store them all as files within a directory — the listing of the directory are cached by the kernel, and actual file-reads are done as needed. As I understand it, the main advantage of key-value stores (versus using a filesystem as one) is reading smaller values, as the whole page can be cached, instead of just a single value. All the access currently comes from the web server (on an intranet) on the same server as the data, though we may move to checking whether keys exist from remote machines (mostly connected through 10GbE).

There isn't any particular reason to change it, though with other major parts of the system changing, it seems worthwhile to re-consider the current approach.

Given a workload whose reading is primarily (single) reads in insertion order and random (though quite possibly repeated) accesses to arbitrary keys, in addition to frequent writes (something of the order of magnitude 1:10 write:read), is there likely to be much advantage to moving to a key-value store from the filesystem?

like image 845
gsnedders Avatar asked Nov 18 '11 13:11

gsnedders


People also ask

Which is the best example for key-value store?

A telephone directory is a good example, where the key is the person or business name, and the value is the phone number. Stock trading data is another example of a key-value pair.

What are key-value database types?

A key-value database is a type of nonrelational database that uses a simple key-value method to store data. A key-value database stores data as a collection of key-value pairs in which a key serves as a unique identifier. Both keys and values can be anything, ranging from simple objects to complex compound objects.

Where are key-value stores used?

Key-value stores are used for use cases where applications will require values to be retrieved fast via keys, like maps or dictionaries in programming languages.

What key values we store in NoSQL database?

The keys are unique identifiers for the values. The values can be any type of object -- a number or a string, or even another key-value pair in which case the structure of the database grows more complex. Unlike relational databases, key-value databases do not have a specified structure.


1 Answers

Summary: For your requirements of Data Integrity, Persistence, Size & Speed I recommend Redis.

A nice intro presentation can be seen here:
https://simonwillison.net/static/2010/redis-tutorial/

n.b. More info would help but based on what you've given + what I know, here are some of the main players:

Memcached:
https://memcached.org/
A free, open source, high-performance, distributed memory object caching system, good for speeding up dynamic web applications.
+ good for web applications, free, open source.
- if the server goes down (memcached process failure or system reboot) all sessions are lost. Performance limitations at the higher (commercial usage) levels.

Redis:
https://redis.io/
Similar to memcached but with data persistence, supports multiple value types, counters with atomic increment/decrement and built-in key expiration.
+ saves data to disk so never lost, very simple, speed, flexibility (keys can contain strings, hashes, lists, sets and sorted sets), sharding, maintained by vmware rather than an individual.
- limited clustering.

LevelDB:
https://google-opensource.blogspot.com/2011/07/leveldb-fast-persistent-key-value-store.html
A fast key-value storage engine written at Google that maps string keys to string values.
+ Google.
- ?possible with Google + ;)

TokoyoCabinet:
https://fallabs.com/tokyocabinet/
Includes support for locking, ACID transactions, a binary array data type.
+ Speed and efficiency.
- Less known in some areas, e.g. US

Project Voldemort:
https://project-voldemort.com/
An advanced key-value store, written in Java. Provides multi-version concurrency control (MVCC) for updates. Updates to replicas are done asynchronously, so it does not guarantee consistent data.
+ Functionality
- Conistency

MongoDB:
https://www.mongodb.org/
A scalable, high-performance, open source, document-oriented database. Written in C++ Features Replication & High Availability with mirrors across LANs and WANs and Auto-Sharding. Popular in the Ruby on Rails community.
+ Easy installation, good documentation, support.
- Relatively new.

Couch:
http://www.couchdb.org/
Similar to Mongo, aimed at document databases.
+ replication, advanced queries.
- clustering, disk space management.

Cassandra:
https://cassandra.apache.org/
Apache Cassandra is fault-tolerant and decentralized and is used at Netflix, Twitter and Reddit, among others.
+ Cluster and replication.
- More setup knowledge needed.

I can't provide all the references, due to lack of time but hope this at least helps.

like image 98
Michael Durrant Avatar answered Oct 11 '22 13:10

Michael Durrant