Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dynamically horizontal scalable key value store

Is there a key value store that will give me the following:

  • Allow me to simply add and remove nodes and will redstribute the data automatically
  • Allow me to remove nodes and still have 2 extra data nodes to provide redundancy
  • Allow me to store text or images up to 1GB in size
  • Can store small size data up to 100TB of data
  • Fast (so will allow queries to be performed on top of it)
  • Make all this transparent to the client
  • Works on Ubuntu/FreeBSD or Mac
  • Free or open source

I basically want something I can use a "single", and not have to worry about having memcached, a db, and several storage components so yes, I do want a database "silver bullet" you could say.

Thanks

Zubair

Answers so far: MogileFS on top of BackBlaze - As far as I can see this is just a filesystem, and after some research it only seems to be appropriate for large image files

Tokyo Tyrant - Needs lightcloud. This doesn't auto scale as you add new nodes. I did look into this and it seems it is very fast for queries which fit onto a single node though

Riak - This is one I am looking into myself, but I don't have any results yet

Amazon S3 - Is anyone using this as their sole persistance layer in production? From what I have seen it seems to be used for storage of images as complex queries are too expensive

@shaman suggested Cassandra - definitely one I am looking into

So far it seems that there is no database or key value store that fulfills the criteria I mentioned, not even after offering a bounty of 100 points did the question get answered!

like image 205
yazz.com Avatar asked Jan 19 '10 09:01

yazz.com


People also ask

What is horizontal scaling database?

What is Horizontal Scaling? Horizontal scaling, also known as scale-out, refers to bringing on additional nodes to share the load. This is difficult with relational databases due to the difficulty in spreading out related data across nodes.

What is the most scalable database?

Apache Cassandra is the most established scalable massive database. It an Open source NoSQL key-value database that provides low latency, it is fault tolerant(using replicas), scalable and decentralized; meaning it does not follow a master-slave pattern to provide high availability.

Why MongoDB is scalable?

Why is MongoDB scalable? As a NoSQL database, MongoDB is scalable as its data is not coupled relationally. Data is stored as JSON-like documents which are self-contained. This allows those documents to be easily distributed across multiple nodes through horizontal scaling.

What makes a database scalable?

What Makes a Scalable Database? Database scalability is a concept in analytics database design that emphasizes the capability of a database to handle growth in the amount of data and users. In the modern applications sphere, two types of workloads have emerged – namely analytical and transactional workloads.


4 Answers

You are asking too much from open source software.

If you have a couple hundred thousand dollars in your budget for some enterprise class software, there are a couple of solutions. Nothing is going to do what you want out of the box, but there are companies that have products which are close to what you are looking for.

"Fast (so will allow queries to be performed on top of it)"

If you have a key-value store, everything should be very fast. However the problem becomes that without an ontology or data schema built on top of the key-value store, you will end up going through the whole database for each query. You need an index containing the key for each "type" of data you want to store.

In this case, you can usually perform queries in parallel against all ~15,000 machines. The bottleneck is that cheap hard drives cap out at 50 seeks per second. If your data set fits in RAM, your performance will be extremely high. However, if the keys are stored in RAM but there is not enough RAM for the values to be stored, the system will goto disc on almost all key-value lookups. The keys are each located at random positions on the drive.

This limits you to 50 key-value lookups per second per server. Whereas when the key-value pairs are stored in RAM, it is not unusual to get 100k operations per second per server on commodity hardware (ex. Redis).

Serial disc read performance is however extremely high. I have seek drives goto 50 MB/s (800 Mb/s) on serial reads. So if you are storing values on disc, you have to structure the storage so that the values that need to be read from disc can be read serially.

That is the problem. You cannot get good performance on a vanilla key-value store unless you either store the key-value pairs completely in RAM (or keys in RAM with values on SSD drives) or if you define some type of schema or type system on top of the keys and then cluster the data on disc so that all keys of a given type can be retrieved easily through a serial disc read.

If a key has multiple types (for example if you have data-type inheritance relationships in the database), then the key will be an element of multiple index tables. In this case, you will have to make time-space trade offs to structure the values so that they can be read serially from disc. This entails storing redundant copies of the value for the key.

What you want is going to be a bit more advanced than a key-value store, especially if you intend to do queries. The problem of storing large files however is a non-problem. Pretend your system can keys upto 50 meg. Then you just break up a 1 gig file into 50 meg segments and associate a key to each segment value. Using a simple server it is straight forward to translate the part of the file you want into a key-value lookup operation.

The problem of achieving redundancy is more difficult. Its very easy to "fountain code" or "part file" the key-value table for a server, so that the server's data can be reconstructed at wire speed (1 Gb/s) onto a standby server, if a particular server dies. Normally, you can detect server death using a "heart beat" system which is triggered if the server does not respond for 10 seconds. It is even possible to key-value lookups against the part-file encoded key-value tables, but it is inefficient to do so but still gives you a backup for the event of server failure. A bigger issues it is almost impossible to keep the backup up to date and the data may be 3 minutes old. If you are doing lots of writes, the backup functionality is going to introduce some performance overhead, but the overhead will be negligible if your system is primarily doing reads.

I am not an expert on maintaining database consistency and integrity constraints under failure modes, so I am not sure what problems this requirement would introduce. If you do not have to worry about this, it greatly simplifies the design of the system and its requirements.

Fast (so will allow queries to be performed on top of it)

First, forget about joins or any operation that scales faster than n*log(n) when your database is this large. There are two things you can do to replace the functionality normally implemented with joins. You can either structure the data so that you do not need to do joins or you can "pre-compile" the queries you are doing and make a time-space trade-off and pre-compute the joins and store them for lookup in advance.

For semantic web databases, I think we will be seeing people pre-compiling queries and making time-space trade-offs in order to achieve decent performance on even modestly sized datasets. I think that this can be done automatically and transparently by the database back-end, without any effort on the part of the application programmer. However we are only starting to see enterprise databases implementing these techniques for relational databases. No open source product does it as far as I am aware and I would surprised if anyone is trying to do this for linked data in horizontally scalable databases yet.

For these types of systems, if you have extra RAM or storage space the best use of it is to pre-compute and store the result of common sub-queries for performance reasons, instead of adding more redundancy to the key-value store. Pre-compute results and order by the keys you are going to query against to turn an n^2 join into a log(n) lookup. Any query or sub-query that scales worse than n*log(n) is something whose results need to be performed and cached in the key-value store.

If you are doing a large number of writes, the cached sub-queries will be invalidated faster than they can be processed and there is no performance benefit. Dealing with cache invalidation for cached sub-queries is another intractable problem. I think a solution is possible, but I have not seen it.

Welcome to hell. You should not expect to get a system like this for free for another 20 years.

So far it seems that there is no database or key value store that fulfills the criteria I mentioned, not even after offering a bounty of 100 points did the question get answered!

You are asking for a miracle. Wait 20 years until we have open source miracle databases or you should be willing to pay money for a solution customized to your application's needs.

like image 84
HaltingState Avatar answered Oct 30 '22 22:10

HaltingState


Amazon S3 is a storage solution, not a database.

If you only need simple key/value your best bet would be to use Amazon SimpleDB in combination with S3. Large files are stored on S3, while meta data for searching is stored in SimpleDB. this gives you a horizontally scalable key/value system with direct access to S3.

like image 38
Great Turtle Avatar answered Oct 30 '22 20:10

Great Turtle


There's another solution, which seems to be exactly what you are looking for: The Apache Cassandra project: http://incubator.apache.org/cassandra/

At the moment twitter is switching to Cassandra from memcached+mysql cluster

like image 30
Illarion Kovalchuk Avatar answered Oct 30 '22 20:10

Illarion Kovalchuk


HBase and HDFS together fulfil most of these requirements. HBase can be used to store and retrieve small objects. HDFS can be used to store large objects. HBase compacts small objects and stores them as larger ones on HDFS. Speed is relative - HBase is not as fast on random reads from disk as mysql (for example) - but is pretty fast serving reads from memory (similar to Cassandra). It has excellent write performance. HDFS, the underlying storage layer, is fully resilient to loss of multiple nodes. It replicates across racks as well allowing rack level maintenance. It's a Java based stack with Apache license - runs pretty much most OS.

The main weaknesses of this stack are less than optimal random disk read performance and lack of cross data center support (which is a work in progress).

like image 26
Joydeep Sen Sarma Avatar answered Oct 30 '22 20:10

Joydeep Sen Sarma