
Is Hadoop a good candidate for use as a key-value store?

Question

Would Hadoop be a good candidate for the following use case:

  • Simple key-value store (primarily needs to GET and SET by key)
  • Very small "rows" (32-byte key-value pairs)
  • Heavy deletes
  • Heavy writes
  • On the order of 100 million to 1 billion key-value pairs
  • Majority of data can be contained on SSDs (solid state drives) instead of in RAM.

More info

The reason I ask is because I keep seeing references to the Hadoop file system and how Hadoop is used as the foundation for a lot of other database implementations that aren't necessarily designed for Map-Reduce.

Currently, we are storing this data in Redis. Redis performs great, but since it keeps all of its data in RAM, we have to use expensive machines with upwards of 128 GB of RAM. It would be nice to instead use a system that relies on SSDs; that way we would have the freedom to build much bigger hash tables.
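For concreteness, the access pattern is essentially just SET/GET/DEL on short binary keys, roughly like this (a minimal redis-py sketch; the key/value layout and connection details are illustrative, not our exact schema):

```python
import os
import redis

# Connect to the existing Redis instance (host/port are illustrative).
r = redis.Redis(host="localhost", port=6379)

key = os.urandom(16)    # 16-byte key
value = os.urandom(16)  # 16-byte value -> roughly 32 bytes per pair

r.set(key, value)       # heavy writes
stored = r.get(key)     # point lookups by key
r.delete(key)           # heavy deletes
```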

We have also stored this data using Cassandra, but Cassandra tends to "break" if the deletes become too heavy.

asked Sep 23 '14 by Chris Dutrow


1 Answer

Hadoop (contrary to popular media opinion) is not a database. What you describe is a database workload, so Hadoop itself is not a good candidate for you. Also, the answer below is opinionated, so feel free to prove me wrong with benchmarks.

If you care about "NoSql DB's" that are on top of Hadoop:

  • HBase would be suited for heavy writes, but sucks on huge deletes
  • Cassandra same story, but writes are not as fast as in HBase
  • Accumulo might be useful for very frequent updates, but will suck on deletes as well

None of them makes "real" use of SSDs; I don't think any of them gets a huge speedup from them.

All of them suffer from costly compactions once you start to fragment your tablets (in BigTable terminology), so deletes are a fairly obvious limiting factor.

What you can do to mitigate the deletion issue is to just overwrite with a constant "deleted" value, which works around the compaction. However, this grows your table, which can be costly on SSDs as well. You will also need to filter out those markers when reading, which likely affects read latency.
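A minimal sketch of that pattern, assuming a hypothetical key-value client with put()/get() (the client interface and the sentinel value are illustrative):

```python
# Constant sentinel marking "deleted"; pick a value that can never appear as real data.
TOMBSTONE = b"\x00" * 16

def logical_delete(kv, key: bytes) -> None:
    # Overwrite instead of deleting, so no delete markers/compaction work are triggered.
    kv.put(key, TOMBSTONE)

def read(kv, key: bytes):
    value = kv.get(key)
    # Filter out logically deleted entries at read time.
    if value is None or value == TOMBSTONE:
        return None
    return value
```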

From what you describe, Amazon's DynamoDB architecture sounds like the best candidate here, although deletes are also costly there, maybe not as much as in the alternatives above.

BTW: the recommended way of deleting lots of rows from the tables in any of the above databases is to just delete the table completely. If you can fit your design into this paradigm, any of those will do.
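One way to fit that paradigm is to bucket the data into per-period tables, so bulk expiry becomes "drop the oldest table" instead of millions of row deletes. A rough sketch, assuming a hypothetical client/admin API with put(), list_tables(), and drop_table() (all names are illustrative):

```python
import time

BUCKET_SECONDS = 24 * 3600  # one table per day


def table_for(ts: float) -> str:
    # Derive the table name from the timestamp's day bucket.
    return "kv_%d" % (int(ts) // BUCKET_SECONDS)


def write(client, key: bytes, value: bytes) -> None:
    client.put(table_for(time.time()), key, value)


def expire_old_buckets(client, keep_days: int = 7) -> None:
    # Dropping a whole table is cheap compared to deleting rows one by one.
    oldest_allowed = int(time.time()) // BUCKET_SECONDS - keep_days
    for table in client.list_tables():
        if table.startswith("kv_") and int(table.split("_")[1]) < oldest_allowed:
            client.drop_table(table)
```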

answered Oct 29 '22 by Thomas Jungblut