
Java Fast Data Storage & Retrieval

Tags:

java

I need to store records in persistent storage and retrieve them on demand. The requirements are as follows:

  1. Extremely fast retrieval and insertion
  2. Each record will have a unique key. This key will be used to retrieve the record
  3. The data stored should be persistent i.e. should be available upon JVM restart
  4. A separate process would move stale records to RDBMS once a day

What do you guys think? I cannot use a standard database because of latency issues. In-memory databases like HSQLDB/H2 have performance constraints. Moreover, the records are simple string objects and do not need SQL. I am thinking of some kind of flat-file-based solution. Any ideas? Any open source projects? I am sure someone must have solved this problem before.

AAK asked Oct 15 '09 14:10


4 Answers

There are a lot of diverse tools and methods, but I think none of them shines in all of the requirements.

For low latency, you can only rely on in-memory data access: disks are physically too slow (and so are SSDs). If the data does not fit in the memory of a single machine, we have to distribute it across several nodes that together have enough memory.
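To illustrate what distributing by key could look like, here is a minimal sketch that picks a node from the hash of the record key. The class name `KeyPartitioner` and the node names are made up for the example. Plain modulo hashing remaps most keys whenever the node list changes, so distributed stores usually use something like consistent hashing instead.

```java
import java.util.List;

// Hypothetical helper (not from any library): maps a record key to one of several nodes.
final class KeyPartitioner {
    private final List<String> nodes;

    KeyPartitioner(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = List.copyOf(nodes);
    }

    // The same key always maps to the same node as long as the node list is unchanged.
    String nodeFor(String key) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(index);
    }

    public static void main(String[] args) {
        // Node names are placeholders for real host addresses.
        KeyPartitioner partitioner = new KeyPartitioner(List.of("node-a", "node-b", "node-c"));
        for (String key : List.of("order-1", "order-2", "order-3")) {
            System.out.println(key + " -> " + partitioner.nodeFor(key));
        }
    }
}
```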

For persistence, we have to write our data to disk after all. With optimal organization this can be done as a background activity that does not affect latency. However, for reliability (failover, HA or whatever), disk operations cannot be totally independent of the access methods: when modifying data, we have to wait for the disk to make sure our operation will not disappear. Concurrency also adds some complexity and latency.
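The trade-off can be sketched with an in-memory map backed by an append-only log file. This is only an illustration, not a production design: the class name `LoggedStore` and the `syncOnWrite` flag are made up for the example. It does not repair a half-written last record or compact the log, and `writeUTF` limits each string to 65535 encoded bytes. With `syncOnWrite` set, every `put` waits for the disk. Without it, the write is only handed to the operating system, which is faster but can lose the latest records on a power failure.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: reads are served from memory, writes are appended to a log file.
final class LoggedStore implements AutoCloseable {
    private final Map<String, String> index = new ConcurrentHashMap<>();
    private final FileOutputStream file;
    private final DataOutputStream log;
    private final boolean syncOnWrite;

    LoggedStore(Path path, boolean syncOnWrite) throws IOException {
        this.syncOnWrite = syncOnWrite;
        if (Files.exists(path)) {
            replay(path); // rebuild the in-memory map after a JVM restart
        }
        this.file = new FileOutputStream(path.toFile(), true); // true = append mode
        this.log = new DataOutputStream(file);
    }

    private void replay(Path path) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path.toFile())))) {
            while (true) {
                String key = in.readUTF();
                String value = in.readUTF();
                index.put(key, value); // later entries overwrite earlier ones
            }
        } catch (EOFException endOfLog) {
            // End of the log reached; an incomplete trailing record is ignored.
        }
    }

    synchronized void put(String key, String value) throws IOException {
        log.writeUTF(key);
        log.writeUTF(value);
        log.flush();
        if (syncOnWrite) {
            file.getFD().sync(); // block until the data is on the storage device
        }
        index.put(key, value);
    }

    String get(String key) {
        return index.get(key); // never touches the disk
    }

    @Override
    public synchronized void close() throws IOException {
        log.close();
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("store", ".log");
        try (LoggedStore store = new LoggedStore(path, true)) {
            store.put("k1", "first record");
        }
        // Reopening simulates a JVM restart: the record is recovered from the log.
        try (LoggedStore reopened = new LoggedStore(path, true)) {
            System.out.println(reopened.get("k1"));
        }
        Files.delete(path);
    }
}
```

Making `put` synchronized keeps log entries from interleaving, which is also where the concurrency cost mentioned above shows up: writers queue behind each other, and with `syncOnWrite` each of them waits for the disk.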

The data model is not a restriction here: most of the methods support access based on a unique key.

We have to decide,

  • if data fits in the memory of one machine, or we have to find distributed solutions,
  • if concurrency is an issue, or there are no parallel operations,
  • if reliability is strict and we cannot lose modifications, or we can live with the fact that an unplanned crash would result in data loss.

Solutions might be

  • self-implemented data structures using the standard Java library, files etc. may not be the best solution, because reliability and low latency require clever implementations and lots of testing,
  • Traditional RDBMSs have a flexible data model, durable, atomic and isolated operations, caching etc. They actually do too much, and are mostly hard to distribute. That is why they are too slow if you cannot turn off the unwanted features, which is usually the case.
  • NoSQL and key-value stores are good alternatives. These terms are quite vague and cover lots of tools. Examples are
    • BerkeleyDB or Kyoto Cabinet as one-machine persistent key-value stores (using B-trees): can be used if the data set is small enough to fit in the memory of one machine,
    • Project Voldemort as a distributed key-value store: uses BerkeleyDB Java Edition inside, simple and distributed,
    • ScalienDB as a distributed key-value store: reliable, but not too slow for writes either,
    • MemcacheDB, Redis, and other caching databases with persistence,
    • popular NoSQL systems like Cassandra, CouchDB, HBase etc.: used mainly for big data.

A list of NoSQL tools can be found e.g. here.

Voldemort's performance tests report sub-millisecond response times, and these can be achieved quite easily; however, we have to be careful with the hardware too (like the network properties mentioned above).

csaba answered Oct 19 '22 03:10


Have a look at LinkedIn's Voldemort.

fvu answered Oct 19 '22 01:10


If all the data fits in memory, MySQL can run in memory instead of from disk (MySQL Cluster, Hybrid Storage). It can then handle storing itself to disk for you.

Dean J answered Oct 19 '22 02:10


What about something like CouchDB?

Mark answered Oct 19 '22 02:10