
A good blobstore / memcache solution

We are setting up a data warehousing and mining project on a Linux cloud server. The primary language is Python.

We would like to use this pattern for querying and storing data:

  • SQL Database - The SQL database is used to query the data. However, it stores only the fields that need to be searched on; it does NOT store the "blob" of data itself. Instead, it stores a key that references the full "blob" of data in a key-value Blobstore.
  • Blobstore - A key-value Blobstore is used to store actual "documents" or "blobs" of data.
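The split described in the two points above can be sketched as follows. This is only a minimal illustration, assuming SQLite as the SQL database and a plain dict as a stand-in for the real blobstore; the table name, the `author` and `year` columns, and both function names are hypothetical.

```python
import sqlite3
import uuid

# Hypothetical stand-in for a real key-value blobstore (a plain dict here).
blobstore = {}

# The SQL side holds only the searchable fields plus the blob key.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE documents (blob_key TEXT PRIMARY KEY, author TEXT, year INTEGER)"
)


def store_document(author, year, blob):
    """Write the blob to the blobstore and index its searchable fields in SQL."""
    key = uuid.uuid4().hex
    blobstore[key] = blob
    db.execute(
        "INSERT INTO documents (blob_key, author, year) VALUES (?, ?, ?)",
        (key, author, year),
    )
    db.commit()
    return key


def find_documents(author):
    """Query SQL for matching keys, then fetch each full blob by key."""
    rows = db.execute(
        "SELECT blob_key FROM documents WHERE author = ? ORDER BY year",
        (author,),
    ).fetchall()
    return [blobstore[key] for (key,) in rows]
```

The SQL query never touches the blob bytes; it only returns keys, and the blobs are fetched in a second step.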

The issue we are having is that we would like the more frequently accessed blobs to be kept in RAM automatically. We were planning to use Redis for this. However, we would like a solution that automatically tries to get the data out of RAM first and goes to the blobstore only if it can't find it there.

Is there a good library or ready-made solution for this that we can use without rolling our own? Any comments and criticisms about the proposed architecture would also be appreciated.
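The lookup order described here is commonly called a read-through (or cache-aside) cache. A minimal sketch of the idea, assuming an in-process LRU cache built on `OrderedDict` in place of Redis, and a hypothetical `backing_get` callable that stands in for the blobstore read:

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve blobs from a bounded in-memory LRU cache, falling back to a slower store."""

    def __init__(self, backing_get, max_items=1024):
        # backing_get is a hypothetical callable: key -> bytes (KeyError if missing).
        self._backing_get = backing_get
        self._max_items = max_items
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._backing_get(key)  # slow path: go to the blobstore
        self._cache[key] = value
        if len(self._cache) > self._max_items:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return value
```

For example, `ReadThroughCache(store.__getitem__, max_items=2)` wraps a dict named `store`. With Redis as the RAM tier, the `OrderedDict` would be replaced by calls to a Redis client, and eviction would be left to the server's own memory policy.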

Thanks so much!

Chris Dutrow asked Oct 07 '22 07:10



2 Answers

Rather than using Redis or Memcached for caching, plus a "blobstore" package to store things on disk, I would suggest having a look at Couchbase Server, which does exactly what you want (i.e. serving hot blobs from memory, but still storing them to disk).

In the company I work for, we commonly use the pattern you described (i.e. indexing in a relational database, plus blob storage) for our archiving servers (terabytes of data). It works well when the I/O done to write the blobs is kept sequential. The blobs are never rewritten, but simply appended at the end of a file (which is fine for an archiving application).

The same approach has also been used by others. For instance:

  • Bitcask (used in Riak): http://downloads.basho.com/papers/bitcask-intro.pdf
  • Eblob (used in Elliptics project): http://doc.ioremap.net/eblob:eblob
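The append-only write pattern mentioned above can be sketched roughly as follows. This is an illustration only, not how Bitcask or Eblob are implemented: the class name is hypothetical, and the (offset, length) index is kept in a plain dict here, whereas an archiving server would persist it, for example in the relational database next to the searchable fields.

```python
import os
import tempfile


class AppendOnlyBlobFile:
    """Append blobs to one file; remember (offset, length) per key for reads."""

    def __init__(self, path):
        self._path = path
        self._index = {}  # key -> (offset, length); in-memory for this sketch
        open(path, "ab").close()  # create the file if it does not exist

    def put(self, key, blob):
        # Blobs are only ever appended, so writes stay sequential.
        with open(self._path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(blob)
        self._index[key] = (offset, len(blob))

    def get(self, key):
        offset, length = self._index[key]
        with open(self._path, "rb") as f:
            f.seek(offset)
            return f.read(length)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "blobs.dat")
    demo = AppendOnlyBlobFile(demo_path)
    demo.put("k1", b"hello")
    print(demo.get("k1"))
```

Because existing bytes are never rewritten, a read needs only one seek and one read once the offset is known.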
Didier Spezia answered Oct 08 '22 21:10



Any SQL database will work for the first part. The blobstore could also be obtained essentially "off the shelf" by using cbfs. This is a new project, built on top of Couchbase 2.0, but it seems to be in pretty active development.

Couchbase already tries to serve results out of its RAM cache before checking disk, and is fully distributed to support large data sets.

CBFS puts a filesystem on top of that, and a FUSE module has already been written for it.

Since filesystems are effectively the lowest common denominator, it should be really easy for you to access it from Python, and it would reduce the amount of custom code you need to write.
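To illustrate why a filesystem interface keeps the Python side small, here is a minimal sketch that treats a mounted directory as the blobstore. Nothing in it is specific to CBFS: it assumes the FUSE mount is available at some directory (a path such as `/mnt/cbfs` would be a hypothetical example), and the function names are made up for illustration.

```python
from pathlib import Path


def write_blob(root, key, data):
    """Store a blob as a regular file under the mounted directory."""
    path = Path(root) / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)


def read_blob(root, key):
    """Read a blob back by key; raises FileNotFoundError if it is missing."""
    return (Path(root) / key).read_bytes()
```

The key doubles as a relative path, so the SQL database only needs to store that string.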

Blog post: http://dustin.github.com/2012/09/27/cbfs.html

Project Repository: https://github.com/couchbaselabs/cbfs

nirvana answered Oct 08 '22 19:10
