
Is Bitcask OK as a simple, high-performance file store?

Tags: java, file, xml, riak

I am looking for a simple way to store and retrieve millions of XML files. Currently everything is kept in a filesystem, which has some performance issues.

Our requirements are:

  1. Ability to store millions of XML files in a batch process. Files may be up to a few megabytes in size, but most are in the 100 KB range.
  2. Very fast random lookup by ID (e.g. document URL)
  3. Accessible from both Java and Perl
  4. Available on the major Linux distributions and on Windows

I have looked at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem like overkill for our needs:

  1. No clustering required
  2. No daemon ("service") required
  3. No clever search functionality required

Having delved deeper into Riak, I found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. Unfortunately, there seems to be no way to access a Bitcask repository from Java (or is there?).

So my question boils down to:

  • Is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store and retrieve millions of documents?
  • Are there any viable alternatives to Bitcask available from Java? (Berkeley DB comes to mind...)
  • (For Riak specialists) Does Riak add much overhead in terms of implementation, management and resources compared to "naked" Bitcask?
KoW asked May 15 '11 13:05



2 Answers

I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.

The problem is Bitcask's data file merging process. It involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, that is an enormous amount of data to copy.


Note that the above assumes the XML documents are updated relatively frequently. If updates are rare and/or you can cope with a significant amount of "wasted" space, then merging may only need to be done rarely, or not at all.
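
To put a rough number on the copying, here is a back-of-the-envelope sketch in Java. The figure of 5 million documents is only an assumed stand-in for "millions", and the class and method names are made up for illustration; none of this is part of Bitcask itself.

```java
// Hypothetical helper for estimating merge cost; not part of any Bitcask API.
class MergeCostEstimate {

    // A full merge rewrites every live value once, so the bytes copied
    // are simply (number of live values) x (average value size).
    static long bytesCopiedByMerge(long liveValues, long avgValueBytes) {
        return liveValues * avgValueBytes;
    }

    // Fraction of the data files taken up by superseded (dead) values.
    // Merging only pays off when this fraction is large.
    static double wastedFraction(long liveBytes, long deadBytes) {
        return (double) deadBytes / (liveBytes + deadBytes);
    }

    public static void main(String[] args) {
        long docs = 5_000_000L;       // assumption: "millions" of documents
        long avgSize = 100L * 1024;   // about 100 KB each, as in the question
        long copied = bytesCopiedByMerge(docs, avgSize);
        // Prints: full merge copies ~476 GiB
        System.out.printf("full merge copies ~%d GiB%n",
                copied / (1024L * 1024 * 1024));
    }
}
```

Under these assumptions a single full merge rewrites roughly half a terabyte, which is why the update rate matters so much here.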

Stephen C answered Sep 29 '22 07:09



Bitcask can be appropriate for this case (large values), depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.

Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.

I am not sure about the status of a Java version or wrapper.
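
For illustration only, the model described above (sequential appends plus an in-memory map from key to file position) can be sketched in plain Java. This is a minimal toy, not Bitcask's file format or API: the class name `TinyLogStore` and its methods are hypothetical, it keeps a single file, stores no keys or checksums on disk, and cannot rebuild its index after a restart.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy store: append-only log file plus in-memory index.
// It illustrates the idea only and is NOT compatible with Bitcask.
class TinyLogStore implements AutoCloseable {
    private final RandomAccessFile log;
    // key -> {offset, length} of the most recent value for that key
    private final Map<String, long[]> index = new HashMap<>();
    // bytes occupied by overwritten values; this is the "wasted space"
    private long deadBytes = 0;

    TinyLogStore(String path) throws IOException {
        log = new RandomAccessFile(path, "rw");
    }

    // Always writes at the end of the file, so a batch load is sequential I/O.
    void put(String key, byte[] value) throws IOException {
        long offset = log.length();
        log.seek(offset);
        log.write(value);
        long[] old = index.put(key, new long[] {offset, value.length});
        if (old != null) {
            deadBytes += old[1]; // the previous value is now garbage
        }
    }

    // One in-memory lookup, then one positioned read from the file.
    byte[] get(String key) throws IOException {
        long[] loc = index.get(key);
        if (loc == null) {
            return null;
        }
        byte[] buf = new byte[(int) loc[1]];
        log.seek(loc[0]);
        log.readFully(buf);
        return buf;
    }

    long deadBytes() {
        return deadBytes;
    }

    @Override
    public void close() throws IOException {
        log.close();
    }
}
```

In this sketch `deadBytes()` stays at zero as long as no key is written twice, which mirrors the point that merging is only needed when existing keys are overwritten.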

Eric Brewer answered Sep 29 '22 06:09
