I'm evaluating a storage platform for an upcoming project and keep coming back to Cassandra. For this project loosing any amount of data is unacceptable. So far we've used a relational database (Microsoft SQL Server), but the data is so varied and large that it has become an issue to store and query.
Is Cassandra robust enough to use as a primary data store? Or should it only be used to mirror existing data to speed up access?
Anecdotally: yes, Twitter, Digg, Ooyala, SimpleGeo, Mahalo, and others are using or moving to Cassandra for a primary data store (http://n2.nabble.com/Cassandra-users-survey-td4040068.html).
Technically: yes; besides supporting replication (including to multiple datacenters), each Cassandra node has an fsync'd commit log to make sure writes are durable; from there writes are turned into SSTables which are immutable until compaction (which combines multiple SSTables to GC old versions). Snapshotting is supported at any time, including automatic snapshot-before-compaction.
Whether to use Cassandra for your application or not depends purely on your data workloads. Cassandra is optimised for write-intensive workloads, therefore, it is suitable for applications where a large amount of data needs to be inserted (such as infrastructure logging information at Facebook).
If however, you require fast retrievals and insertion speed is not an issue, then perhaps you should have a look at say HBase (which is optimised of read-intensive workloads).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With