I am evaluating what might be the best migration option.
Currently, I am on a sharded MySQL setup (horizontally partitioned), with most of my data stored in JSON blobs. I do not have any complex SQL queries (I migrated away from them when I partitioned my database).
Right now, it seems like both MongoDB and Cassandra would be likely options. My situation:
Cassandra can create secondary indexes on columns other than the defined primary key. However, by default Cassandra will reject a query that filters on a column without a secondary index (unless the query explicitly opts in with ALLOW FILTERING, which can be slow), while in MongoDB the query language can filter on non-indexed fields as well.
Hence, the choice between the two depends on how you plan to query the data. If the required data can be accessed using a single primary key, Apache Cassandra would be suitable, but if more complex queries that extract specific values from dynamic data are required, MongoDB should be preferred.
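To make the access-pattern point concrete, here is a minimal in-memory sketch in JavaScript. It is not how either database is implemented; the names (byId, idsByEmail, putUser, getByEmail, scanByCity) and the user fields are hypothetical, chosen only to contrast a primary-key lookup, a hand-maintained lookup table, and a scan over a non-indexed field.

```javascript
// Illustrative model only: all names and fields here are hypothetical.
const byId = new Map();       // primary-key access: one direct lookup
const idsByEmail = new Map(); // hand-maintained lookup table (a "manual index")

function putUser(user) {
  const previous = byId.get(user.id);
  // Keep the lookup table in sync when an existing user's email changes.
  if (previous && previous.email !== user.email) {
    idsByEmail.delete(previous.email);
  }
  byId.set(user.id, user);
  idsByEmail.set(user.email, user.id);
}

function getByEmail(email) {
  const id = idsByEmail.get(email);
  return id === undefined ? undefined : byId.get(id);
}

// No index on city: every stored record has to be examined.
function scanByCity(city) {
  return [...byId.values()].filter((user) => user.city === city);
}

putUser({ id: 1, email: "a@example.com", city: "Oslo" });
putUser({ id: 2, email: "b@example.com", city: "Oslo" });
putUser({ id: 1, email: "c@example.com", city: "Bergen" });
```

The extra write in putUser is the price of the fast read in getByEmail. Keeping such tables consistent is the application's job whenever the data store does not maintain the index for you.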
In sum, Cassandra looks somewhat like a relational database (tables, rows, a SQL-like query language), but it is a wide-column store that groups data by partition key for fast retrieval. MongoDB stores records as JSON-like documents (in a binary format called BSON). It has a JavaScript shell and a rich set of functions, which makes it easy to work with.
Since Cassandra has multi-primary node support, its architecture enables it to handle many simultaneous writes to more than one node. It will generally handle heavy write loads better than MongoDB, which is limited to one writable primary node per replica set; secondary servers can only be used for reads.
Lots of reads in every query, fewer regular writes
Both databases perform well on reads where the hot data set fits in memory. Both also emphasize join-less data models (and encourage denormalization instead), and both provide indexes on documents or rows, although MongoDB's indexes are currently more flexible.
Cassandra's storage engine provides constant-time writes no matter how big your data set grows. Writes are more problematic in MongoDB, partly because of the B-tree based storage engine, but more because of the multi-granularity locking it does.
For analytics, MongoDB provides a custom map/reduce implementation; Cassandra provides native Hadoop support, including for Hive (a SQL data warehouse built on Hadoop map/reduce) and Pig (a Hadoop-specific analysis language that many think is a better fit for map/reduce workloads than SQL). Cassandra also supports use of Spark.
Not worried about "massive" scalability
If you're looking at a single server, MongoDB is probably a better fit. For those more concerned about scaling, Cassandra's no-single-point-of-failure architecture will be easier to set up and more reliable. (MongoDB's global write lock tends to become more painful, too.) Cassandra also gives a lot more control over how your replication works, including support for multiple data centers.
More concerned about simple setup, maintenance and code
Both are trivial to set up, with reasonable out-of-the-box defaults for a single server. Cassandra is simpler to set up in a multi-server configuration since there are no special-role nodes to worry about.
If you're presently using JSON blobs, MongoDB is an insanely good match for your use case, given that it uses BSON to store the data. You'll be able to have richer and more queryable data than you would in your present database. This would be the most significant win for Mongo.
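As a rough sketch of what moving a blob row into a document could look like, assuming a hypothetical MySQL row shape of { id, shard, data } where data holds the JSON text (the question does not give the real schema):

```javascript
// Hypothetical row shape: { id, shard, data }, with data holding a JSON string.
function rowToDocument(row) {
  const fields = JSON.parse(row.data);
  // Reuse the existing key as the document's _id; it is spread last so that
  // a stray "_id" inside the blob cannot override it.
  return { ...fields, _id: row.id };
}

const row = {
  id: 42,
  shard: 3,
  data: '{"LastName":"Smith","Groups":["Admin","User"]}',
};

const doc = rowToDocument(row);
console.log(doc.LastName); // Smith
```

Once the blob's fields are top-level document fields, they can be queried and indexed directly instead of being opaque to the database.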
I've used MongoDB extensively (for the past 6 months), building a hierarchical data management system, and I can vouch for both the ease of setup (install it, run it, use it!) and the speed. As long as you think about indexes carefully, it can absolutely scream along, speed-wise.
I gather that Cassandra, due to its use with large-scale projects like Twitter, has better scaling functionality, although the MongoDB team is working on parity there. I should point out that I've not used Cassandra beyond the trial-run stage, so I can't speak for the detail.
The real swinger for me, when we were assessing NoSQL databases, was the querying - Cassandra is basically just a giant key/value store, and querying is a bit fiddly (at least compared to MongoDB), so for performance you'd have to duplicate quite a lot of data as a sort of manual index. MongoDB, on the other hand, uses a "query by example" model.
For example, say you've got a Collection (MongoDB parlance for the equivalent of an RDBMS table) containing Users. MongoDB stores records as Documents, which are basically binary JSON objects, e.g.:
{ FirstName: "John", LastName: "Smith", Email: "[email protected]", Groups: ["Admin", "User", "SuperUser"] }
If you wanted to find all of the users called Smith who have Admin rights, you'd just create a new document (at the admin console using JavaScript, or in production using the language of your choice):
{ LastName: "Smith", Groups: "Admin" }
...and then run the query. That's it. There are added operators for comparisons, RegEx filtering etc, but it's all pretty simple, and the Wiki-based documentation is pretty good.
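To show what "query by example" means for the document above, here is a small in-memory matcher in JavaScript. It only imitates the matching rule for plain equality (including the rule that a scalar in the example matches an array field containing it); it is not MongoDB's implementation, and matchesExample and findByExample are made-up helper names.

```javascript
// Illustrative only: imitates simple equality matching, not MongoDB itself.
function matchesExample(doc, example) {
  return Object.entries(example).every(([field, wanted]) => {
    const actual = doc[field];
    // An array field matches when any of its elements equals the wanted value.
    return Array.isArray(actual) ? actual.includes(wanted) : actual === wanted;
  });
}

function findByExample(collection, example) {
  return collection.filter((doc) => matchesExample(doc, example));
}

const users = [
  { FirstName: "John", LastName: "Smith", Groups: ["Admin", "User", "SuperUser"] },
  { FirstName: "Jane", LastName: "Smith", Groups: ["User"] },
  { FirstName: "Ann", LastName: "Jones", Groups: ["Admin"] },
];

const admins = findByExample(users, { LastName: "Smith", Groups: "Admin" });
console.log(admins.map((user) => user.FirstName)); // [ 'John' ]
```

In the real shell, the same example document would be passed to find, e.g. db.users.find({ LastName: "Smith", Groups: "Admin" }), assuming the collection is named users.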