Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Switching from MySQL to Cassandra - Pros/Cons?

For a bit of background - this question deals with a project running on a single small EC2 instance, and is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in python and java, which do the heavy lifting. The same machine is running Apache as well.

The data model looks like the following - a large amount of real time data comes in streamed from various networked sensors, and ideally, I'd like to establish a long-poll approach rather than the current poll every 15 minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.

Relational features I would need -

  • Order by [SliceRange in Cassandra's API seems to satisy this]
  • Group by
  • Manytomany relations between multiple tables [Cassandra SuperColumns seem to do well for one to many]
  • Sphinx on this gives me a nice full text engine, so thats a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]

My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware on it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).

So essentially, after having read a lot about NOSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are,

  • On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - something to the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well). Size of tables is around 4-5 million rows, and there are about 5 such tables.

  • Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But, for a project that is just beginning to grow, does it make sense to deploy a one node cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]

  • If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.

  • Would it make any sense to just use MySQL as a key value store rather than a relational engine, and go with that? That way I could utilize a large number of stable APIs available, as well as a stable engine (and go relational as needed). (Brett Taylor's post from Friendfeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)

Any insights from people who've done a shift would be greatly appreciated!

Thanks.

like image 339
viksit Avatar asked Feb 25 '10 06:02

viksit


People also ask

What are the advantages and disadvantages of Cassandra?

It's open-source. It follows peer-to-peer architecture rather than master-slave architecture, so there isn't a single point of failure. Cassandra can be easily scaled down or up. It features data replication, so it's fault-tolerant and has high availability.

When use Cassandra vs MySQL?

Most businesses use Cassandra for write-heavy workloads in the field of Data Science whereas MySQL is preferred for all other types of workloads. Hopefully, this would give you the knowledge to choose the right database according to your needs.

Can Cassandra replace SQL?

The short answer here, is "NO." Cassandra is not a simple drop-in replacement for a RDBMS, when you suddenly need it to scale. While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs. It fits long-term database needs if you're applying it to the right use case.

When should you not use Cassandra?

Cassandra is based on a NoSQL database and does not provide ACID and relational data properties. If you have a strong requirement for ACID properties (for example Financial data), Cassandra would not be a fit in that case.


1 Answers

Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from sql. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.

However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.

Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.

That said, at a few hundred writes/minute you're going to be fine on mysql for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily) but MySQL is much better at being a relational database. :)

There is no django support for Cassandra (or other nosql database) yet. They are talking about doing something for the next version after 1.2, but based on talking to django devs at pycon, nobody is really sure what that will look like yet.

like image 108
jbellis Avatar answered Sep 28 '22 06:09

jbellis