
Is MongoDB or Cassandra better than MySQL for large datasets?

In our (currently MySQL) database there are over 120 million records, and we make frequent use of complex JOIN queries and application-level logic in PHP that touch the database. We're a marketing company that does data mining as our primary focus, so we have many large reports that need to be run on a daily, weekly, or monthly basis.

Concurrently, customer service operates on a replicated slave of the same database.

We would love to be able to make these reports happen in real time on the web instead of having to manually generate spreadsheets for them. However, many of our reports take a significant amount of time to pull data for (in some cases, over an hour).

We do not operate in the cloud, choosing instead to operate using two physical servers in our server room.

Given all this, what is our best option for a database?

Ben Overmyer asked Dec 15 '11 14:12


2 Answers

I think you're going about the problem the wrong way.

Simply dropping in a NoSQL database will not automatically give you better performance. At the lowest level, you're writing and retrieving a fair chunk of data, which implies that your bottleneck is most likely HDD I/O (the usual bottleneck).

Sticking with your current hardware and a monolithic data store isn't scalable and, as you noticed, it has implications when you want to do something in real time.

What are your options? You need to scale your server and software setup, which you would have to do with any NoSQL store anyway (at some point you end up putting in faster hard drives). You might also want to look into alternative storage engines beyond MyISAM and InnoDB; for example, TokuDB is one of the better engines, and it seemingly turns random I/O into sequential I/O.

A faster disk subsystem would also help (FusionIO, if you have the resources to get it).

Without more information on your end (what the server setup is, what MySQL version you're using and what storage engines + data sizes you're operating with), it's all speculation.
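For anyone wanting to collect the details asked for above (MySQL version, storage engines, data sizes), here is a minimal sketch. It assumes a Python DB-API cursor that returns tuples and is already connected to the MySQL database in question; the helper name `summarize_server` is hypothetical. `VERSION()`, `DATABASE()` and `information_schema.TABLES` are standard MySQL, and note that `TABLE_ROWS` is only an estimate for InnoDB tables.

```python
# Hypothetical helper: gathers the MySQL version plus per-table engine and size.
# Assumes `cursor` is a DB-API cursor (tuple rows) connected to the target schema.

VERSION_SQL = "SELECT VERSION()"

TABLES_SQL = """
SELECT TABLE_NAME, ENGINE, TABLE_ROWS, DATA_LENGTH, INDEX_LENGTH
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
ORDER BY DATA_LENGTH DESC
"""


def summarize_server(cursor):
    """Return human-readable lines describing the server and its tables."""
    cursor.execute(VERSION_SQL)
    version = cursor.fetchone()[0]

    cursor.execute(TABLES_SQL)
    lines = ["MySQL version: {}".format(version)]
    for name, engine, rows, data_len, index_len in cursor.fetchall():
        # Lengths are reported in bytes and may be NULL for some table types.
        data_mib = (data_len or 0) / 1024 ** 2
        index_mib = (index_len or 0) / 1024 ** 2
        lines.append(
            "{}: engine={}, ~{} rows, data={:.1f} MiB, indexes={:.1f} MiB".format(
                name, engine, rows, data_mib, index_mib
            )
        )
    return lines
```

Posting the resulting lines alongside the slow queries gives answerers something concrete to work with.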

N.B. answered Oct 05 '22 23:10


Cassandra still needs Hadoop for MapReduce, and MongoDB has limited concurrency with regard to MapReduce...

... so ...

... 120 million records is not that much, and MySQL should easily be able to handle that. My guess is an I/O bottleneck, or you're doing lots of random reads instead of sequential reads. I'd rather hire a MySQL techie for a month or so to tune your schema and queries, instead of investing in a new solution.
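As a rough illustration of what tuning schema and queries can mean, the sketch below compares a query plan before and after adding an index. It uses Python's built-in `sqlite3` as a stand-in so it runs anywhere; it is not MySQL, and the `orders` table, its columns, and the index name are all hypothetical. On MySQL the equivalent check would be done with `EXPLAIN`.

```python
import sqlite3

# SQLite stands in for MySQL here; table, column and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10000)],
)

REPORT_QUERY = "SELECT SUM(amount) FROM orders WHERE customer_id = ?"


def query_plan(connection, sql, params):
    """Return the 'detail' column of EXPLAIN QUERY PLAN joined into one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return "; ".join(row[3] for row in rows)


# Without an index on customer_id, the engine has to scan the whole table.
plan_before = query_plan(conn, REPORT_QUERY, (42,))

# With the index, it can look up only the matching rows.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, REPORT_QUERY, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan reports a full table scan and the second reports a search using the new index. On a 120 million row table, that is the kind of difference worth checking for each slow report query before considering a different database.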

If you provide more information about your cluster, we might be able to help you better. "NoSQL" by itself is not the solution to your problem.

Mario answered Oct 06 '22 01:10