
NoSQL or MySQL for Data Analytics

Tags: mysql, nosql, hive

We have a cluster (Hadoop, Pig) that churns through 350 GB of data (growing by a couple of GB a week).

All of this data needs to be made available for analytics.

We have a MySQL solution with a star schema (only part of the data is loaded into it), but the concern is how far one can stretch this.

Should I be looking at NoSQL, like Hive, for data analytics?

I read this article http://anders.com/cms/282/Distributed.Data/Hadoop/Hbase/Hive

How big is big data, and when should I be looking away from MySQL? Will the structural rigidity of MySQL cause problems?

Currently the data in MySQL is only a few GB, but it will certainly grow. How about MySQL clustering?

Should I be going down this path at all?

AlgoMan asked Oct 15 '11 21:10




4 Answers

"350Gb (growing couple of GB a week)... All these data need to be made available for Analytics"

Do you have MySQL gurus in house? If yes, sure => just create and grow that MySQL cluster. The problem with this solution is not that it is MySQL, and not that it is not NoSQL => it is that it requires an expert to set it up and to always be by your side in case it needs to be changed. But guess what => SQL is MUCH better and simpler for analytics than a map/reduce-ish SQL simulation.
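To make the "SQL is simpler" point concrete, here is a minimal sketch that computes the same aggregate twice: once as a single SQL query over a tiny star schema, and once as hand-written map and reduce steps. Everything in it is an assumption for illustration: the table names (`fact_sales`, `dim_product`), the sample rows, and the use of an in-memory SQLite database as a stand-in for MySQL.

```python
import sqlite3
from collections import defaultdict


def build_db():
    """Create a toy star schema in memory (SQLite standing in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE fact_sales (product_id INTEGER, amount INTEGER);
        INSERT INTO dim_product VALUES (1, 'books'), (2, 'books'), (3, 'games');
        INSERT INTO fact_sales VALUES (1, 10), (2, 5), (3, 20), (1, 7), (3, 1);
    """)
    return conn


def totals_sql(conn):
    """The SQL way: the join and the aggregation are one declarative query."""
    rows = conn.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_id = f.product_id
        GROUP BY d.category
    """).fetchall()
    return dict(rows)


def totals_mapreduce(conn):
    """The map/reduce-ish way: the join and the grouping are written by hand."""
    # Load the dimension table into a lookup dict (a manual map-side join).
    category_of = dict(conn.execute("SELECT product_id, category FROM dim_product"))
    facts = conn.execute("SELECT product_id, amount FROM fact_sales")
    # Map: emit one (category, amount) pair per fact row.
    pairs = ((category_of[pid], amount) for pid, amount in facts)
    # Reduce: sum the amounts per category key.
    totals = defaultdict(int)
    for category, amount in pairs:
        totals[category] += amount
    return dict(totals)
```

Both functions return the same totals per category. The difference is that in the second one the join strategy and the grouping are your code to write and maintain, and that gap widens as the analytics questions get more involved.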

Something that can become a problem later with the MySQL solution is Oracle (which owns MySQL). So make sure you understand which features of MySQL you can use for free, and which features you would have to pay for.

If you do not have a MySQL expert in house, or you would not like to pay for one, you can definitely turn to NoSQL. That does not mean you would not need NoSQL product expertise, but configuring and running X nodes as a single system is an extremely simple and natural process for NoSQL solutions.

For example, in Riak, and a couple of other NoSQL beasts, most of the distribution complexities are solved by the product without you needing to do anything at all => it really is that simple.

The price you pay with NoSQL is losing SQL (think about the nice aggregating features) and consistency, which becomes eventual. If you are strictly doing analytics, consistency may not be a price at all.

In return you get very natural Big Data handling, fault tolerance and much more.

If you are in the Hadooooxyz space and are okay with paying, take a look at Hadapt, which promises 5 times Hive performance.

tolitius answered Oct 21 '22 20:10


The question is of course now many months old, but I recently came across InfiniDB, which puts a MySQL front end on a highly scalable, MapReduce-based Big Data engine aimed specifically at analytics. It may be a solution for this problem: in principle it should drop in and require very little administration and few code changes. Scaling up on one box or out across multiple servers is supported.

drive-by poster answered Oct 21 '22 19:10


You switch when you start having the kinds of problems outlined in something like this comparative question: https://dba.stackexchange.com/questions/5/what-are-the-differences-between-nosql-and-a-traditional-rdbms

Other than that, it's a little difficult to answer the question beyond general advice, because you don't pose a specific problem that you are trying to solve (e.g. scaling, read speed, the problems with requiring 100% consistency, etc.).

jefflunt answered Oct 21 '22 21:10


InfiniDB is not free.

Check out http://code.google.com/p/shard-query

This is like Map-Reduce over a sharded, shared-nothing set of databases. It works great for star schemas: shard the fact table over N nodes and duplicate the dimension tables on each server.

You can check out this blog post for more info and performance testing results:

http://www.mysqlperformanceblog.com/2011/05/06/scale-out-mysql/
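The sharding layout described above can be sketched roughly as follows. This is not Shard-Query's code or API, only an illustration of the general idea under stated assumptions: in-memory SQLite databases stand in for the N MySQL servers, and the table names, sample rows, and round-robin placement of fact rows are all hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical sample data for a tiny star schema.
DIM_ROWS = [(1, "books"), (2, "books"), (3, "games")]
FACT_ROWS = [(1, 10), (2, 5), (3, 20), (1, 7), (3, 1), (2, 4)]

# Every shard runs the same query against its local slice of the fact table.
PARTIAL_SQL = """
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
"""


def make_shards(n):
    """Create n independent databases, each holding a full copy of the dimension table."""
    shards = []
    for _ in range(n):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
        conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount INTEGER)")
        conn.executemany("INSERT INTO dim_product VALUES (?, ?)", DIM_ROWS)
        shards.append(conn)
    # Spread the fact rows over the shards round-robin (shared nothing).
    for i, row in enumerate(FACT_ROWS):
        shards[i % n].execute("INSERT INTO fact_sales VALUES (?, ?)", row)
    return shards


def sharded_totals(shards):
    """Scatter the query to every shard, then merge the partial sums."""
    merged = Counter()
    for conn in shards:  # sequential here; in practice each shard could run in parallel
        for category, subtotal in conn.execute(PARTIAL_SQL):
            merged[category] += subtotal
    return dict(merged)
```

Because the dimension table is duplicated on every shard, each join is purely local and only the small partial results have to be combined. Summing partial sums works because SUM is decomposable; an aggregate such as AVG would need each shard to return a sum and a count instead.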

FYI: I'm the author of Shard-Query.

Justin Swanhart answered Oct 21 '22 19:10