We have a cluster (Hadoop, Pig) that churns through 350 GB of data (growing by a couple of GB a week).
All of this data needs to be made available for analytics.
We have a MySQL solution with a star schema (only part of the data is loaded into it), but
the concern is how far one can stretch this.
Should I be looking at NoSQL, like Hive, for data analytics?
I read this article http://anders.com/cms/282/Distributed.Data/Hadoop/Hbase/Hive
How big is big data, and when should I be looking away from MySQL? Will the structural rigidity of MySQL cause problems?
Currently the data in MySQL is only a few GB, but it will certainly grow. How about MySQL clustering?
Should I be going down this path at all?
NoSQL seems to work better on data that is unstructured or unrelated. The better solutions are the crossover databases that have elements of both NoSQL and SQL. RDBMSs that use SQL are schema-oriented, which means the structure of the data should be known in advance so that the data adheres to the schema.
NoSQL databases like MongoDB can have an advantage over SQL when dealing with big data because of their flexible schema requirements. However, SQL databases have traditionally been favored by most data managers for data analysis, especially because many BI tools (e.g. Looker) expect a SQL interface and will not let you query NoSQL databases directly.
Storage of large volumes of unstructured data: a NoSQL database can store large data sets of mixed types, limited by cluster capacity rather than by a fixed schema. A document-based database such as MongoDB also lets you change a field's type on the go, so there is no need to define data types in advance.
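To make the schema point concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for MySQL and plain JSON strings as a stand-in for a document store, so the table name `users` and the field `twitter` are made up for illustration only.

```python
import json
import sqlite3

# Relational side: the structure is declared before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'ann')")

# A row with an undeclared field is rejected until the schema is changed.
try:
    conn.execute("INSERT INTO users (id, name, twitter) VALUES (2, 'bob', '@bob')")
    schema_rejected = False
except sqlite3.OperationalError:
    schema_rejected = True

# After an explicit schema change, the same insert is accepted.
conn.execute("ALTER TABLE users ADD COLUMN twitter TEXT")
conn.execute("INSERT INTO users (id, name, twitter) VALUES (2, 'bob', '@bob')")

# Document side (hypothetical stand-in): each record carries its own fields.
documents = [
    json.dumps({"id": 1, "name": "ann"}),
    json.dumps({"id": 2, "name": "bob", "twitter": "@bob"}),
]
fields_per_document = [sorted(json.loads(doc)) for doc in documents]
```

The trade-off is that the relational table guarantees every row has the same columns, which is what analytics queries and BI tools rely on, while the document list pushes that check onto whoever reads the data.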
350Gb (growing couple of GB a week)... All these data need to be made available for Analytics
Do you have MySQL gurus in house? If yes, sure => just create and grow that MySQL cluster. The problem with this solution is not that it is MySQL, nor that it is not NoSQL => it is that it requires an expert to set it up and to always be by your side in case it needs to be changed. But guess what => SQL is MUCH better and simpler for analytics than a map/reduce-ish SQL simulation.
Something that can become a problem later with a MySQL solution is Oracle, which owns MySQL. So make sure you understand which MySQL features you can use for free, and which features you would have to pay for.
If you do not have a MySQL expert in house, or you would not like to pay for one, you can definitely turn to NoSQL. That does not mean you would not need expertise in the NoSQL product, but configuring and running X nodes as a single system is an extremely simple and natural process for NoSQL solutions.
For example, in Riak, and a couple of other NoSQL beasts, most of the distribution complexities are solved by the product without you needing to do anything at all => it really is that simple.
The price you pay with NoSQL is losing SQL (think about nice aggregating features) and consistency, which becomes eventual. If you are strictly doing analytics, eventual consistency may not be a price at all.
In return you get very natural Big Data handling, fault tolerance and much more.
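As a rough illustration of what "losing SQL" means for aggregation, the sketch below computes the same per-country total twice: once with a single GROUP BY (using Python's built-in sqlite3 as a stand-in for MySQL) and once with hand-written map and reduce steps. The `events` data and the function names are invented for the example and are not taken from any particular NoSQL product.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (country, clicks)
events = [("de", 3), ("us", 5), ("de", 4), ("us", 1), ("fr", 2)]

# SQL: the aggregation is one declarative statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
sql_result = dict(
    conn.execute("SELECT country, SUM(clicks) FROM events GROUP BY country")
)

# Map/reduce style: the same logic is spelled out by hand.
def map_phase(record):
    """Emit a (key, value) pair for each input record."""
    country, clicks = record
    yield country, clicks

def reduce_phase(key, values):
    """Combine all values that share a key."""
    return key, sum(values)

# The "shuffle" step: group emitted values by key before reducing.
grouped = defaultdict(list)
for record in events:
    for key, value in map_phase(record):
        grouped[key].append(value)

mr_result = dict(reduce_phase(key, values) for key, values in grouped.items())
```

Both dictionaries hold the same totals; the difference is how much code has to be written and maintained for every new analytical question.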
If you are in the Hadoop-xyz space, and you are okay with paying, take a look at Hadapt, which promises 5 times Hive's performance.
The question is of course now many months old, but... I recently came across InfiniDB, which puts a MySQL front end on a highly scalable, MapReduce-based Big Data engine aimed specifically at analytics. It may be a solution for this problem. In principle it should drop in and require very little administration and few code changes. Scaling up on one box or out on multiple servers is supported...
You switch when you start having the kinds of problems outlined in something like this comparative question: https://dba.stackexchange.com/questions/5/what-are-the-differences-between-nosql-and-a-traditional-rdbms
Other than that, it's a little difficult to answer the question beyond general advice, because you don't pose a specific problem that you are trying to solve (e.g. scaling, read speed, the problems with requiring 100% consistency, etc.).
InfiniDB is not free.
Check out http://code.google.com/p/shard-query
This is like Map-Reduce over a sharded, shared-nothing set of databases. It works great for star schemas: shard the fact table over N nodes and duplicate the dimension tables on each server.
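The layout described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Shard-Query's actual API: it uses in-memory sqlite3 databases as stand-ins for MySQL servers, and the tables `fact_sales` and `dim_product`, the modulo sharding rule, and the helper names are all assumptions made for the example.

```python
import sqlite3

# Made-up star schema data: one dimension table, one fact table.
DIM_PRODUCT = [(1, "books"), (2, "games")]
FACT_SALES = [  # (sale_id, product_id, amount)
    (1, 1, 10.0), (2, 1, 25.0), (3, 2, 5.0),
    (4, 2, 15.0), (5, 1, 20.0), (6, 2, 30.0),
]
N_SHARDS = 2

def make_shard(fact_rows):
    """Create one shard: its slice of the facts plus a full dimension copy."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
    conn.execute("CREATE TABLE fact_sales (sale_id INTEGER, product_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)", DIM_PRODUCT)
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", fact_rows)
    return conn

# Shard the fact table by sale_id modulo N; dimensions are duplicated everywhere.
shards = [
    make_shard([row for row in FACT_SALES if row[0] % N_SHARDS == i])
    for i in range(N_SHARDS)
]

# Each shard can run the join locally because it holds every dimension row.
PARTIAL_QUERY = """
    SELECT d.category, SUM(f.amount), COUNT(*)
    FROM fact_sales f JOIN dim_product d ON d.product_id = f.product_id
    GROUP BY d.category
"""

def scatter_gather(connections):
    """Run the partial query on every shard and merge the partial aggregates."""
    totals = {}
    for conn in connections:
        for category, amount, count in conn.execute(PARTIAL_QUERY):
            prev_amount, prev_count = totals.get(category, (0.0, 0))
            totals[category] = (prev_amount + amount, prev_count + count)
    return totals

result = scatter_gather(shards)
```

SUM and COUNT merge cleanly because partial results can simply be added; an average would have to be derived from the merged sum and count rather than by averaging the per-shard averages.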
You can check out this blog post for more info and performance testing results:
http://www.mysqlperformanceblog.com/2011/05/06/scale-out-mysql/
FYI: I'm the author of Shard-Query.