Hadoop (+HBase/HDFS) vs MySQL (or Postgres) - Loads of independent, structured data to be processed and queried

Hi there at SO,

I would like some ideas/comments on the following from you, honorable and venerable bunch.

I have 100M records which I need to process. I have 5 nodes (in a rocks cluster) to do this. The data is very structured and falls nicely into the relational data model. I want to do things in parallel since my processing takes some time.

As I see it I have two main options:

Install MySQL on each node and put 20M records on each. Use the head node to delegate queries to the nodes and aggregate the results. Query capabilities++, but I might risk some headaches when it comes to choosing partitioning strategies etc. (Q: is this what they call a MySQL/Postgres cluster?). The really bad part is that the processing of the records is now left up to me to take care of (how to distribute it across machines, etc.).
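Roughly, I imagine the head node doing something like the sketch below, i.e. scatter/gather: send the same query to every shard and merge the partial results. The host names, credentials, table and the mysql-connector-python driver are all made up for illustration.

```python
# Rough sketch of option 1: fan one query out to a MySQL shard on each node
# and concatenate the results on the head node. All names here are made up.
from concurrent.futures import ThreadPoolExecutor

import mysql.connector  # pip install mysql-connector-python (assumed driver)

NODES = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical hosts
QUERY = "SELECT * FROM wine WHERE colour = 'brown'"

def query_node(host):
    """Run the query against the local MySQL shard on one node."""
    conn = mysql.connector.connect(host=host, user="reader",
                                   password="secret", database="winedb")
    try:
        cur = conn.cursor()
        cur.execute(QUERY)
        return cur.fetchall()
    finally:
        conn.close()

# The head node queries every shard in parallel and merges the rows.
with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    results = [row for rows in pool.map(query_node, NODES) for row in rows]

print(len(results), "matching rows across all shards")
```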

Alternatively, install Hadoop, Hive and HBase (note that this might not be the most efficient way to store my data, since HBase is column-oriented) and just define the nodes. We write everything in the MapReduce paradigm and, bang, we live happily ever after. The problem here is that we lose the "real time" query capabilities (I know you can use Hive, but it is not suggested for real-time queries, which I need), since I also have some normal SQL queries to execute at times, e.g. "select * from wine where colour = 'brown'".
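To make the "no real time queries" point concrete: as far as I understand, even the simple wine query above would have to become a batch job, for example a Hadoop Streaming mapper along these lines (assuming tab-delimited records with the colour in the third field; no reducer is needed for a plain filter).

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper: the batch equivalent of
# "select * from wine where colour = 'brown'".
# Assumes tab-delimited input with the colour in the third field.
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    fields = record.split("\t")
    if len(fields) > 2 and fields[2] == "brown":
        print(record)   # emit the matching record unchanged
```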

Note that in theory, if I had 100M machines, I could do the whole thing instantly, since for each record the processing is independent of the others. Also, my data is read-only; I do not envisage any updates happening. I do not need/want 100M records on one node, and I do not want redundant data (since there is lots of it), so keeping it in BOTH MySQL/Postgres and Hadoop/HBase/HDFS is not a real option.

Many Thanks

asked Feb 03 '23 by MalteseUnderdog


2 Answers

Can you prove that MySQL is the bottleneck? 100M records is not that many, and it looks like you're not performing complex queries. Without knowing exactly what kind of processing you're doing, here is what I would do, in this order:

  1. Keep the 100M in MySQL. Take a look at Cloudera's Sqoop utility to import records from the database and process them in Hadoop.
  2. If MySQL is the bottleneck in (1), consider setting up slave replication, which will let you parallelize reads without the complexity of a sharded database. Since you've already stated that you don't need to write back to the database, this should be a viable solution. You can replicate your data to as many servers as needed (see the sketch after this list).
  3. If you are running complex select queries from the database, and (2) is still not viable, then consider using Sqoop to import your records and do whatever query transformations you require in Hadoop.
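To illustrate (2): because each replica holds the full data set, the head node can split the primary-key range into chunks and have each worker read its chunk from a different replica. This is only a sketch; the host names, credentials, table layout and the mysql-connector-python driver are all assumptions on my part.

```python
# Sketch only: parallel reads by primary-key range, one chunk per replica.
# Hosts, credentials, schema and the driver are assumed, not prescribed.
from concurrent.futures import ThreadPoolExecutor

import mysql.connector  # pip install mysql-connector-python (assumed driver)

REPLICAS = ["replica1", "replica2", "replica3", "replica4", "replica5"]
TOTAL_ROWS = 100_000_000                 # assumes contiguous ids 1..100M
CHUNK = TOTAL_ROWS // len(REPLICAS)      # 20M rows per replica

def process_chunk(job):
    host, lo, hi = job
    conn = mysql.connector.connect(host=host, user="reader",
                                   password="secret", database="winedb")
    try:
        cur = conn.cursor()
        cur.execute("SELECT * FROM wine WHERE id BETWEEN %s AND %s", (lo, hi))
        # Stand-in for the real per-record processing:
        return sum(1 for _ in cur)
    finally:
        conn.close()

jobs = [(host, i * CHUNK + 1, (i + 1) * CHUNK)
        for i, host in enumerate(REPLICAS)]
with ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
    print(sum(pool.map(process_chunk, jobs)), "records processed")
```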

In your situation, I would resist the temptation to jump off of MySQL, unless it is absolutely necessary.

answered Feb 05 '23 by bajafresh4life


There are a few questions to ask before suggesting anything.
Can you formulate your queries to access by primary key only? In other words, can you avoid all joins and table scans? If so, HBase is an option, particularly if you need a very high rate of read/write access.
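For illustration, key-only access in HBase looks roughly like the snippet below. I am assuming the happybase client (which talks to the HBase Thrift server); the host, table and row key are made up.

```python
# Sketch of get-by-row-key access in HBase via the happybase Thrift client.
# Requires a running HBase Thrift server; all names here are placeholders.
import happybase

connection = happybase.Connection("head-node")   # assumed Thrift host
table = connection.table("wine")                 # assumed table name

row = table.row(b"record-00000042")              # single lookup by row key
print(row)                                       # e.g. {b'cf:colour': b'brown'}
```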
I do not think Hive is a good option given the low data volume. If you expect it to grow significantly, you can consider it. In any case, Hive is good for analytical workloads, not for OLTP-style processing.
If you do need a relational model with joins and scans, I think a good solution might be one master node and 4 slaves, with replication between them. You would direct all writes to the master and balance reads across the whole cluster. This is especially good if you have many more reads than writes.
In this scheme you would have all 100M records (not that much) on each node. Within each node you can employ partitioning if appropriate.
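As an example of partitioning within a node, MySQL can hash-partition a table when it is created. This is only a sketch; the schema, partition count and the mysql-connector-python driver are assumptions.

```python
# Sketch: create a hash-partitioned table on one node's MySQL instance.
# Schema, partition count and connection details are placeholders.
import mysql.connector

DDL = """
CREATE TABLE wine (
    id INT NOT NULL,
    colour VARCHAR(32),
    PRIMARY KEY (id)
)
PARTITION BY HASH (id)
PARTITIONS 8
"""

conn = mysql.connector.connect(host="localhost", user="admin",
                               password="secret", database="winedb")
cur = conn.cursor()
cur.execute(DDL)
conn.close()
```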

answered Feb 05 '23 by David Gruzman