Hi there at SO,
I would like some ideas/comments on the following from you honorable and venerable bunch.
I have a 100M records which I need to process. I have 5 nodes (in a rocks cluster) to do this. The data is very structured and falls nicely in the relational data model. I want to do things in parallel since my processing takes some time.
As I see it I have two main options:
Install mysql on each node and put 20M records on each. Use the head node to delegate queries to the nodes and aggregate the results. Query Capabilities++, but I might risk some headaches when I come to choose partitioning strategies etc. (Q. Is this what they call mysql/postgres cluster?). The really bad part is that the processing of the records is left up to me now to take care of (how to distribute across machines etc)...
Alternatively install Hadoop, Hive and HBase (note that this might not be the most efficient way to store my data, since HBase is column oriented) and just define the nodes. We write everything in the MapReduce paradigm and, bang, we live happily ever after. The problem here is that we loose the "real time" query capabilities (I know you can use Hive, but that is not suggested for real time queries - which I need) - since I also have some normal sql queries to execute at times "select * from wine where colour = 'brown'".
Note that in theory - if I had 100M machines I could do the whole thing instantly since for each record the processing is independent of the other. Also - my data is read-only. I do not envisage any updates happening. I do not need/want 100M records on one node. I do not want there to be redundant data (since there is lots of it) so keeping it in BOTH mysql/postgres and Hadoop/HBase/HDFS. is not a real option.
Many Thanks
Can you prove that MySQL is the bottleneck? 100M records is not that many, and it looks like that you're not performing complex queries. Without knowing exactly what kind of processing, here is what I would do, in this order:
In your situation, I would resist the temptation to jump off of MySQL, unless it is absolutely necessary.
There are a few questions to ask, before suggesting.
Can you formulate your queries to access by primary key only? In other words - if you can avoid all joins and table scans. If so - HBase is an option, if you need very high rate of read/write accesses.
I do noth thing that Hive is good option taking into consideration low data volume. If you expect them to grow significantly - you can consider it. In any case Hive is good for the analytical workloads - not for the OLTP type of processing.
If you do need relational model with joins and scans - I think good solution might be one Master Node and 4 slaves, with replication between them. You will direct all writes to the master, and balance reads among whole cluster. It is especially good if you have much more reads then writes.
In this schema you will have all 100M records (not that match) on each node. Within each node you can employ partitioning if appropriate.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With