I have been trying to understand the MapReduce concept and apply it to my current situation. What is my situation? Well, I have an ETL tool in which data transformation happens outside of the source and destination data sources (databases). Hence, the source data source is used purely for extract and the destination purely for load.
So, this act of transformation today takes, say, about X hours for a million records. I would like to address a scenario where I have a billion records but still want the work done in the same X hours. So here is the need for my product to scale out (by adding more commodity machines) based on the scale of data. As you can see, I am only concerned with the ability to distribute my product's transformation functionality to different machines, thereby leveraging CPU power from all of them.
I started looking for options and came across Apache Hadoop and then, eventually, the concept of MapReduce. I was pretty successful in setting up Hadoop quickly in cluster mode without running into issues, and was happy to run a wordcount demo too. Soon I realized that to implement my own MapReduce model, I would have to redefine my product's transformation functionality as MAP and REDUCE functions.
Here's when the trouble began. I read a copy of Hadoop: The Definitive Guide, and I understood that many of the common use cases of Hadoop are in scenarios where one is faced with:
In my scenario I extract from a database and load to a database (both of which hold structured data), and my sole purpose is to bring more CPUs into play, in a reliable manner, and thereby distribute my transformation. Redefining my transformation to fit a Map and Reduce model is a huge challenge in itself. So here are my questions:
Have you used Hadoop in ETL scenarios? If yes, could you be specific about how you handled MapReducing your transformation? Have you used Hadoop purely for leveraging extra CPU power?
Is the MapReduce concept the universal answer to distributed computing? Are there other equally good options?
If you want to scale-out a processing problem over a lot of systems you must do two things:
If there are dependencies then these will be the limit in your horizontal scalability.
So if you are starting from a relational model then the main obstruction is the fact that you have relationships. Having these relationships is a great asset in relational databases but is a pain in the ... when trying to scale-out.
The simplest way to go from relational to independent parts is to make a jump and de-normalize your data into records that have everything in them and are focused on the part you want to do the processing around. Then you can distribute them over a huge cluster, and after the processing has been completed you use the results.
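To make the de-normalization step concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `customers` and `orders` tables, the field names, and the `transform` function are hypothetical, and a local thread pool merely stands in for the cluster. The point it tries to show is that once each record carries everything it needs, the transformation never has to look anything up, so records can be handed to any worker in any order.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical normalized source tables (illustrative data only).
customers = {
    1: {"name": "Alice", "country": "NL"},
    2: {"name": "Bob", "country": "US"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},
    {"order_id": 12, "customer_id": 1, "amount": 15.0},
]


def denormalize(order):
    """Copy the related customer fields into the order so it stands alone."""
    customer = customers[order["customer_id"]]
    return {
        **order,
        "customer_name": customer["name"],
        "country": customer["country"],
    }


def transform(record):
    """Hypothetical transformation; it reads nothing except its own record."""
    return {
        "order_id": record["order_id"],
        "label": f'{record["customer_name"].upper()} ({record["country"]})',
        "amount_cents": int(round(record["amount"] * 100)),
    }


def run(workers=2):
    # Resolve the relationships once, up front.
    records = [denormalize(order) for order in orders]
    # The pool stands in for a cluster: each record is processed independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, records))


results = run()
```

The join happens exactly once, in `denormalize`; after that, `transform` has no dependency on shared tables, which is what allows the work to be spread over more machines.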
If you cannot do such a jump you're in trouble.
So coming back to your questions:
# Have you used Hadoop in ETL scenarios?
Yes, with the input being Apache logfiles; the loading and transformation consisted of parsing, normalizing and filtering these loglines. The result wasn't put in a normal RDBMS!
# Is the MapReduce concept the universal answer to distributed computing? Are there other equally good options?
MapReduce is a very simple processing model that will work great for any processing problem you are able to split into a lot of smaller, 100% independent parts. The MapReduce model is so simple that, as far as I know, any problem that can be split into independent parts can be written as a series of MapReduce steps.
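As a rough illustration of how simple the model is, here is a single-process Python sketch of one MapReduce step. It is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the simplified logline format are assumptions made up for this example. The mapper sees one record at a time and the reducer sees one key at a time, which is why a framework can run both on many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call is independent of the others."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key (the framework does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key's values; keys are independent of each other."""
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(line):
    """Hypothetical mapper: parse a simplified logline, drop malformed ones."""
    parts = line.split()
    if parts and parts[-1].isdigit():
        yield parts[-1], 1


def count_reducer(key, values):
    """Hypothetical reducer: count the occurrences of one status code."""
    return sum(values)


# Simplified, made-up loglines in the form "METHOD PATH STATUS".
loglines = [
    "GET /index.html 200",
    "GET /missing.html 404",
    "this line is garbage",
    "GET /about.html 200",
]

pairs = map_phase(loglines, status_mapper)
counts = reduce_phase(shuffle(pairs), count_reducer)
```

Here `counts` maps each status code to the number of loglines that carried it. A multi-step job would feed the output of one such step into the mapper of the next.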
HOWEVER: It is important to note that at this moment only BATCH oriented processing can be done with Hadoop. If you want "realtime" processing you are currently out of luck.
At this moment I don't know of a better model for which an actual implementation exists.
# My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc, is my understanding correct?
Yep, that is the most common application.
MapReduce is "one" solution for "some" class of problems. It does not solve all distributed systems problems: think about high-TPS systems such as those in banks or in telecom signaling, where MR might be ineffective. But for non-real-time data processing MR performs very well, and you might consider it for massive ETL.
I cannot answer #1, as I haven't used MapReduce in ETL scenarios. However, I can say that MapReduce is not a "universal answer" for distributed computing; it's a useful tool for handling certain types of situations, where data is structured in a certain way. Think of it like a hashtable: very useful for certain situations, but not an "ultimate algorithm" by any definition of the term.
My personal understanding is that MapReduce is particularly useful for large quantities of "understructured" data; that is, it's useful for imposing some structure (basically, effectively providing a "first order" operation on large unstructured datasets). However, for datasets that are very large and relatively "tightly bound" (i.e. strong association between disparate data elements), it's (in my understanding) not a great solution.