Newbie here with Hadoop. Concept-wise it is pretty simple to understand; however, one of the real challenges is how to model the problem to be solved in the map-reduce architecture. Suppose my data contains two parts (all in Oracle): 1. Rather static data that doesn't change much. 2. Fresh data collected every day.
Currently the data processing basically reads the fresh data, finds and uses the corresponding static data (or metadata), applies some algorithm to it, and dumps the result back to Oracle.
How do I model such an application paradigm? Do I store the static data as part of the distributed cache? What if that data is pretty big?
Basically I am looking for more examples like the following: http://stevekrenzel.com/finding-friends-with-mapreduce
Thanks,
Basically the requirement is to do a join on two data sets. MapReduce programming requires a different way of thinking than normal programming. Here are some references on joins and some other patterns on top of MapReduce:
Data-Intensive Text Processing with MapReduce
MapReduce Design Patterns
Section 8.3 in Hadoop - The Definitive Guide
Coming back to the join, it can be done in multiple ways depending on the amount of data and how the data is laid out. The above references cover this in more detail.
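To make the two common options concrete, here is a minimal sketch in plain Python (not actual Hadoop code) that simulates both. It assumes the static data and the fresh data share a join key; the sample records, the tags "S" and "F", and the function names are all made up for illustration. A map-side (replicated) join is roughly what you get when the static data is small enough to ship to every mapper, for example via the distributed cache. A reduce-side join is the usual fallback when the static data is too big to hold in memory.

```python
from collections import defaultdict

# Hypothetical sample data: (key, value) pairs for both data sets.
STATIC = [("s1", "sensor-A"), ("s2", "sensor-B")]   # rarely changing metadata
FRESH = [("s1", 10), ("s2", 7), ("s1", 3), ("s9", 1)]  # daily records

def map_side_join(fresh, static):
    """Replicated join: each mapper keeps the whole static set in memory."""
    # In Hadoop this lookup table would typically be loaded from the
    # distributed cache during mapper setup (assumption of this sketch).
    lookup = dict(static)
    return [(k, lookup[k], v) for k, v in fresh if k in lookup]

def reduce_side_join(fresh, static):
    """Reduce-side join: tag records by source, group by key, pair them up."""
    # Map phase: emit (key, (source_tag, value)) for every record.
    tagged = [(k, ("S", v)) for k, v in static]
    tagged += [(k, ("F", v)) for k, v in fresh]

    # Shuffle phase: the framework groups all values with the same key.
    groups = defaultdict(list)
    for k, tagged_value in tagged:
        groups[k].append(tagged_value)

    # Reduce phase: within one key, combine static and fresh records.
    joined = []
    for k in sorted(groups):
        statics = [v for tag, v in groups[k] if tag == "S"]
        freshes = [v for tag, v in groups[k] if tag == "F"]
        for s in statics:
            for f in freshes:
                joined.append((k, s, f))
    return joined

if __name__ == "__main__":
    print(map_side_join(FRESH, STATIC))
    print(reduce_side_join(FRESH, STATIC))
```

Both functions perform an inner join, so the fresh record with key "s9" is dropped because it has no matching static record. The trade-off is memory versus shuffle cost: the map-side version avoids shuffling the fresh data but needs the static set to fit in each mapper's memory, while the reduce-side version sends both data sets through the shuffle.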
We are collecting real-life use cases here: http://hadoopilluminated.com/hadoop_book/Hadoop_Use_Cases.html
We already have good coverage of multiple domains, and will continue to add to it.
(Disclaimer: I am a co-author of this free Hadoop book.)
I would look at the following article about Map/Reduce patterns, which should give you a nice idea of common algorithms and their translation in the Map/Reduce world.
More generally, I don't think there's a magical formula to translate a problem into a set of Map/Reduce jobs. You have to ask yourself questions that vary from dataset to dataset. Looking at existing examples is a good thing, and you should definitely try to implement something on a little toy problem.
Also, if you have trouble abstracting your problem into a set of Map/Reduce jobs, you could use something like Hive, which works like a relational database with a few tweaks and generates the Map/Reduce jobs for you, so you don't have to worry too much about what happens underneath.