
Does anyone find Cascading for Hadoop Map Reduce useful?


I've been trying Cascading, but I cannot see any advantage over the classic MapReduce approach for writing jobs.

MapReduce jobs give me more freedom, and Cascading seems to put a lot of obstacles in the way.

It might do a good job of making simple things simple, but complex things... I find them extremely hard.

Is there something I'm missing? Is there an obvious advantage of Cascading over the classic approach?

In what scenarios should I choose Cascading over the classic approach? Is anyone using it and happy?

asked Sep 10 '10 by Federico


People also ask

What is cascading in Hadoop?

Cascading is a software abstraction layer for Apache Hadoop and Apache Flink. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs.

Why do we need more mappers than reducers in MapReduce?

Suppose your data size is small; then you don't need many mappers running to process the input files in parallel. However, if the <key,value> pairs generated by the mappers are large and diverse, then it makes sense to have more reducers, because you can process more <key,value> pairs in parallel.
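For what it's worth, in the plain Hadoop Java API the number of map tasks is derived from the input splits, while the reducer count is the knob you set explicitly. A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API (the job name and reducer count are illustrative only):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "reducer-count-example");

            // The number of map tasks is driven by the number of input splits;
            // the number of reduce tasks is set explicitly and should reflect how
            // large and diverse the intermediate <key,value> pairs are.
            job.setNumReduceTasks(8); // 8 is an arbitrary illustrative value
        }
    }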

How do Hadoop and MapReduce work together?

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application.

What happens in the map phase and reduce phase of the Hadoop MapReduce framework?

Map tasks process these chunks in parallel. The outputs of the map tasks are then used as inputs for the reduce tasks. Reducers process the intermediate data from the maps into smaller tuples, which leads to the final output of the framework. The MapReduce framework also takes care of the scheduling and monitoring of tasks.
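To make the two phases concrete, here is a minimal, hypothetical word-count-style sketch using the org.apache.hadoop.mapreduce API; the class names and tokenization are illustrative only:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: each input chunk is processed in parallel, emitting <word, 1> pairs.
    class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE); // intermediate output feeds the reduce phase
            }
        }
    }

    // Reduce phase: the framework groups intermediate pairs by key; the reducer
    // collapses each group into a smaller tuple, producing the final output.
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }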


2 Answers

Keeping in mind I'm the author of Cascading...

My suggestion is to use Pig or Hive if they make sense for your problem, Pig especially.

But if you are in the business of data, and not just poking around your data for insights, you will find the Cascading approach makes much more sense for most problems than raw MapReduce.

Your first obstacle with raw MapReduce will be thinking in MapReduce. Trivial problems are simple in MapReduce, but it's much easier to develop complex applications if you can work with a model that more easily maps to your problem domain (filter this, parse that, sort those, join the rest, etc).
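As a rough illustration of what working in the problem domain looks like, here is a minimal sketch of a Cascading pipe assembly, written against Cascading 2.x-style class names; the paths, field names, and regex are all hypothetical:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class FilterAndSortFlow {
        public static void main(String[] args) {
            // Taps describe where tuples come from and where they go.
            Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
            Tap sink = new Hfs(new TextLine(), args[1]);

            // The assembly is written in domain terms: filter these lines, then
            // group/sort them. The planner decides how many MapReduce jobs are
            // actually needed to run it.
            Pipe pipe = new Pipe("filter-and-sort");
            pipe = new Each(pipe, new Fields("line"), new RegexFilter(".*ERROR.*"));
            pipe = new GroupBy(pipe, new Fields("line"));

            FlowDef flowDef = FlowDef.flowDef()
                .setName("errors")
                .addSource(pipe, source)
                .addTailSink(pipe, sink);

            Flow flow = new HadoopFlowConnector(new Properties()).connect(flowDef);
            flow.complete();
        }
    }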

Next you will realize that a normal unit of work in Hadoop consists of multiple MapReduce jobs. Chaining jobs together is a solvable problem, but it should not leak into your application's domain-level code; it should be hidden and transparent.
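For contrast, a minimal sketch of the kind of hand-wired chaining this describes, using the plain Hadoop Job API; the intermediate path and job names are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStepJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Step 1: the application code has to know about the intermediate directory...
            Job parse = Job.getInstance(conf, "parse");
            FileInputFormat.addInputPath(parse, new Path(args[0]));
            FileOutputFormat.setOutputPath(parse, new Path("/tmp/intermediate"));
            if (!parse.waitForCompletion(true)) {
                System.exit(1);
            }

            // ...and Step 2 has to be wired to it by hand. This plumbing tends to
            // leak into domain-level code as the pipeline grows.
            Job aggregate = Job.getInstance(conf, "aggregate");
            FileInputFormat.addInputPath(aggregate, new Path("/tmp/intermediate"));
            FileOutputFormat.setOutputPath(aggregate, new Path(args[1]));
            System.exit(aggregate.waitForCompletion(true) ? 0 : 1);
        }
    }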

Further, you will find refactoring and creating reusable code much harder if you have to continually move functions between mappers and reducers, or from a mapper into the previous reducer to gain an optimization. Which leads to the issue of brittleness.

Cascading believes in failing as fast as possible. The planner attempts to resolve and satisfy the dependencies between all of the field names in your assembly before the Hadoop cluster is even engaged in work. This means that 90%+ of all issues will be found before you spend hours waiting for your job to find them during execution.

You can alleviate this in raw MapReduce code by creating domain objects like Person or Document, but many applications don't need all the fields downstream. Consider if you needed the average age of all males. You do not want to pay the IO penalty of passing a whole Person around the network when all you need is a binary gender and a numeric age.
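A sketch of how that might look as a Cascading assembly, keeping only the two fields the computation needs; the field names, the RegexFilter test, and the use of the Retain and AverageBy subassemblies are illustrative assumptions, not a canonical recipe:

    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.pipe.assembly.AverageBy;
    import cascading.pipe.assembly.Retain;
    import cascading.tuple.Fields;

    public class AverageAgeAssembly {
        public static Pipe build() {
            Pipe pipe = new Pipe("average-age-of-males");

            // Keep only tuples whose hypothetical "gender" field matches "M".
            pipe = new Each(pipe, new Fields("gender"), new RegexFilter("M"));

            // Drop everything except the two fields we actually use; only these
            // travel across the network, not whole Person records.
            pipe = new Retain(pipe, new Fields("gender", "age"));

            // Group by gender (a single group here) and average the ages.
            pipe = new AverageBy(pipe, new Fields("gender"), new Fields("age"),
                new Fields("avg_age"));

            return pipe;
        }
    }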

With fail-fast semantics and lazy binding of sinks and sources, it becomes very easy to build frameworks on Cascading that themselves create Cascading flows (which become many Hadoop MapReduce jobs). A project I'm currently involved with ends up with hundreds of MapReduce jobs per run, many created on the fly mid-run based on feedback from the data being processed. Search for Cascalog to see an example of a Clojure-based framework for simply creating complex processes. Or Bixo for a web-mining toolkit and framework that's far easier to customize than Nutch.

Finally, Hadoop is never used alone; that means your data is always pulled from some external source and pushed to another after processing. The dirty secret about Hadoop is that it is a very effective ETL framework (so it's silly to hear ETL vendors talk about using their tools to push/pull data onto/from Hadoop). Cascading eases this pain somewhat by allowing you to write your operations, applications, and unit tests independent of the integration end-points. Cascading is used in production to load systems like Membase, Memcached, Aster Data, Elastic Search, HBase, Hypertable, Cassandra, etc. (Unfortunately not all the adapters have been released by their authors.)

If you will, please send me a list of the issues you are experiencing with the interface. I am constantly looking for better ways to improve the API and documentation, and the user community is always around to help.

answered Nov 08 '22 by cwensel


I've been using Cascading for a couple of years now. I find it to be extremely helpful. Ultimately, it's about productivity gains. I can be much more efficient in creating and maintaining M/R jobs as compared to plain Java code. Here are a few reasons why:

  • A lot of the boilerplate code used to start a job is already written for you.
  • Composability. Generally code is easier to read and easier to reuse when it is written as components (operations) which are stitched together to perform some more complex processing.
  • I find unit testing to be easier. There are examples in the cascading package demonstrating how to write simple unit tests to directly test the output of flows.
  • The Tap (source and sink) paradigm makes it easy to change the input and output of a job, so you can, for example, start with output to STDOUT for development and debugging, then switch to HDFS SequenceFiles for batch jobs, and then switch to an HBase tap for pseudo-real-time updates.
  • Another great advantage of writing Cascading jobs is that you're really writing more of a factory that creates jobs. This can be a huge advantage when you need to build something dynamically (e.g. the results of one job control what subsequent jobs you create and run). Or, in another case, I needed to create a job for each combination of 6 binary variables. That is 64 jobs, all very similar. This would be a hassle with plain Hadoop MapReduce classes. (A sketch of this factory pattern follows the list.)
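A rough sketch of the Tap-swapping and factory ideas from the last two points, assuming Cascading 2.x-style classes; the taps, paths, and per-combination assembly logic are placeholders:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class FlowFactory {

        // The same assembly can be bound to different Taps: a text file while
        // debugging, HDFS SequenceFiles for batch runs, an HBase tap later on.
        static Flow connect(String name, Pipe tail, Tap source, Tap sink) {
            return new HadoopFlowConnector(new Properties()).connect(name, source, sink, tail);
        }

        public static void main(String[] args) {
            // Hypothetical: one flow per combination of six binary variables (64 jobs).
            for (int mask = 0; mask < (1 << 6); mask++) {
                Pipe tail = buildAssembly(mask);
                Tap source = new Hfs(new TextLine(), args[0]);
                Tap sink = new Hfs(new TextLine(), args[1] + "/combo-" + mask, SinkMode.REPLACE);
                connect("combo-" + mask, tail, source, sink).complete();
            }
        }

        static Pipe buildAssembly(int mask) {
            // Placeholder: the per-combination operations would be added here.
            return new Pipe("combo-" + mask);
        }
    }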

While there are a lot of pre-built components that you can compose together, if a particular section of your processing logic seems like it would be easier to just write in straight Java, you can always create a Cascading function to wrap that. This gives you the benefits of Cascading, while very custom operations can still be written as straight Java functions (implementing a Cascading interface).
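For example, such a wrapper is typically a small class that extends BaseOperation and implements the Function interface. This is a hedged sketch with a made-up NormalizeUrl operation; the field names and logic are illustrative only:

    import cascading.flow.FlowProcess;
    import cascading.operation.BaseOperation;
    import cascading.operation.Function;
    import cascading.operation.FunctionCall;
    import cascading.tuple.Fields;
    import cascading.tuple.Tuple;

    // Wraps ordinary Java string handling as a Cascading Function so it can
    // be dropped into a pipe assembly with Each.
    public class NormalizeUrl extends BaseOperation implements Function {

        public NormalizeUrl() {
            // One argument in, one declared output field.
            super(1, new Fields("normalized_url"));
        }

        @Override
        public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
            // Plain Java does the real work; Cascading only supplies the plumbing.
            String url = functionCall.getArguments().getString(0);
            String normalized = url == null ? "" : url.trim().toLowerCase();
            functionCall.getOutputCollector().add(new Tuple(normalized));
        }
    }

It would then be applied in an assembly with something like new Each(pipe, new Fields("url"), new NormalizeUrl(), Fields.RESULTS).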

answered Nov 08 '22 by Marc