Apache Spark - reduceByKey - Java

I am trying to understand how reduceByKey works in Spark, using Java as the programming language.

Say I have a sentence "I am who I am". I break the sentence into words and store it as a list [I, am, who, I, am].
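
The words RDD is built roughly along these lines (a sketch for context; the exact setup doesn't matter here):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative setup: split the sentence on spaces and distribute the words.
JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setMaster("local").setAppName("reduceByKeyExample"));
JavaRDD<String> words = sc.parallelize(Arrays.asList("I am who I am".split(" ")));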

Now this function assigns 1 to each word:

// Emit a (word, 1) pair for every word.
JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
});

So the output is something like this:

(I,1) 
(am,1)
(who,1)
(I,1)
(am,1)

Now if I have 3 reducers running, each reducer will get a key and the values associated with that key:

reducer 1:
    (I,1)
    (I,1)

reducer 2:
    (am,1)
    (am,1)

reducer 3:
    (who,1)

I wanted to know:

a. What exactly happens in the function below?
b. What do the type parameters in new Function2<Integer, Integer, Integer> mean?
c. Basically, how is the JavaPairRDD formed?

// Sum the counts for each key, two values at a time.
JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
    }
});
asked Aug 02 '14 by user641887


2 Answers

I think your questions revolve around the reduce function here, which is a function of 2 arguments returning 1, whereas in a Hadoop Reducer, you implement a many-to-many function.

This API is simpler, if less general. Here you provide an associative operation that can reduce any 2 values down to 1 (e.g. two integers sum to one integer). This is used to reduce all the values for each key down to 1. It's not necessary to provide an N-to-1 function, since that can be accomplished by applying a 2-to-1 function repeatedly. The trade-off is that you can't emit multiple values for one key.

The result is one (key, reduced value) pair from each (key, bunch of values).
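
Concretely, for the question's word count (a sketch assuming a Spark and Java version that accepts a lambda in place of the anonymous Function2; the pairing order below is illustrative, not guaranteed):

// Equivalent to the anonymous Function2 in the question.
JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

// For the key "I":   call(1, 1) -> 2
// For the key "am":  call(1, 1) -> 2
// For the key "who": only one value exists, so the function is never invoked
//                    and (who,1) passes through unchanged.
// counts then contains: (I,2), (am,2), (who,1)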

The Mapper and Reducer in classic Hadoop MapReduce were actually both quite similar (just that one takes a collection of values rather than a single value per key) and let you implement a lot of patterns. In a way that's good; in a way it was wasteful and complex.

You can still reproduce what Mappers and Reducers do, but the method in Spark is mapPartitions, possibly paired with groupByKey. These are the most general operations you might consider, and I'm not saying you should emulate MapReduce this way in Spark. In fact it's unlikely to be efficient. But it is possible.
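
For instance, a Reducer-style computation that may emit several pairs per key could be sketched with groupByKey (assuming the Spark 2.x Java API, where the flat-map function returns an Iterator; this illustrates the shape, it is not a recommendation):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;

JavaPairRDD<String, Integer> counts2 = ones.groupByKey().flatMapToPair(
    new PairFlatMapFunction<Tuple2<String, Iterable<Integer>>, String, Integer>() {
        @Override
        public Iterator<Tuple2<String, Integer>> call(Tuple2<String, Iterable<Integer>> kv) {
            int sum = 0;
            for (Integer v : kv._2()) {
                sum += v;
            }
            // Unlike reduceByKey, nothing stops you from emitting several pairs per key here.
            List<Tuple2<String, Integer>> out = new ArrayList<>();
            out.add(new Tuple2<>(kv._1(), sum));
            return out.iterator();
        }
    });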

answered Sep 25 '22 by Sean Owen


reduceByKey works as follows:

In an RDD, if Spark finds elements having the same key, it takes their values, applies the provided operation to them, and returns a value of the same type. For example, suppose you have an RDD with the elements:

[K, V1], [K, V2], where V1 and V2 are of the same type. The three type parameters of new Function2<Integer, Integer, Integer>() are then (see the annotated sketch after this list):

  1. the type of the value part of the first (K, V) pair, i.e. V1;
  2. the type of the value part of the second (K, V) pair, i.e. V2;
  3. the return type of the overridden call method, which is again the same type as V1 and V2 (the result of the operation performed inside call).
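
Annotated on the code from the question (comments added for illustration), the three type parameters line up like this:

new Function2<Integer, Integer, Integer>() {
//            ^type of i1, ^type of i2, ^return type of call
    @Override
    public Integer call(Integer i1, Integer i2) {
        // i1 and i2 are two values for the same key; either may itself
        // be the result of an earlier call (a partial sum).
        return i1 + i2;
    }
};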

Note that since RDDs are distributed across nodes, each partition first performs its own local (partial) reduce on the values it holds, and the partial results are then merged per key after the shuffle to produce the final result.
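
A hand-worked trace for "I am who I am", assuming the five pairs happen to land in two partitions (the split is hypothetical):

// partition 1: (I,1), (am,1), (who,1)  -> local partial reduce: (I,1), (am,1), (who,1)
// partition 2: (I,1), (am,1)           -> local partial reduce: (I,1), (am,1)
// after the shuffle, the partial results are merged per key:
//   I:   call(1, 1) -> 2
//   am:  call(1, 1) -> 2
//   who: single value, passes through as 1
// final result: (I,2), (am,2), (who,1)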

I hope this clears up your query.

answered Sep 22 '22 by napster