I am trying to understand how reduceByKey works in Spark, using Java as the programming language.
Say I have a sentence "I am who I am".
I break the sentence into words and store it as a list: [I, am, who, I, am].
Now this function assigns 1
to each word:
JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
});
So the output is something like this:
(I,1)
(am,1)
(who,1)
(I,1)
(am,1)
Now if I have 3 reducers running, each reducer will get a key and the values associated with that key:
reducer 1:
(I,1)
(I,1)
reducer 2:
(am,1)
(am,1)
reducer 3:
(who,1)
I wanted to know:
a. What exactly happens in the function below?
b. What do the parameters of new Function2<Integer, Integer, Integer> mean?
c. Basically, how is the resulting JavaPairRDD formed?
JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
    }
});
In Spark, reduceByKey is a frequently used transformation that aggregates data. It takes a dataset of key-value pairs (K, V), merges the values of each key using an associative reduce function, and produces a dataset of (K, V) pairs as output. It is a wide transformation, because it shuffles data across partitions, and it operates on pair RDDs (key/value pairs).
Note the contrast with reduce: reduce must pull the entire result down to a single location because it reduces everything to one final value, whereas reduceByKey produces one value per key. And since that per-key aggregation can run locally on each machine first, the result can remain an RDD and have further transformations applied to it. Like every transformation, reduceByKey is evaluated lazily.
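To make this concrete, here is a minimal, self-contained sketch of the same word count written with the Java 8 lambda form of the API used in the question. The app name, the local master, and the hard-coded word list are just assumptions for the example.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class ReduceByKeyExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("reduceByKeyExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The sentence from the question, already split into words.
        JavaRDD<String> words = sc.parallelize(Arrays.asList("I", "am", "who", "I", "am"));

        // Map step: pair every word with the count 1, i.e. (I,1), (am,1), ...
        JavaPairRDD<String, Integer> ones = words.mapToPair(w -> new Tuple2<>(w, 1));

        // reduceByKey merges the values of each key with the given associative
        // function; partial sums are computed inside each partition before the
        // shuffle, and the result is again a pair RDD.
        JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

        // Prints (am,2), (I,2), (who,1) in some order.
        counts.collect().forEach(System.out::println);

        sc.close();
    }
}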
I think your questions revolve around the reduce function here, which is a function of 2 arguments returning 1, whereas in a Reducer you implement a many-to-many function.
This API is simpler, if less general: you provide an associative operation that can reduce any 2 values down to 1 (e.g. two integers sum to one integer), and it is used to reduce all the values for each key down to 1. An N-to-1 function isn't necessary, because the same result can be reached by applying a 2-to-1 function repeatedly; the trade-off is that you can't emit multiple values for one key.
The result is one (key, reduced value) pair for each (key, collection of values).
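Concretely, in Function2<Integer, Integer, Integer> the first two type parameters are the two values handed to call and the third is the return type. For the key am with values [1, 1] the function is invoked once as call(1, 1) = 2; with more occurrences it is applied again to the intermediate result. A plain-Java sketch of that folding (no Spark API involved, just java.util.Arrays/List and an invented value list for illustration):

// How an associative 2-to-1 function collapses all the values of one key:
// it is applied pairwise until a single value remains.
List<Integer> valuesForKey = Arrays.asList(1, 1, 1); // e.g. "I" occurring three times
int reduced = valuesForKey.get(0);
for (int i = 1; i < valuesForKey.size(); i++) {
    reduced = reduced + valuesForKey.get(i); // plays the role of call(i1, i2)
}
// reduced == 3, so the emitted pair would be ("I", 3)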
The Mapper and Reducer in classic Hadoop MapReduce were actually quite similar to each other (one just receives a collection of values rather than a single value per key) and let you implement a lot of patterns. In a way that's good; in a way it was wasteful and complex.
You can still reproduce what Mappers and Reducers do, but the method in Spark is mapPartitions, possibly paired with groupByKey (see the sketch below). These are the most general operations you might consider; I'm not saying you should emulate MapReduce this way in Spark, and in fact it's unlikely to be efficient, but it is possible.
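For example, building on the ones pair RDD from the question, a Reducer-style computation could look roughly like the sketch below. This is only an illustration; for a plain sum it is less efficient than reduceByKey, because groupByKey shuffles every (word, 1) pair without any map-side combining.

// Reducer-style shape: groupByKey hands each key the full collection of its
// values, much like a Hadoop Reducer receives (key, Iterable<value>).
JavaPairRDD<String, Iterable<Integer>> grouped = ones.groupByKey();

JavaPairRDD<String, Integer> reducerStyleCounts = grouped.mapValues(values -> {
    // A classic Reducer could also emit several records per key; doing that in
    // Spark would need a flatMap-style variant rather than mapValues.
    int sum = 0;
    for (Integer v : values) {
        sum += v;
    }
    return sum;
});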
reduceByKey works as follows: within an RDD, whenever Spark finds elements that have the same key, it takes their values, applies the supplied operation to them, and returns a result of the same type. For example, if the RDD contains the elements [K, V1] and [K, V2], V1 and V2 must be of the same type; that is why new Function2<Integer, Integer, Integer>() carries three type parameters, the types of the two input values plus the type of the result.
Also note that because an RDD is distributed across nodes, each partition first performs its own local reduce; the partial results for each key are then shuffled together and merged with the same function to produce the final value per key (as illustrated in the sketch below).
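As a rough illustration of that two-level aggregation (assuming a JavaSparkContext sc is already set up; the exact assignment of elements to partitions is up to Spark, so the split described in the comments is only illustrative):

// Parallelize the (word, 1) pairs into 2 partitions explicitly.
JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(
        Arrays.asList(
                new Tuple2<String, Integer>("I", 1), new Tuple2<String, Integer>("am", 1),
                new Tuple2<String, Integer>("who", 1), new Tuple2<String, Integer>("I", 1),
                new Tuple2<String, Integer>("am", 1)),
        2);

// Inside each partition Spark first combines values that share a key locally
// (a map-side combine): one partition might hold ("I",1), ("am",1) and the
// other ("who",1), ("I",1), ("am",1). The per-partition partial results for a
// key are then shuffled to the same task, where the same function merges them
// into the final ("I",2), ("am",2), ("who",1).
JavaPairRDD<String, Integer> totals = pairs.reduceByKey((a, b) -> a + b);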
I guess this explains your query.