I am trying to understand how reduceByKey works in Spark, using Java as the programming language.
Say I have a sentence "I am who I am".
I break the sentence into words and store it as a list: [I, am, who, I, am].
Now this function assigns 1
to each word:
JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
});
So the output is something like this:
(I,1)
(am,1)
(who,1)
(I,1)
(am,1)
Now if I have 3 reducers running, each reducer will get a key and the values associated with that key:
reducer 1:
(I,1)
(I,1)
reducer 2:
(am,1)
(am,1)
reducer 3:
(who,1)
I wanted to know:
a. What exactly happens in the function below?
b. What do the parameters of new Function2<Integer, Integer, Integer> mean?
c. Basically, how is the resulting JavaPairRDD formed?
JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
    }
});
In Spark, reduceByKey is a frequently used transformation that aggregates data. It takes a dataset of key-value pairs (K, V), merges the values of each key using an associative reduce function, and produces a dataset of (K, V) pairs as output. It is a wide transformation, because it shuffles data across partitions, and it operates on pair RDDs (key/value pairs).
Note the contrast with reduce: reduce must pull the entire result down to a single location because it reduces everything to one final value, whereas reduceByKey produces one value per key. And since that per-key aggregation can run locally on each machine first, the result can remain an RDD and have further transformations applied to it. Like every transformation, reduceByKey is evaluated lazily.
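To make this concrete, here is a minimal, self-contained sketch of the same word count written with the Java 8 lambda form of the API used in the question. The app name, the local master, and the hard-coded word list are just assumptions for the example.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class ReduceByKeyExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("reduceByKeyExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The sentence from the question, already split into words.
        JavaRDD<String> words = sc.parallelize(Arrays.asList("I", "am", "who", "I", "am"));

        // Map step: pair every word with the count 1, i.e. (I,1), (am,1), ...
        JavaPairRDD<String, Integer> ones = words.mapToPair(w -> new Tuple2<>(w, 1));

        // reduceByKey merges the values of each key with the given associative
        // function; partial sums are computed inside each partition before the
        // shuffle, and the result is again a pair RDD.
        JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

        // Prints (am,2), (I,2), (who,1) in some order.
        counts.collect().forEach(System.out::println);

        sc.close();
    }
}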
I think your questions revolve around the reduce function here, which is a function of 2 arguments returning 1, whereas in a Reducer you implement a many-to-many function.
This API is simpler, if less general: you provide an associative operation that can reduce any 2 values down to 1 (e.g. two integers sum to one integer), and it is used to reduce all the values for each key down to 1. An N-to-1 function isn't necessary, because the same result can be reached by applying a 2-to-1 function repeatedly; the trade-off is that you can't emit multiple values for one key.
The result is one (key, reduced value) pair for each (key, collection of values).
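Concretely, in Function2<Integer, Integer, Integer> the first two type parameters are the two values handed to call and the third is the return type. For the key am with values [1, 1] the function is invoked once as call(1, 1) = 2; with more occurrences it is applied again to the intermediate result. A plain-Java sketch of that folding (no Spark API involved, just java.util.Arrays/List and an invented value list for illustration):

// How an associative 2-to-1 function collapses all the values of one key:
// it is applied pairwise until a single value remains.
List<Integer> valuesForKey = Arrays.asList(1, 1, 1); // e.g. "I" occurring three times
int reduced = valuesForKey.get(0);
for (int i = 1; i < valuesForKey.size(); i++) {
    reduced = reduced + valuesForKey.get(i); // plays the role of call(i1, i2)
}
// reduced == 3, so the emitted pair would be ("I", 3)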
The Mapper and Reducer in classic Hadoop MapReduce were actually quite similar to each other (one just receives a collection of values rather than a single value per key) and let you implement a lot of patterns. In a way that's good; in a way it was wasteful and complex.
You can still reproduce what Mappers and Reducers do, but the method in Spark is mapPartitions, possibly paired with groupByKey (see the sketch below). These are the most general operations you might consider; I'm not saying you should emulate MapReduce this way in Spark, and in fact it's unlikely to be efficient, but it is possible.
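For example, building on the ones pair RDD from the question, a Reducer-style computation could look roughly like the sketch below. This is only an illustration; for a plain sum it is less efficient than reduceByKey, because groupByKey shuffles every (word, 1) pair without any map-side combining.

// Reducer-style shape: groupByKey hands each key the full collection of its
// values, much like a Hadoop Reducer receives (key, Iterable<value>).
JavaPairRDD<String, Iterable<Integer>> grouped = ones.groupByKey();

JavaPairRDD<String, Integer> reducerStyleCounts = grouped.mapValues(values -> {
    // A classic Reducer could also emit several records per key; doing that in
    // Spark would need a flatMap-style variant rather than mapValues.
    int sum = 0;
    for (Integer v : values) {
        sum += v;
    }
    return sum;
});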
reduceByKey works as follows: within an RDD, whenever Spark finds elements that have the same key, it takes their values, applies the supplied operation to them, and returns a result of the same type. For example, if the RDD contains the elements [K, V1] and [K, V2], V1 and V2 must be of the same type; that is why new Function2<Integer, Integer, Integer>() carries three type parameters, the types of the two input values plus the type of the result.
Also note that because an RDD is distributed across nodes, each partition first performs its own local reduce; the partial results for each key are then shuffled together and merged with the same function to produce the final value per key (as illustrated in the sketch below).
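As a rough illustration of that two-level aggregation (assuming a JavaSparkContext sc is already set up; the exact assignment of elements to partitions is up to Spark, so the split described in the comments is only illustrative):

// Parallelize the (word, 1) pairs into 2 partitions explicitly.
JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(
        Arrays.asList(
                new Tuple2<String, Integer>("I", 1), new Tuple2<String, Integer>("am", 1),
                new Tuple2<String, Integer>("who", 1), new Tuple2<String, Integer>("I", 1),
                new Tuple2<String, Integer>("am", 1)),
        2);

// Inside each partition Spark first combines values that share a key locally
// (a map-side combine): one partition might hold ("I",1), ("am",1) and the
// other ("who",1), ("I",1), ("am",1). The per-partition partial results for a
// key are then shuffled to the same task, where the same function merges them
// into the final ("I",2), ("am",2), ("who",1).
JavaPairRDD<String, Integer> totals = pairs.reduceByKey((a, b) -> a + b);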
I guess this explains your query.