I am learning Spark, but I can't understand this function combineByKey.
>>> data = sc.parallelize([("A",1),("A",2),("B",1),("B",2),("C",1)] )
>>> data.combineByKey(lambda v : str(v)+"_", lambda c, v : c+"@"+str(v), lambda c1, c2 : c1+c2).collect()
The output is:
[('A', '1_2_'), ('C', '1_'), ('B', '1_2_')]
First, I am very confused: where is the @ from the second step, lambda c, v : c+"@"+str(v)? I can't find any @ in the result.
Second, I read the function description for combineByKey, but I am confused by the algorithm flow.
The combineByKey method: in order to aggregate an RDD's elements in parallel, Spark's combineByKey method requires three functions.
Create a combiner. The first required argument of combineByKey is a function used as the very first aggregation step for each key. It is executed whenever a new key is found in a partition, and it runs locally within that partition, on a single value.
The signature is RDD.combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None, partitionFunc=<function portable_hash>): a generic function to combine the elements for each key using a custom set of aggregation functions.
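For reference, here is a minimal PySpark sketch of that signature, passing the three functions by name rather than as inline lambdas (the names create_combiner, merge_value and merge_combiners are just illustrative; the behaviour is the same as the lambdas in the question):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()   # in the pyspark shell, sc already exists

data = sc.parallelize([("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)])

def create_combiner(v):
    # first value seen for a key in a partition: turn it into a combiner
    return str(v) + "_"

def merge_value(c, v):
    # another value for the same key in the same partition: fold it into the combiner
    return c + "@" + str(v)

def merge_combiners(c1, c2):
    # combiners for the same key coming from different partitions: merge them
    return c1 + c2

print(data.combineByKey(create_combiner, merge_value, merge_combiners).collect())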
The groupByKey call makes no attempt at merging/combining values, so it's an expensive operation.

The combineByKey call is just such an optimization. When using combineByKey, values are merged into one value at each partition, then each partition value is merged into a single value. It's worth noting that the type of the combined value does not have to match the type of the original value, and often times it won't be. The combineByKey function takes 3 functions as arguments:

A function that creates a combiner. In the aggregateByKey function the first argument was simply an initial zero value. In combineByKey we provide a function that will accept our current value as a parameter and return our new value that will be merged with additional values.

The second function is a merging function that takes a value and merges/combines it into the previously collected values.

The third function combines the merged values together. Basically this function takes the new values produced at the partition level and combines them until we end up with one singular value.
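To illustrate the point that the combined type does not have to match the original value type, here is a small sketch (again assuming the pyspark shell's sc) that combines the question's Int values into Python lists per key:

per_key_lists = sc.parallelize([("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)]).combineByKey(
    lambda v: [v],               # createCombiner: start a list for a new key in this partition
    lambda acc, v: acc + [v],    # mergeValue: append further values found in the same partition
    lambda a, b: a + b           # mergeCombiners: concatenate the per-partition lists
)
print(per_key_lists.collect())
# e.g. [('A', [1, 2]), ('C', [1]), ('B', [1, 2])]  (key and value order may vary)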
In other words, to understand combineByKey, it's useful to think of how it handles each element it processes. As combineByKey goes through the elements in a partition, each element either has a key it hasn't seen before or has the same key as a previous element.

If it's a new element, combineByKey uses a function we provide, called createCombiner(), to create the initial value for the accumulator on that key. It's important to note that this happens the first time a key is found in each partition, rather than only the first time the key is found in the RDD.

If it is a value we have seen before while processing that partition, it will instead use the provided function, mergeValue(), with the current value for the accumulator for that key and the new value.

Since each partition is processed independently, we can have multiple accumulators for the same key. When we are merging the results from each partition, if two or more partitions have an accumulator for the same key we merge the accumulators using the user-supplied mergeCombiners() function.
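This per-partition behaviour also answers the first question about the missing @: the output ('A', '1_2_') can only come from createCombiner producing "1_" and "2_" separately and mergeCombiners concatenating them, which means the two values for "A" ended up in different partitions, so mergeValue (the function with the "@") was never called. A quick way to see the @ appear is to force all of the data into a single partition (a sketch; the exact default partitioning depends on your local setup):

# same data as in the question, but forced into one partition (the second
# argument to parallelize), so repeated keys must go through mergeValue
one_part = sc.parallelize([("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)], 1)

result = one_part.combineByKey(
    lambda v: str(v) + "_",           # createCombiner: first value of a key in a partition
    lambda c, v: c + "@" + str(v),    # mergeValue: later values of that key in the same partition
    lambda c1, c2: c1 + c2            # mergeCombiners: merge the per-partition combiners
)
print(result.collect())
# now the repeated keys go through mergeValue, so you should see something like
# [('A', '1_@2'), ('B', '1_@2'), ('C', '1_')]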
Here is an example of combineByKey. The objective is to find a per-key average of the input data.
scala> val kvData = Array(("a",1),("b",2),("a",3),("c",9),("b",6))
kvData: Array[(String, Int)] = Array((a,1), (b,2), (a,3), (c,9), (b,6))
scala> val kvDataDist = sc.parallelize(kvData,5)
kvDataDist: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:26
scala> val keyAverages = kvDataDist.combineByKey(x=>(x,1),(a: (Int,Int),x: Int)=>(a._1+x,a._2+1),(b: (Int,Int),c: (Int,Int))=>(b._1+c._1,b._2+c._2))
keyAverages: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[4] at combineByKey at <console>:25
scala> keyAverages.collect
res0: Array[(String, (Int, Int))] = Array((c,(9,1)), (a,(4,2)), (b,(8,2)))
scala> val keyAveragesFinal = keyAverages.map(x => (x._1,x._2._1/x._2._2))
keyAveragesFinal: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:25
scala> keyAveragesFinal.collect
res1: Array[(String, Int)] = Array((c,9), (a,2), (b,4))
combineByKey takes 3 functions as arguments:
Function 1 = createCombiner : called once per key 'k' in each partition
Function 2 = mergeValue : called once for each additional occurrence of key 'k' within a partition, i.e. (occurrences of 'k' in the partition - 1) times
Function 3 = mergeCombiners : called to merge the per-partition combiners for key 'k', i.e. (number of partitions in which 'k' exists - 1) times
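For completeness, here is a rough PySpark equivalent of the Scala average example above (a sketch using the same sum-and-count combiner; note that Python's / gives a float average rather than the integer division in the Scala version):

kv_data = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 9), ("b", 6)], 5)

key_averages = kv_data.combineByKey(
    lambda x: (x, 1),                          # createCombiner: (sum, count) for a new key
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # mergeValue: fold a value into (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1])    # mergeCombiners: add the per-partition pairs
).mapValues(lambda p: p[0] / p[1])             # divide sum by count

print(key_averages.collect())
# e.g. [('c', 9.0), ('a', 2.0), ('b', 4.0)]  (key order may vary)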