 

Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

I want to share this particular Apache Spark with Python solution because the documentation for it is quite poor.

I wanted to calculate the average value of K/V pairs (stored in a Pairwise RDD), by KEY. Here is what the sample data looks like:

>>> rdd1.take(10) # Show a small sample.
[(u'2013-10-09', 7.60117302052786),
 (u'2013-10-10', 9.322709163346612),
 (u'2013-10-10', 28.264462809917358),
 (u'2013-10-07', 9.664429530201343),
 (u'2013-10-07', 12.461538461538463),
 (u'2013-10-09', 20.76923076923077),
 (u'2013-10-08', 11.842105263157894),
 (u'2013-10-13', 32.32514177693762),
 (u'2013-10-13', 26.249999999999996),
 (u'2013-10-13', 10.693069306930692)]

Now the following code sequence is a less than optimal way to do it, but it does work. It is what I was doing before I figured out a better solution. It's not terrible but -- as you'll see in the answer section -- there is a more concise, efficient way.

>>> import operator
>>> countsByKey = sc.broadcast(rdd1.countByKey())
>>> # SAMPLE OUTPUT of countsByKey.value: {u'2013-09-09': 215, u'2013-09-08': 69, ... snip ...}
>>> rdd1 = rdd1.reduceByKey(operator.add)  # Calculate the numerators (i.e. the SUMs).
>>> rdd1 = rdd1.map(lambda x: (x[0], x[1]/countsByKey.value[x[0]]))  # Divide each SUM by its denominator (i.e. COUNT).
>>> print(rdd1.collect())
  [(u'2013-10-09', 11.235365503035176),
   (u'2013-10-07', 23.39500642456595),
   ... snip ...
  ]
asked Apr 28 '15 by NYCeyes


People also ask

What is the difference between RDD and pair RDD?

Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network.
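For illustration, here is a minimal sketch of that idea (the values are abbreviated from the sample data above; it assumes the same SparkContext sc used elsewhere in this post):

# Minimal sketch: any RDD of 2-tuples is treated as a pair RDD.
pairs = sc.parallelize([(u'2013-10-09', 7.6), (u'2013-10-10', 9.3), (u'2013-10-09', 20.7)])

# Pair-RDD operations such as keys(), groupByKey(), and reduceByKey()
# use the first element of each tuple as the key.
print(pairs.keys().distinct().collect())            # [u'2013-10-09', u'2013-10-10'] (order may vary)
print(pairs.groupByKey().mapValues(list).collect()) # e.g. [(u'2013-10-10', [9.3]), (u'2013-10-09', [7.6, 20.7])]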

How many RDDs can cogroup() work on at once?

cogroup() can be used for much more than just implementing joins. We can also use it to implement intersect by key. Additionally, cogroup() can work on three or more RDDs at once.
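As a hedged illustration with made-up keys and values: in PySpark, the multi-RDD form of cogroup() is exposed as groupWith(), which accepts additional RDDs as arguments:

x = sc.parallelize([('a', 1), ('b', 2)])
y = sc.parallelize([('a', 3)])
z = sc.parallelize([('b', 4), ('c', 5)])

# groupWith() is PySpark's alias for cogroup() over multiple RDDs; each
# result value holds one iterable of values per input RDD.
grouped = x.groupWith(y, z).mapValues(lambda vs: tuple(map(list, vs)))
print(sorted(grouped.collect()))
# [('a', ([1], [3], [])), ('b', ([2], [], [4])), ('c', ([], [], [5]))]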

How do you use reduceByKey in Pyspark?

reduceByKey. Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.
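A small example (with hypothetical data) of reduceByKey() using an associative and commutative function:

from operator import add

# Word-count-style example: values with the same key are merged with add().
counts = sc.parallelize([('a', 1), ('b', 1), ('a', 1)]).reduceByKey(add)
print(sorted(counts.collect()))   # [('a', 2), ('b', 1)]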


2 Answers

Now a much better way to do this is to use the rdd.aggregateByKey() method. Because this method is so poorly documented in the Apache Spark with Python documentation -- which is why I wrote this Q&A -- I had been using the code sequence above until recently. But again, it's less efficient, so avoid doing it that way unless necessary.

Here's how to do the same using the rdd.aggregateByKey() method (recommended):

By KEY, simultaneously calculate the SUM (the numerator for the average that we want to compute), and COUNT (the denominator for the average that we want to compute):

>>> aTuple = (0,0)  # The zero value for each key: (runningSum, runningCount).
>>> # Note: Python 3 lambdas can no longer unpack tuple parameters, hence the indexing below.
>>> rdd1 = rdd1.aggregateByKey(aTuple, lambda a,b: (a[0] + b,    a[1] + 1),
...                                    lambda a,b: (a[0] + b[0], a[1] + b[1]))

Here is what each a and b pair above means (so you can visualize what's happening):

First lambda expression, for the Within-Partition Reduction Step:
   a: is a TUPLE that holds: (runningSum, runningCount).
   b: is a SCALAR that holds the next Value.

Second lambda expression, for the Cross-Partition Reduction Step:
   a: is a TUPLE that holds: (runningSum, runningCount).
   b: is a TUPLE that holds: (nextPartitionsSum, nextPartitionsCount).
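If the positional a and b arguments are hard to follow, the same call can be written with named helper functions. This is just a sketch; the names within_partition and across_partitions are illustrative, not part of the Spark API:

def within_partition(acc, value):
    # acc is the running (runningSum, runningCount) tuple; value is the next scalar.
    return (acc[0] + value, acc[1] + 1)

def across_partitions(acc1, acc2):
    # Both arguments are (runningSum, runningCount) tuples from different partitions.
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])

rdd1 = rdd1.aggregateByKey((0, 0), within_partition, across_partitions)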

Finally, calculate the average for each KEY, and collect results.

>>> finalResult = rdd1.mapValues(lambda v: v[0]/v[1]).collect()
>>> print(finalResult)
  [(u'2013-09-09', 11.235365503035176),
   (u'2013-09-01', 23.39500642456595),
   (u'2013-09-03', 13.53240060820617),
   (u'2013-09-05', 13.141148418977687),
   ... snip ...
  ]

I hope this question and answer with aggregateByKey() will help.

answered by NYCeyes


To my mind, a more readable equivalent to aggregateByKey() with two lambdas is:

rdd1 = rdd1 \
    .mapValues(lambda v: (v, 1)) \
    .reduceByKey(lambda a,b: (a[0]+b[0], a[1]+b[1]))

In this way the whole average calculation would be:

avg_by_key = rdd1 \
    .mapValues(lambda v: (v, 1)) \
    .reduceByKey(lambda a,b: (a[0]+b[0], a[1]+b[1])) \
    .mapValues(lambda v: v[0]/v[1]) \
    .collectAsMap()
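Since collectAsMap() returns an ordinary Python dict, individual per-key averages can then be looked up directly, for example:

# avg_by_key is a plain dict after collectAsMap(); the key shown is from the sample data above.
print(avg_by_key[u'2013-10-09'])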
answered by pat