
Reduce by key in python

Tags:

python

reduce

I'm trying to think through the most efficient way to do this in python.

Suppose I have a list of tuples:

[('dog',12,2), ('cat',15,1), ('dog',11,1), ('cat',15,2), ('dog',10,3), ('cat',16,3)]

And suppose I have a function which takes two of these tuples and combines them:

def my_reduce(obj1, obj2):
    return (obj1[0],max(obj1[1],obj2[1]),min(obj1[2],obj2[2]))

How do I perform an efficient reduce by 'key' where the key here could be the first value, so the final result would be something like:

[('dog',12,1), ('cat',16,1)]
asked Apr 29 '15 by mgoldwasser


People also ask

What is reduce() in Python?

Python's reduce() is a function that implements a mathematical technique called folding or reduction. reduce() is useful when you need to apply a function to an iterable and reduce it to a single cumulative value.
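For example, the folding behaviour can be seen with a simple sum:

```python
from functools import reduce

# Fold the list into a single value: (((1 + 2) + 3) + 4) = 10
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4])
print(total)  # 10
```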

When to use reduce by key?

In Spark, the reduceByKey function is a frequently used transformation operation that performs aggregation of data. It receives key-value pairs (K, V) as an input, aggregates the values based on the key and generates a dataset of (K, V) pairs as an output.

What is the difference between reduce and reduceByKey in Spark?

Basically, reduce must pull the entire dataset down into a single location because it is reducing to one final value. reduceByKey on the other hand is one value for each key. And since this action can be run on each machine locally first then it can remain an RDD and have further transformations done on its dataset.

What does reduceByKey do in Pyspark?

Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.
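The reduceByKey semantics can be mimicked in plain Python without Spark; a minimal sketch (summing values per key, using illustrative data) with itertools.groupby:

```python
from itertools import groupby
from operator import itemgetter

pairs = [('a', 1), ('b', 2), ('a', 3), ('b', 4)]

# groupby only merges adjacent runs, so sort by key first
pairs.sort(key=itemgetter(0))
reduced = {key: sum(v for _, v in group)
           for key, group in groupby(pairs, key=itemgetter(0))}
print(reduced)  # {'a': 4, 'b': 6}
```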


2 Answers

Alternatively, if you have pandas installed:

import pandas as pd

l = [('dog',12,2), ('cat',15,1), ('dog',11,1), ('cat',15,2), ('dog',10,3), ('cat',16,3)]

pd.DataFrame(data=l, columns=['animal', 'm', 'n']).groupby('animal').agg({'m':'max', 'n':'min'})
Out[6]: 
         m  n
animal       
cat     16  1
dog     12  1

To get the original format:

list(zip(df.index, *df.values.T)) # df is the grouped result above
Out[14]: [('cat', 16, 1), ('dog', 12, 1)]
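If you'd rather stay in the standard library, the question's own my_reduce can be folded over each group directly: sort by key, then apply functools.reduce per group. A sketch:

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

def my_reduce(obj1, obj2):
    return (obj1[0], max(obj1[1], obj2[1]), min(obj1[2], obj2[2]))

l = [('dog',12,2), ('cat',15,1), ('dog',11,1),
     ('cat',15,2), ('dog',10,3), ('cat',16,3)]

# sort so that tuples sharing a key are adjacent, then reduce each run
result = [reduce(my_reduce, group)
          for _, group in groupby(sorted(l), key=itemgetter(0))]
print(result)  # [('cat', 16, 1), ('dog', 12, 1)]
```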
answered Oct 27 '22 by Anzel


I don't think reduce is a good tool for this job, because you will have to first use itertools or similar to group the list by the key. Otherwise you will be comparing cats and dogs and all hell will break loose!

Instead just a simple loop is fine:

>>> my_list = [('dog',12,2), ('cat',15,1), ('dog',11,1), ('cat',15,2)]
>>> output = {}
>>> for animal, high, low in my_list:
...     try:
...         prev_high, prev_low = output[animal]
...     except KeyError:
...         output[animal] = high, low
...     else:
...         output[animal] = max(prev_high, high), min(prev_low, low)

Then if you want the original format back:

>>> output = [(k,) + v for k, v in output.items()]
>>> output
[('dog', 12, 1), ('cat', 15, 1)]

Note this will destroy the ordering from the original list. If you want to preserve the order in which the keys first appear, initialise output with an OrderedDict instead (on Python 3.7+, plain dicts already preserve insertion order, so this only matters on older versions).

answered Oct 27 '22 by wim