 

Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something of the flavor:

My_KMV = My_KV.reduce(lambda a, b: a.append([b])) 

The error that I get when this occurs is:

'NoneType' object has no attribute 'append'.

My keys are integers and values V1,...,Vn are tuples. My goal is to create a single pair with the key and a list of the values (tuples).

asked Nov 18 '14 by TravisJ

People also ask

How do I use reduce by key in Spark?

In Spark, the reduceByKey function is a frequently used transformation that aggregates data. It receives key-value pairs (K, V) as input, aggregates the values by key, and produces a dataset of (K, V) pairs as output.

What is reduce by key in PySpark?

PySpark's reduceByKey() transformation merges the values of each key using an associative reduce function. It is a wide transformation because it shuffles data across partitions, and it operates on a pair RDD (key/value pairs).
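
For instance, a minimal sketch (assuming a SparkContext named sc is already available):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
summed = pairs.reduceByKey(lambda a, b: a + b)  # merge values per key with an associative function
# summed.collect() -> [('a', 4), ('b', 2)] (order not guaranteed)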

What is group by key and reduce by key in Spark?

Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation. The key difference is that reduceByKey does a map-side combine and groupByKey does not.
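
To illustrate, a sketch of the two side by side (assuming a SparkContext named sc; both produce the same result, but reduceByKey pre-combines values within each partition before the shuffle):

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
pairs.reduceByKey(lambda a, b: a + b).collect()  # combines map-side, then shuffles partial sums
pairs.groupByKey().mapValues(sum).collect()      # shuffles every value, then sums
# both yield [('a', 3), ('b', 3)] (order not guaranteed)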

Can we use reduceByKey in Spark DataFrame?

reduceByKey is not available on a regular (single-value) RDD; it is only defined on a pair RDD.
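
For example (a sketch, assuming a SparkContext named sc), calling reduceByKey on a plain RDD fails, so you build a pair RDD first:

nums = sc.parallelize([1, 2, 3, 4])
# nums.reduceByKey(lambda a, b: a + b)  # fails: elements are not (key, value) pairs
pairs = nums.map(lambda n: (n % 2, n))  # key each number by parity to get a pair RDD
pairs.reduceByKey(lambda a, b: a + b).collect()  # [(0, 6), (1, 4)] (order not guaranteed)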


1 Answer

Map and ReduceByKey

The input and output types of reduce must be the same, so if you want to aggregate values into a list, you have to map each value to a one-element list first. Afterwards you combine the lists into one list.
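
Applied to the question's RDD, the pattern looks like this (a sketch, assuming My_KV is the (K, V) pair RDD from the question; the list concatenation used here is explained below):

My_KMV = My_KV.map(lambda kv: (kv[0], [kv[1]])) \
              .reduceByKey(lambda a, b: a + b)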

Combining lists

You'll need an operation that combines two lists into one. Python provides several ways to combine lists.

append modifies the first list and will always return None.

x = [1, 2, 3]
x.append([4, 5])
# x is [1, 2, 3, [4, 5]]

extend also modifies the first list in place, but appends the elements of its argument individually instead of nesting it:

x = [1, 2, 3]
x.extend([4, 5])
# x is [1, 2, 3, 4, 5]

Both methods return None, but you need an expression that returns the combined list, so use the plus operator instead.

x = [1, 2, 3] + [4, 5]
# x is [1, 2, 3, 4, 5]

Spark

file = spark.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda actor: (actor.split(",")[0], actor))
              # transform each value into a list
              .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
              # combine lists: ([1,2,3] + [4,5]) becomes [1,2,3,4,5]
              .reduceByKey(lambda a, b: a + b))

CombineByKey

It's also possible to solve this with combineByKey, which is used internally to implement reduceByKey, but it's more complex, and "using one of the specialized per-key combiners in Spark can be much faster". Your use case is simple enough for the solution above.
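
For reference, the same aggregation with combineByKey might look like this (a sketch, assuming the question's pair RDD My_KV; the three functions create, extend, and merge per-key lists):

My_KMV = My_KV.combineByKey(
    lambda v: [v],             # createCombiner: start a new list from the first value
    lambda acc, v: acc + [v],  # mergeValue: add a value to a partition-local list
    lambda a, b: a + b         # mergeCombiners: concatenate lists from different partitions
)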

GroupByKey

It's also possible to solve this with groupByKey, but since it does no map-side combining, every value is shuffled across the network, which can make it much slower for big data sets.
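
The groupByKey version is the shortest (a sketch, again assuming the question's pair RDD My_KV; mapValues(list) converts each grouped iterable into a plain list):

My_KMV = My_KV.groupByKey().mapValues(list)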

answered Sep 30 '22 by Christian Strempfer