Extracting a dictionary from an RDD in Pyspark

Tags: python, apache-spark, pyspark

This is a homework question:

I have an RDD which is a collection of tuples. I also have a function which returns a dictionary from each input tuple. In a sense, it is the opposite of a reduce function.

With map, I can easily go from an RDD of tuples to an RDD of dictionaries. But, since a dictionary is a collection of (key, value) pairs, I would like to convert the RDD of dictionaries into an RDD of (key, value) tuples holding each dictionary's contents.

That way, if my RDD contains 10 tuples, then I get an RDD containing 10 dictionaries with 5 elements each (for example), and finally I get an RDD of 50 tuples.

I assume this has to be possible, but how? (Maybe the problem is that I don't know what this operation is called in English.)

asked Jun 23 '15 15:06

Roman Rdgz

2 Answers

My 2 cents:

There is a PairRDD function named "collectAsMap" that returns the contents of an RDD of (key, value) pairs as a dictionary.

Let me show you an example:

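# someRDD is assumed to be an existing RDD of (key, value) pairs
sample = someRDD.sample(False, 0.0001, 0)  # withReplacement, fraction, seed
sample_dict = sample.collectAsMap()
print(sample.collect())
print(sample_dict)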

[('hi', 4123.0)]
{'hi': 4123.0}
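One caveat worth keeping in mind: collectAsMap brings every pair back to the driver, and because the result is an ordinary dictionary, pairs that share a key collapse into a single entry. The sketch below imitates that in plain Python, assuming the collected pairs are available as a list. The `pairs` list is made-up data, not something from the question, and which value survives on a real cluster depends on the order in which the pairs are collected.

```python
# Stand-in for rdd.collect() on an RDD of (key, value) pairs (made-up data).
pairs = [("foo", 1), ("bar", 2), ("foo", 3)]

# Building a dict from the pairs keeps only the last value seen for each key.
as_map = dict(pairs)
print(as_map)  # {'foo': 3, 'bar': 2}
```

So this is mostly useful for small RDDs with unique keys, such as the sampled one above.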

See the PySpark documentation for RDD.collectAsMap.

Hope it helps! Regards!

answered Oct 16 '22 19:10

Leandro Mora

I guess what you want is just a flatMap:

dicts = sc.parallelize([{"foo": 1, "bar": 2}, {"foo": 3, "baz": -1, "bar": 5}])
dicts.flatMap(lambda x: x.items())

flatMap takes a function that maps each element of the RDD to an iterable, and then concatenates the results. Another name for the same type of operation outside the Spark context is mapcat:

>>> from toolz.curried import map, mapcat, concat, pipe
>>> from itertools import repeat
>>> pipe(range(4), mapcat(lambda i: repeat(i, i + 1)), list)
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3]

or going step by step:

>>> pipe(range(4), map(lambda i: repeat(i, i + 1)), concat, list)
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3]

The same thing using itertools.chain:

>>> from itertools import chain
>>> pipe((repeat(i, i + 1) for i in range(4)), chain.from_iterable, list)
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3]
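To tie this back to the question, here is a minimal sketch in plain Python of the whole pipeline: tuples, then dictionaries, then (key, value) pairs. `to_dict` is a hypothetical stand-in for the function mentioned in the question, and the input tuples are made up. With a real RDD the last step would presumably be `rdd.flatMap(lambda t: to_dict(t).items())`.

```python
from itertools import chain

def to_dict(t):
    # Hypothetical stand-in for the question's function: one tuple in, one dict out.
    name, value = t
    return {name + "_value": value, name + "_double": value * 2}

tuples = [("a", 1), ("b", 2), ("c", 3)]  # made-up input data

# map step: one dictionary per tuple
dicts = [to_dict(t) for t in tuples]

# flatMap step: concatenate the items of every dictionary
pairs = list(chain.from_iterable(d.items() for d in dicts))

print(len(dicts), len(pairs))  # 3 6
print(pairs[:2])               # [('a_value', 1), ('a_double', 2)]
```

Three tuples give three dictionaries of two entries each, so six pairs in total, which is the same counting as the 10 tuples to 50 pairs example in the question.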

answered Oct 16 '22 21:10

zero323
