How to convert a pyspark.rdd.PipelinedRDD to a DataFrame without using collect() in PySpark?

Tags:

python-3.x

apache-spark

rdd

apache-spark-sql

pyspark

I have a pyspark.rdd.PipelinedRDD (Rdd1). When I call Rdd1.collect(), it returns the following:

 [(10, {3: 3.616726727464709, 4: 2.9996439803387602, 5: 1.6767412921625855}),
 (1, {3: 2.016527311459324, 4: -1.5271512313750577, 5: 1.9665475696370045}),
 (2, {3: 6.230272144805092, 4: 4.033642544526678, 5: 3.1517805604906313}),
 (3, {3: -0.3924680103722977, 4: 2.9757316477407443, 5: -1.5689126834176417})]

Now I want to convert this pyspark.rdd.PipelinedRDD to a DataFrame without using the collect() method.

My final DataFrame should look like the one below. df.show() should print:

+----------+-------+-------------------+
|CId       |IID    |Score              |
+----------+-------+-------------------+
|10        |4      |2.9996439803387602 |
|10        |5      |1.6767412921625855 |
|10        |3      |3.616726727464709  |
|1         |4      |-1.5271512313750577|
|1         |5      |1.9665475696370045 |
|1         |3      |2.016527311459324  |
|2         |4      |4.033642544526678  |
|2         |5      |3.1517805604906313 |
|2         |3      |6.230272144805092  |
|3         |4      |2.9757316477407443 |
|3         |5      |-1.5689126834176417|
|3         |3      |-0.3924680103722977|
+----------+-------+-------------------+

I can achieve this by calling collect() on the RDD, iterating over the result, and finally building a DataFrame.

However, I now want to convert the pyspark.rdd.PipelinedRDD to a DataFrame without calling collect() at all.

How can I achieve this?


asked Jan 05 '18 09:01

Sai

1 Answer

You want to do two things here:

1. Flatten your data.
2. Put it into a DataFrame.

One way to do it is as follows:

First, let us flatten the dictionary:

rdd2 = Rdd1.flatMapValues(lambda d: list(d.items()))

If you collect rdd2 (for inspection only; the conversion itself does not need it), you get something like this:

[(10, (3, 3.616726727464709)), (10, (4, 2.9996439803387602)), ...
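
To see what the flattening step does without a cluster, here is a minimal plain-Python sketch. The helper flat_map_values is hypothetical and only mimics the behaviour of RDD.flatMapValues on a local list; it is not part of PySpark, and the sample data is a subset of the question's output.

```python
# Plain-Python stand-in for Rdd1.flatMapValues(...); no Spark involved.
sample = [
    (10, {3: 3.616726727464709, 4: 2.9996439803387602, 5: 1.6767412921625855}),
    (1, {3: 2.016527311459324, 4: -1.5271512313750577, 5: 1.9665475696370045}),
]


def flat_map_values(pairs, func):
    """Mimic RDD.flatMapValues: keep each key, emit one pair per item func returns."""
    return [(key, item) for key, value in pairs for item in func(value)]


# Each dictionary becomes several (key, (inner_key, score)) pairs.
flattened = flat_map_values(sample, lambda d: list(d.items()))
print(flattened[:2])
```

Each (CId, dict) record turns into one (CId, (IID, Score)) pair per dictionary entry, which is the shape shown above.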

Then we can format the data and turn it into a dataframe:

(rdd2.map(lambda x: (x[0], x[1][0], x[1][1]))
     .toDF(["CId", "IID", "Score"])
     .show())

which gives you this:

+---+---+-------------------+
|CId|IID|              Score|
+---+---+-------------------+
| 10|  3|  3.616726727464709|
| 10|  4| 2.9996439803387602|
| 10|  5| 1.6767412921625855|
|  1|  3|  2.016527311459324|
|  1|  4|-1.5271512313750577|
|  1|  5| 1.9665475696370045|
|  2|  3|  6.230272144805092|
|  2|  4|  4.033642544526678|
|  2|  5| 3.1517805604906313|
|  3|  3|-0.3924680103722977|
|  3|  4| 2.9757316477407443|
|  3|  5|-1.5689126834176417|
+---+---+-------------------+
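
As a variation, the two steps can be merged into a single flatMap, with an explicit schema so that Spark does not have to infer the column types from the data. This is only a sketch under assumptions: the helper names flatten_record and to_score_dataframe are made up for illustration, spark is assumed to be an existing SparkSession, and the column types (long, long, double) are guessed from the sample output in the question.

```python
def flatten_record(record):
    """Turn (cid, {iid: score, ...}) into a list of (cid, iid, score) rows."""
    cid, scores = record
    return [(cid, iid, float(score)) for iid, score in scores.items()]


def to_score_dataframe(spark, rdd):
    """Build a CId/IID/Score DataFrame from an RDD of (cid, dict) pairs."""
    # Imported here so flatten_record stays usable without Spark installed.
    from pyspark.sql.types import DoubleType, LongType, StructField, StructType

    # Assumed column types, based on the values shown in the question.
    schema = StructType([
        StructField("CId", LongType(), False),
        StructField("IID", LongType(), False),
        StructField("Score", DoubleType(), False),
    ])
    # flatMap runs on the executors; nothing is collected to the driver here.
    return spark.createDataFrame(rdd.flatMap(flatten_record), schema)
```

Usage would be to_score_dataframe(spark, Rdd1).show(). The row order of the result depends on how the RDD is partitioned, so add an orderBy if a specific order matters.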


answered Oct 17 '22 06:10

Oli
