I have an RDD whose partitions contain elements (pandas dataframes, as it happens) that can easily be turned into lists of rows. Think of it as looking something like this
rows_list = []
for word in 'quick brown fox'.split():
rows = []
for i,c in enumerate(word):
x = ord(c) + i
row = pyspark.sql.Row(letter=c, number=i, importance=x)
rows.append(row)
rows_list.append(rows)
rdd = sc.parallelize(rows_list)
rdd.take(2)
which gives
[[Row(importance=113, letter='q', number=0),
Row(importance=118, letter='u', number=1),
Row(importance=107, letter='i', number=2),
Row(importance=102, letter='c', number=3),
Row(importance=111, letter='k', number=4)],
[Row(importance=98, letter='b', number=0),
Row(importance=115, letter='r', number=1),
Row(importance=113, letter='o', number=2),
Row(importance=122, letter='w', number=3),
Row(importance=114, letter='n', number=4)]]
I want to turn it into a Spark DataFrame. I hoped that I could just do
rdd.toDF()
but that gives a useless structure
DataFrame[_1: struct<importance:bigint,letter:string,number:bigint>,
_2: struct<importance:bigint,letter:string,number:bigint>,
_3: struct<importance:bigint,letter:string,number:bigint>,
_4: struct<importance:bigint,letter:string,number:bigint>,
_5: struct<importance:bigint,letter:string,number:bigint>]
What I really want is a 3 column DataFrame, such as this
desired_df = sql_context.createDataFrame(sum(rows_list, []))
so that I can perform operations like
desired_df.agg(pyspark.sql.functions.sum('number')).take(1)
and get the answer 23.
What is the right way to go about this?
You have a RDD of lists of rows while you need RDD of rows; you can flatten rdd
with flatMap
and then convert it to data frame:
rdd.flatMap(lambda x: x).toDF().show()
+----------+------+------+
|importance|letter|number|
+----------+------+------+
| 113| q| 0|
| 118| u| 1|
| 107| i| 2|
| 102| c| 3|
| 111| k| 4|
| 98| b| 0|
| 115| r| 1|
| 113| o| 2|
| 122| w| 3|
| 114| n| 4|
| 102| f| 0|
| 112| o| 1|
| 122| x| 2|
+----------+------+------+
import pyspark.sql.functions as F
rdd.flatMap(lambda x: x).toDF().agg(F.sum('number')).show()
+-----------+
|sum(number)|
+-----------+
| 23|
+-----------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With