I'd like to convert a pyspark.sql.dataframe.DataFrame to pyspark.rdd.RDD[String]. I converted a DataFrame df to an RDD data:
data = df.rdd
type(data)
## pyspark.rdd.RDD
The new RDD data contains Row objects:
first = data.first()
type(first)
## pyspark.sql.types.Row
data.first()
Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')
I'd like to convert each Row to a list of strings, like the example below:
u'aaa',u'bbb',u'ccc',u'ddd'
Thanks
To convert an array to a string, PySpark SQL provides a built-in function concat_ws(), which takes a delimiter of your choice as the first argument and an array column (type Column) as the second. To use concat_ws(), import it from pyspark.sql.functions.
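As a rough sketch of what concat_ws() does per row (a plain-Python stand-in for illustration, not the Spark API itself):

```python
def concat_ws_like(delimiter, values):
    """Mimic Spark's concat_ws for one row's array of strings:
    join with the delimiter, skipping None values (concat_ws
    ignores NULLs, unlike plain str.join)."""
    return delimiter.join(v for v in values if v is not None)

print(concat_ws_like(",", ["aaa", "bbb", None, "ddd"]))  # aaa,bbb,ddd
```

Note this collapses each row's array into one string, which is different from the list-of-strings output the question asks for.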
Try x = all_coord_iso_rdd.take(4). Then print(type(x)); you'll see that it's a list (of tuple-like Rows). Then just convert each one to strings.
A PySpark Row is just a tuple and can be used as such. All you need here is a simple map (or flatMap, if you want to flatten the rows as well) with list:
data.map(list)
or if you expect different types:
data.map(lambda row: [str(c) for c in row])