 

pyspark : Convert DataFrame to RDD[string]

I'd like to convert a pyspark.sql.dataframe.DataFrame to a pyspark.rdd.RDD of strings.

I converted a DataFrame df to an RDD data:

data = df.rdd
type(data)
## pyspark.rdd.RDD 

The new RDD data contains Row objects:

first = data.first()
type(first)
## pyspark.sql.types.Row

data.first()
Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')

I'd like to convert each Row to a list of strings, like the example below:

u'aaa',u'bbb',u'ccc',u'ddd'

Thanks

Toren asked Feb 17 '16 13:02


1 Answer

A PySpark Row is just a tuple and can be used as such. All you need here is a simple map (or flatMap if you also want to flatten the rows) with list:

data.map(list)

or if you expect different types:

data.map(lambda row: [str(c) for c in row])
zero323 answered Oct 15 '22 20:10