How to select all columns instead of hard coding each one?

A PySpark DataFrame is in the following format:

[image of the DataFrame]

To access just the stddev row of columns c1, c2, c3 I use:


df2 = sqlContext.sql("SELECT c1 AS f1, c2 as f2, c3 as f3 from table1")
ddd = df2.rdd.map(lambda x : (float(x.f1) , float(x.f2) , float(x.f3))).zipWithIndex().filter(lambda x: x[1] == 2).map(lambda x : x[0])
print type(ddd)
print type(ddd.collect())
print ddd.collect()

This prints :

<class 'pyspark.rdd.PipelinedRDD'>
<type 'list'>
[(0.7071067811865476, 0.7071067811865476, 0.7071067811865476)]

How can I select the stddev value for all columns (c1, c2, c3, c4, c5) and produce a result of the form [(0.7071067811865476, 0.7071067811865476, 0.7071067811865476.... without hard-coding each column into the SQL string? The number of columns can vary: 5 columns, 10 columns, and so on.

For 5 columns I could use "SELECT c1 AS f1, c2 as f2, c3 as f3, c4 as f4, c5 as f5 from table1", but is there a cleaner method than hard-coding each column in the SQL and then hard-coding it again when generating the RDD: df2.rdd.map(lambda x : (float(x.f1) , float(x.f2).....

My current solution does not work when the number of columns varies.

Selecting all columns can be done quickly with the asterisk, just as in SQL:
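A minimal sketch of the asterisk approach applied to the question's pipeline. It assumes the sqlContext and table1 from the question, and that every column of table1 can be converted with float() (a label column would have to be excluded first). The helper names row_to_floats and stddev_row are hypothetical; row index 2 is taken from the question's filter.

```python
def row_to_floats(row):
    """Convert every field of a row to float, however many fields it has."""
    return tuple(float(value) for value in row)


def stddev_row(sql_context, table_name="table1", row_index=2):
    """Return an RDD holding only the row at row_index, as a tuple of floats.

    sql_context is assumed to be the question's sqlContext; nothing here
    names individual columns, so the column count can vary freely.
    """
    df = sql_context.sql("SELECT * FROM {}".format(table_name))
    return (
        df.rdd
        .map(row_to_floats)                         # replaces float(x.f1), float(x.f2), ...
        .zipWithIndex()                             # pairs each row with its position
        .filter(lambda pair: pair[1] == row_index)  # keep only the wanted row
        .map(lambda pair: pair[0])                  # drop the index again
    )
```

Calling stddev_row(sqlContext).collect() should then give a one-element list shaped like the one printed in the question, with one float per column.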


You can also give the DataFrame an alias and use the select function:
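One way this might look, assuming df is an existing DataFrame. The function name select_all_via_alias and the alias "a" are made up for illustration.

```python
def select_all_via_alias(df, alias="a"):
    """Select every column of df through an alias, without listing any names."""
    # DataFrame.alias attaches a name to the DataFrame; "<alias>.*" then
    # expands to all of its columns, much like "a.*" in SQL.
    return df.alias(alias).select("{}.*".format(alias))
```

The result has the same columns as df, so the rest of the pipeline does not need to know how many there are.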

Why not use SQL aggregations directly? Either with agg

from pyspark.sql.functions import stddev

df.agg(*[stddev(c) for c in df.columns]).first()

where * is used for argument unpacking for agg(*exprs), or select:

df.select([stddev(c) for c in df.columns]).first()

To drop the names, convert the Row to a plain tuple:
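A short sketch of that conversion, assuming df is the DataFrame used with agg above. The names row_to_tuple, stddev_of_all_columns and FakeRow are hypothetical; FakeRow is only a stand-in so the conversion can be shown without a Spark session.

```python
from collections import namedtuple


def row_to_tuple(row):
    """Drop field names: a pyspark Row subclasses tuple, so tuple() is enough."""
    return tuple(row)


def stddev_of_all_columns(df):
    """Return the stddev of every column of df as a plain tuple."""
    # Imported here so that row_to_tuple can be tried without Spark installed.
    from pyspark.sql.functions import stddev
    return row_to_tuple(df.agg(*[stddev(c) for c in df.columns]).first())


# Stand-in for a Row, for illustration only.
FakeRow = namedtuple("FakeRow", ["c1", "c2"])
print(row_to_tuple(FakeRow(0.5, 1.5)))  # prints (0.5, 1.5)
```

Wrapping the tuple in a list, [stddev_of_all_columns(df)], would match the list-of-one-tuple shape shown in the question.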



