
How to select all columns instead of hard coding each one?

A PySpark DataFrame is in the following format:

[image: df.describe() output showing the summary rows count, mean, stddev, min, max for columns c1, c2, c3]

To access just the stddev row for columns c1, c2, and c3, I use:

df.describe().createOrReplaceTempView("table1")

df2 = sqlContext.sql("SELECT c1 AS f1, c2 AS f2, c3 AS f3 FROM table1")

# row index 2 of the describe() output is the stddev row (after count and mean)
ddd = df2.rdd.map(lambda x: (float(x.f1), float(x.f2), float(x.f3))) \
             .zipWithIndex() \
             .filter(lambda x: x[1] == 2) \
             .map(lambda x: x[0])
print type(ddd)
print type(ddd.collect())
print ddd.collect()

This prints:

<class 'pyspark.rdd.PipelinedRDD'>
<type 'list'>
[(0.7071067811865476, 0.7071067811865476, 0.7071067811865476)]

How can I select the stddev value for all columns (c1, c2, c3, c4, c5) and generate the same kind of result, [(0.7071067811865476, 0.7071067811865476, 0.7071067811865476...., without hard coding each column into the SQL string? The number of columns may vary: 5, 10 columns, etc.

For 5 columns I could use "SELECT c1 AS f1, c2 as f2, c3 as f3, c4 as f4, c5 as f5 from table1", but is there a cleaner method than hard coding each column in the SQL and then correspondingly hard coding each field when building the RDD: df2.rdd.map(lambda x : (float(x.f1) , float(x.f2)....? My current approach does not work when the number of columns changes.
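One way to avoid the hard coding is to build both the SQL string and the float conversion from df.columns. A minimal sketch, assuming the same df, sqlContext, and "table1" temp view as above:

# build "SELECT c1, c2, ... FROM table1" from the DataFrame's own column names
query = "SELECT {0} FROM table1".format(", ".join(df.columns))
df2 = sqlContext.sql(query)

# convert every field of the stddev row (index 2) to float, however many columns there are
ddd = (df2.rdd
          .zipWithIndex()
          .filter(lambda x: x[1] == 2)
          .map(lambda x: tuple(float(v) for v in x[0])))
print(ddd.collect())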

asked Feb 22 '17 by blue-sky

2 Answers

Selecting all columns can be done with the asterisk, just as in SQL:

df.select(df['*'])

You can also alias the DataFrame and use select:

df.alias("a").select("a.*")
answered by Pengshe


Why not use SQL aggregations directly? Either with agg

from pyspark.sql.functions import stddev

# one stddev expression per column, unpacked into agg's arguments
df.agg(*[stddev(c) for c in df.columns]).first()

where * unpacks the list of expressions into agg(*exprs), or with select:

df.select([stddev(c) for c in df.columns]).first()

To drop the field names, convert the Row to a plain tuple:

tuple(df.select(...).first())

or

df.select(...).rdd.map(tuple).first()
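Putting these pieces together for the original goal, a minimal sketch assuming every column of df is numeric:

from pyspark.sql.functions import stddev

# one stddev expression per column; first() returns a Row of floats
stats = tuple(df.select([stddev(c) for c in df.columns]).first())
print(stats)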
answered by zero323