A PySpark DataFrame is in the following format:
To access just the stddev row of columns c1, c2, c3 I use:
df.describe().createOrReplaceTempView("table1")
df2 = sqlContext.sql("SELECT c1 AS f1, c2 as f2, c3 as f3 from table1")
ddd = (df2.rdd
          .map(lambda x: (float(x.f1), float(x.f2), float(x.f3)))
          .zipWithIndex()
          .filter(lambda x: x[1] == 2)
          .map(lambda x: x[0]))
print type(ddd)
print type(ddd.collect())
print ddd.collect()
This prints:
<class 'pyspark.rdd.PipelinedRDD'>
<type 'list'>
[(0.7071067811865476, 0.7071067811865476, 0.7071067811865476)]
How can I select the stddev value for all columns c1, c2, c3, c4, c5 and generate the same datatype, [(0.7071067811865476, 0.7071067811865476, 0.7071067811865476, ...)], for those selections instead of hard-coding each column into the SQL string? The number of columns may vary: 5 columns, 10 columns, etc.
To accomplish this for 5 columns I could use "SELECT c1 AS f1, c2 as f2, c3 as f3, c4 as f4, c5 as f5 from table1", but is there a cleaner method than hard-coding each column in the SQL string and then hard-coding the corresponding value again when generating the RDD: df2.rdd.map(lambda x : (float(x.f1) , float(x.f2)....? My current approach does not work when the number of columns varies.
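Spelled out, the hard-coded 5-column version I am trying to avoid would look roughly like this (same df and sqlContext as above):
df.describe().createOrReplaceTempView("table1")
df2 = sqlContext.sql("SELECT c1 AS f1, c2 as f2, c3 as f3, c4 as f4, c5 as f5 from table1")
ddd = (df2.rdd
          .map(lambda x: (float(x.f1), float(x.f2), float(x.f3), float(x.f4), float(x.f5)))
          .zipWithIndex()
          .filter(lambda x: x[1] == 2)   # row 2 of describe() output is the stddev row
          .map(lambda x: x[0]))
print ddd.collect()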
Selecting all columns can be done quickly using the asterisk, just as in SQL:
df.select(df['*'])
You can also call alias() on a DataFrame and use the select() function:
df.alias("a").select("a.*")
Why not use SQL aggregations directly? Either with agg:
from pyspark.sql.functions import stddev
df.agg(*[stddev(c) for c in df.columns]).first()
where * is used for argument unpacking (agg(*exprs)), or with select:
df.select([stddev(c) for c in df.columns]).first()
To drop the field names, convert the Row to a plain tuple:
tuple(df.select(...).first())
or
df.select(...).rdd.map(tuple).first()
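A minimal end-to-end sketch (the toy data, column names, and spark session below are placeholders, not part of the original question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import stddev

spark = SparkSession.builder.getOrCreate()

# toy data; the column list can be any length
df = spark.createDataFrame([(1.0, 2.0, 3.0), (2.0, 3.0, 4.0)], ["c1", "c2", "c3"])

# one stddev expression per column, however many columns there are
row = df.agg(*[stddev(c) for c in df.columns]).first()
print(tuple(row))
# (0.7071067811865476, 0.7071067811865476, 0.7071067811865476)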