How to select columns in a PySpark DataFrame from a variable in Python

I have a PySpark DataFrame in Python, created as follows:

from pyspark.sql.functions import col
dataset = sqlContext.range(0, 100).select((col("id") % 3).alias("key"))

The column name is key, and I would like to select this column using a variable.

myvar = "key"

Now I want to select this column using the myvar variable, perhaps in a select statement.

I tried this:

dataset.createOrReplaceTempView("dataset")
spark.sql(" select $myvar from dataset ").show()

but it returns an error:

no viable alternative at input 'select $'(line 1, pos 8)

How do I achieve this in PySpark?

Note that I may have different columns in the future, and I want to pass more than one variable, or perhaps a list, into the SELECT clause.

Regressor asked Sep 13 '19 03:09


1 Answer

dataset.select(myvar) selects a single column, using the name stored in the variable.

.select also accepts a list of column names: dataset.select([myvar, mySecondVar])
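The attempt in the question fails because Spark SQL does not expand $myvar inside the query string; the column name has to be put into the string by Python before spark.sql sees it. Below is a minimal sketch of both routes. The helper names (quote_identifier, build_select_sql, select_columns) are made up for illustration and are not part of PySpark, and the usage comments assume an active SparkSession named spark and the dataset view from the question.

```python
def quote_identifier(name):
    # Backtick-quote a column name for Spark SQL; a literal backtick is doubled.
    return "`" + name.replace("`", "``") + "`"


def build_select_sql(table, columns):
    # Build the query text in Python, because Spark SQL itself does not
    # substitute $myvar-style placeholders inside the string.
    column_list = ", ".join(quote_identifier(c) for c in columns)
    return f"SELECT {column_list} FROM {table}"


def select_columns(df, columns):
    # DataFrame API route: unpack the list of names into select().
    return df.select(*columns)


# Usage with the question's objects (assumes an active SparkSession `spark`
# and that createOrReplaceTempView("dataset") has already been called):
#   myvar = "key"
#   select_columns(dataset, [myvar]).show()
#   spark.sql(build_select_sql("dataset", [myvar])).show()
```

The DataFrame route is usually the simpler choice when the column names arrive as a Python list, since no query string has to be assembled. If the names come from untrusted input, building SQL text this way should be treated with care.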

Daniel answered Jan 04 '23 18:01