I have a Spark DataFrame df. Is there a way of selecting a subset of its columns using a list of the column names?
scala> df.columns
res0: Array[String] = Array("a", "b", "c", "d")
I know I can do something like df.select("b", "c"). But suppose I have a list containing a few column names, val cols = List("b", "c"); is there a way to pass this to df.select? df.select(cols) throws an error. I'm looking for something like df.select(*cols), as in Python.
You can select single or multiple columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
You can get all the columns of a Spark DataFrame with df.columns, which returns the column names as an Array[String].
In PySpark, select() is likewise used to select single columns, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. select() is a transformation, so it returns a new DataFrame with the selected columns.
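As a minimal sketch of that behaviour (assuming a SparkSession named spark; the column names are illustrative):

val df = spark.range(1).selectExpr("1 as a", "2 as b", "3 as c", "4 as d")

df.columns            // Array(a, b, c, d)
df.select("b", "c")   // returns a new DataFrame containing only b and c
  .show()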
Use df.select(cols.head, cols.tail: _*)
Let me know if it works :)
Explanation from @Ben:
The key is the method signature of select:
select(col: String, cols: String*)
The cols: String* parameter takes a variable number of arguments. : _* unpacks the collection so its elements can be passed to that parameter, very similar to unpacking in Python with *args. See here and here for other examples.
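If it helps, the same mechanism can be seen in plain Scala without Spark (the names below are purely illustrative):

// A varargs parameter behaves like select(col: String, cols: String*)
def greet(first: String, rest: String*): String =
  (first +: rest).mkString("Hello ", ", ", "!")

val names = List("b", "c", "d")
greet(names.head, names.tail: _*)   // "Hello b, c, d!"
// The : _* ascription expands the tail into the String* parameter,
// which is exactly what df.select(cols.head, cols.tail: _*) does.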
You can convert each String to a Spark Column like this:
import org.apache.spark.sql.functions._
df.select(cols.map(col): _*)
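For completeness, here is a small self-contained sketch of that approach. It assumes an existing SparkSession named spark, and the data and column names are only for illustration:

// Sketch only: assumes a SparkSession called `spark` is in scope
import spark.implicits._
import org.apache.spark.sql.functions.col

val data = Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d")
val cols = List("b", "c")

data.select(cols.map(col): _*).show()
// +---+---+
// |  b|  c|
// +---+---+
// |  2|  3|
// +---+---+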
Another option that I've just learnt.
import org.apache.spark.sql.functions.col
val columns = Seq[String]("col1", "col2", "col3")
val colNames = columns.map(name => col(name))
val selectedDf = df.select(colNames: _*)  // use a new name: a val cannot reference itself in compiled code