 

Unpacking a list to select multiple columns from a Spark data frame

I have a Spark data frame df. Is there a way of sub-selecting a few columns using a list of these columns?

scala> df.columns
res0: Array[String] = Array("a", "b", "c", "d")

I know I can do something like df.select("b", "c"). But suppose I have a list containing a few column names, val cols = List("b", "c") — is there a way to pass this to df.select? df.select(cols) throws an error. I'm looking for something like df.select(*cols) in Python.

Ben, asked Oct 05 '22

People also ask

How do I select multiple columns in Spark data frame?

You can select single or multiple columns of a Spark DataFrame by passing the column names you want to select to the select() function. Since DataFrames are immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.

How do I get a list of columns in Spark DataFrame?

You can get all the columns of a Spark DataFrame by using df.columns, which returns an array of column names as Array[String].

How do you select a list of columns in PySpark?

In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, or nested columns from a DataFrame. select() is a transformation, so it returns a new DataFrame with the selected columns.


3 Answers

Use df.select(cols.head, cols.tail: _*)

Let me know if it works :)

Explanation from @Ben:

The key is the method signature of select:

select(col: String, cols: String*)

The cols: String* entry takes a variable number of arguments. : _* unpacks the list so its elements can be passed to this varargs parameter — very similar to unpacking in Python with *args. See here and here for other examples.
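The mechanism can be seen in plain Scala without Spark. The joinCols method below is a made-up stand-in whose shape mirrors select(col: String, cols: String*):

```scala
// A method with a required first argument and a varargs tail,
// mirroring the shape of select(col: String, cols: String*).
def joinCols(first: String, rest: String*): String =
  (first +: rest).mkString(", ")

val cols = List("b", "c", "d")

// Pass the list's head as the required argument, and splat the tail
// into the varargs parameter with : _*
val out = joinCols(cols.head, cols.tail: _*)
println(out) // b, c, d
```

This is exactly why the answer uses cols.head for the first parameter and cols.tail: _* for the rest.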

Shagun Sodhani, answered Oct 07 '22


You can convert each String to a Spark Column like this:

import org.apache.spark.sql.functions._
df.select(cols.map(col): _*)
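The map-then-splat pattern can be sketched in plain Scala. Col, col, and select below are hypothetical stand-ins for Spark's Column, functions.col, and select(cols: Column*), used only to show the types lining up:

```scala
// Stand-in for Spark's Column type, for illustration only.
case class Col(name: String)

// Stand-in for org.apache.spark.sql.functions.col.
def col(name: String): Col = Col(name)

// Stand-in for select(cols: Column*); here it just echoes the names.
def select(cols: Col*): Seq[String] = cols.map(_.name)

val names = List("b", "c")

// Map each String to a Col, then unpack the list into the varargs slot.
val selected = select(names.map(col): _*)
```

Because this overload of select takes only a varargs parameter, no head/tail split is needed — the whole mapped list is splatted at once.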
Kshitij Kulshrestha, answered Oct 07 '22


Another option that I've just learnt.

import org.apache.spark.sql.functions.col
val columns = Seq[String]("col1", "col2", "col3")
val colNames = columns.map(name => col(name))
val selected = df.select(colNames: _*)
vEdwardpc, answered Oct 07 '22