I have a Spark DataFrame df. Is there a way of selecting a subset of its columns using a list of the column names?
scala> df.columns
res0: Array[String] = Array("a", "b", "c", "d")
I know I can do something like df.select("b", "c"). But suppose I have a list containing a few column names, val cols = List("b", "c"); is there a way to pass this to df.select? df.select(cols) throws an error. I'm looking for something like df.select(*cols), as in Python.
You can select single or multiple columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
You can get all the columns of a Spark DataFrame with df.columns, which returns the column names as an Array[String].
In PySpark, select() is likewise used to select single columns, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. select() is a transformation, so it returns a new DataFrame with the selected columns.
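As a minimal sketch of that behaviour (assuming a SparkSession named spark; the column names are illustrative):

val df = spark.range(1).selectExpr("1 as a", "2 as b", "3 as c", "4 as d")

df.columns            // Array(a, b, c, d)
df.select("b", "c")   // returns a new DataFrame containing only b and c
  .show()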
Use df.select(cols.head, cols.tail: _*)
Let me know if it works :)
Explanation from @Ben:
The key is the method signature of select:
select(col: String, cols: String*)
The cols: String* parameter takes a variable number of arguments. : _* unpacks the collection so its elements can be passed to that parameter, very similar to unpacking in Python with *args. See here and here for other examples.
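If it helps, the same mechanism can be seen in plain Scala without Spark (the names below are purely illustrative):

// A varargs parameter behaves like select(col: String, cols: String*)
def greet(first: String, rest: String*): String =
  (first +: rest).mkString("Hello ", ", ", "!")

val names = List("b", "c", "d")
greet(names.head, names.tail: _*)   // "Hello b, c, d!"
// The : _* ascription expands the tail into the String* parameter,
// which is exactly what df.select(cols.head, cols.tail: _*) does.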
You can convert each String to a Spark Column like this:
import org.apache.spark.sql.functions._
df.select(cols.map(col): _*)
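For completeness, here is a small self-contained sketch of that approach. It assumes an existing SparkSession named spark, and the data and column names are only for illustration:

// Sketch only: assumes a SparkSession called `spark` is in scope
import spark.implicits._
import org.apache.spark.sql.functions.col

val data = Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d")
val cols = List("b", "c")

data.select(cols.map(col): _*).show()
// +---+---+
// |  b|  c|
// +---+---+
// |  2|  3|
// +---+---+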
Another option that I've just learnt.
import org.apache.spark.sql.functions.col
val columns = Seq[String]("col1", "col2", "col3")
val colNames = columns.map(name => col(name))
val selectedDf = df.select(colNames: _*)  // use a new name: a val cannot reference itself in compiled code