I have loaded CSV data into a Spark DataFrame.
I need to slice this DataFrame into two different DataFrames, where each one contains a subset of the columns of the original.
How do I select a subset of columns from a Spark DataFrame into a new DataFrame?
You can select one or more columns of a Spark DataFrame by passing the column names you want to the select() function. select() is a transformation: since DataFrames are immutable, it returns a new DataFrame containing only the selected columns, while show() displays the DataFrame contents. select() can take single or multiple columns, nested columns, columns by index, columns from a list, or columns matched by a regular expression, and you can alias column names while selecting. The syntax is dataframe.select(parameter).show(), where dataframe is the DataFrame name and parameter is the column(s) to be selected. Relatedly, to convert a Spark DataFrame column to a list, first select() the column you want, then use the map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
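A minimal sketch of that column-to-list conversion, assuming a DataFrame df with a string column named "name":
import spark.implicits._
// Select the column, map each Row to a String, then collect to the driver
val names: Array[String] = df.select("name")
  .map(row => row.getString(0))
  .collect()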
If you want to split your DataFrame into two different ones, do two selects on it with the different columns you want.
val sourceDf = spark.read.csv(...)
val df1 = sourceDf.select("first column", "second column", "third column")
val df2 = sourceDf.select("fourth column", "fifth column")
Note that this of course means that sourceDf would be evaluated twice, so if it fits into distributed memory and you use most of its columns across both DataFrames, it might be a good idea to cache it. If it has many extra columns that you don't need, you can do a select on it first to keep only the columns you will need, so that it doesn't store all that extra data in memory.
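A minimal sketch of that caching pattern, reusing the placeholder column names above (the file path is hypothetical):
// Narrow the source to only the needed columns, then cache it so both
// selects reuse the in-memory data instead of re-reading the CSV
val sourceDf = spark.read.csv("/path/to/data.csv")
  .select("first column", "second column", "third column", "fourth column", "fifth column")
  .cache()
val df1 = sourceDf.select("first column", "second column", "third column")
val df2 = sourceDf.select("fourth column", "fifth column")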
There are multiple options (especially in Scala) to select a subset of columns of that DataFrame. The following lines show the options; most of them are documented in the ScalaDocs of Column:
import spark.implicits._
import org.apache.spark.sql.functions.{col, column, expr}
inputDf.select(col("colA"), col("colB"))
inputDf.select(inputDf.col("colA"), inputDf.col("colB"))
inputDf.select(column("colA"), column("colB"))
inputDf.select(expr("colA"), expr("colB"))
// only available in Scala
inputDf.select($"colA", $"colB")
inputDf.select('colA, 'colB) // makes use of Scala's Symbol
// selecting columns based on a given iterable of Strings
val selectedColumns: Seq[Column] = Seq("colA", "colB").map(c => col(c))
inputDf.select(selectedColumns: _*)
// Special cases
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
// select the first or last 2 columns
inputDf.selectExpr(inputDf.columns.take(2): _*)
inputDf.selectExpr(inputDf.columns.takeRight(2): _*)
The usage of $ is possible because Scala provides an implicit class that converts a StringContext into a ColumnName using the method $:
implicit class StringToColumn(val sc : scala.StringContext) extends scala.AnyRef {
def $(args : scala.Any*) : org.apache.spark.sql.ColumnName = { /* compiled code */ }
}
Typically, when you want to derive multiple DataFrames from one DataFrame, it might improve performance to persist the original DataFrame before creating the others. At the end you can unpersist the original DataFrame.
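A minimal sketch of that persist/unpersist pattern, using the inputDf and column names from the CSV example below:
import spark.implicits._
// Persist once, derive both DataFrames, then release the cached data
inputDf.persist()
val dfAB = inputDf.select($"colA", $"colB")
val dfC = inputDf.select($"colC")
// ... work with dfAB and dfC ...
inputDf.unpersist()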
Keep in mind that columns are not resolved at compile time but only when they are compared against the column names of your catalog, which happens during the analyzer phase of query execution. If you need stronger type safety, you could create a Dataset.
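A sketch of that Dataset approach, assuming a hypothetical case class matching the CSV schema below:
import spark.implicits._
// Hypothetical case class mirroring the three string columns of the CSV
case class Record(colA: String, colB: String, colC: String)
// as[Record] fails during analysis if the schema does not match
val inputDs = spark.read.format("csv")
  .option("header", "true")
  .load(csvFilePath)
  .as[Record]
// Typed projection: the field names are checked by the compiler
val pairs = inputDs.map(r => (r.colA, r.colB))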
For completeness, here is the CSV to try out the above code:
// csv file:
// colA,colB,colC
// 1,"foo","bar"
val inputDf = spark.read.format("csv").option("header", "true").load(csvFilePath)
// resulting DataFrame schema
root
|-- colA: string (nullable = true)
|-- colB: string (nullable = true)
|-- colC: string (nullable = true)
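Tying this back to the question, splitting that inputDf into two DataFrames by columns is then just two selects, for instance:
val dfAB = inputDf.select("colA", "colB")
val dfC = inputDf.select("colC")
dfAB.show()
dfC.show()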