 

How do I detect if a Spark DataFrame has a column

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column exists before calling .select?

Example JSON schema:

{   "a": {     "b": 1,     "c": 2   } } 

This is what I want to do:

potential_columns = Seq("b", "c", "d")
df = sqlContext.read.json(filename)
potential_columns.map(column => if(df.hasColumn(column)) df.select(s"a.$column"))

but I can't find a good function for hasColumn. The closest I've gotten is to test if the column is in this somewhat awkward array:

scala> df.select("a.*").columns
res17: Array[String] = Array(b, c)
asked Mar 09 '16 by ben


People also ask

How do you check columns in PySpark?

You can find all column names and data types (DataType) of a PySpark DataFrame by using df.dtypes and df.schema, and you can also retrieve the data type of a specific column by name using df.schema["name"].
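For instance (a minimal sketch, assuming the interactive spark session that the pyspark shell provides; the DataFrame and its column names are made up for illustration):

df = spark.createDataFrame([("alice", 1)], ["name", "age"])

df.dtypes                   # [('name', 'string'), ('age', 'bigint')]
df.schema                   # StructType with one StructField per column
df.schema["name"].dataType  # StringType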

How do I get a list of columns in Spark DataFrame?

You can get all the columns of a Spark DataFrame by using df.columns; it returns an array of column names as Array[String].
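A quick sketch (same assumed spark session and made-up DataFrame as above; in PySpark the result is a plain Python list rather than Array[String]):

df = spark.createDataFrame([("alice", 1)], ["name", "age"])
df.columns            # ['name', 'age']
"name" in df.columns  # True -- which already answers the top-level case of the question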

How do I show specific columns in Spark?

You can select single or multiple columns of a Spark DataFrame by passing the column names you want to select to the select() function. Since DataFrames are immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
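For example (again a sketch with the same made-up DataFrame):

df = spark.createDataFrame([("alice", 1)], ["name", "age"])
df.select("name").show()         # one column
df.select("name", "age").show()  # several columns; select returns a new DataFrame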

How do I check my Spark data frame?

You can visualize a Spark DataFrame in Jupyter notebooks by using the display(<dataframe-name>) function. The display() function is supported only on PySpark kernels. The Qviz framework supports 1000 rows and 100 columns. By default, the DataFrame is visualized as a table.


2 Answers

Just assume it exists and let it fail with Try. Plain and simple, and it supports arbitrary nesting:

import scala.util.Try
import org.apache.spark.sql.DataFrame

def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess

val df = sqlContext.read.json(sc.parallelize(
  """{"foo": [{"bar": {"foobar": 3}}]}""" :: Nil))

hasColumn(df, "foobar")         // Boolean = false
hasColumn(df, "foo")            // Boolean = true
hasColumn(df, "foo.bar")        // Boolean = true
hasColumn(df, "foo.bar.foobar") // Boolean = true
hasColumn(df, "foo.bar.foobaz") // Boolean = false

Or even simpler:

val columns = Seq(
  "foobar", "foo", "foo.bar", "foo.bar.foobar", "foo.bar.foobaz")

columns.flatMap(c => Try(df(c)).toOption)
// Seq[org.apache.spark.sql.Column] = List(
//   foo, foo.bar AS bar#12, foo.bar.foobar AS foobar#13)

Python equivalent:

from pyspark.sql.utils import AnalysisException
from pyspark.sql import Row


def has_column(df, col):
    try:
        df[col]
        return True
    except AnalysisException:
        return False


df = sc.parallelize([Row(foo=[Row(bar=Row(foobar=3))])]).toDF()

has_column(df, "foobar")
## False

has_column(df, "foo")
## True

has_column(df, "foo.bar")
## True

has_column(df, "foo.bar.foobar")
## True

has_column(df, "foo.bar.foobaz")
## False
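Tying this back to the question's a.b / a.c example, the same helper drives the select (a sketch; filename is the asker's JSON path and is not defined here):

potential_columns = ["b", "c", "d"]
df = sqlContext.read.json(filename)

# keep only the paths that actually resolve, then select them
existing = ["a." + c for c in potential_columns if has_column(df, "a." + c)]
df.select(existing)  # selects a.b and a.c; the missing a.d was filtered out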
answered Oct 14 '22 by zero323


Another option, which I normally use, is

df.columns.contains("column-name-to-check") 

This returns a Boolean. Note that df.columns only lists top-level columns, so unlike the Try approach above it won't match nested paths like a.b.
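The PySpark analogue is a plain membership test (a sketch; the nested variant reuses the asker's df.select("a.*") trick from the question):

"column-name-to-check" in df.columns  # top-level columns only
"b" in df.select("a.*").columns       # fields nested under the struct column a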

answered Oct 14 '22 by Jai Prakash