I have a DataFrame with the schema:

root
 |-- label: string (nullable = true)
 |-- features: struct (nullable = true)
 |    |-- feat1: string (nullable = true)
 |    |-- feat2: string (nullable = true)
 |    |-- feat3: string (nullable = true)
While I am able to filter the DataFrame using

val data = rawData.filter(!(rawData("features.feat1") <=> "100"))
I am unable to drop the column using

val data = rawData.drop("features.feat1")
Is there something I am doing wrong here? I also tried (unsuccessfully) drop(rawData("features.feat1")), though it does not make much sense to do so.
Thanks in advance,
Nikhil
You can always get all columns with the DataFrame's .columns method, remove the unwanted column from the sequence, and do select(myColumns: _*).
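A minimal sketch of that approach, assuming a top-level column named label is the one to remove (the val names are illustrative):

val remaining = rawData.columns.filterNot(_ == "label")
val trimmed = rawData.select(remaining.map(rawData(_)): _*)

Note that this only handles top-level columns; a nested field like features.feat1 does not appear in .columns, which is the crux of the question.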
Spark's DataFrame provides the drop() method to remove one or more columns from a DataFrame or Dataset. However, drop() only resolves top-level column names; it does not interpret a dot-separated path such as "features.feat1" as a struct field, which is why the drop() call above returns the DataFrame unchanged.
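For what it's worth, on Spark 3.1 and later there is a built-in way to do this: Column.dropFields removes fields from a struct column. A minimal sketch, assuming Spark 3.1+:

import org.apache.spark.sql.functions.col

// Column.dropFields was added in Spark 3.1; on older versions,
// use one of the hand-rolled approaches below.
val data = rawData.withColumn("features", col("features").dropFields("feat1"))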
It is just a programming exercise, but you can try something like this:
import org.apache.spark.sql.{DataFrame, Column}
import org.apache.spark.sql.types.{StructType, StructField}
import org.apache.spark.sql.{functions => f}
import scala.util.Try

case class DFWithDropFrom(df: DataFrame) {

  // Find the top-level field with the given name.
  def getSourceField(source: String): Try[StructField] = {
    Try(df.schema.fields.filter(_.name == source).head)
  }

  // Succeeds only if that field is a struct.
  def getType(sourceField: StructField): Try[StructType] = {
    Try(sourceField.dataType.asInstanceOf[StructType])
  }

  // Rebuild the struct column from the surviving field names.
  def genOutputCol(names: Array[String], source: String): Column = {
    f.struct(names.map(x => f.col(source).getItem(x).alias(x)): _*)
  }

  // Drop the given fields from the struct column `source`;
  // if anything goes wrong, return the DataFrame unchanged.
  def dropFrom(source: String, toDrop: Array[String]): DataFrame = {
    getSourceField(source)
      .flatMap(getType)
      .map(_.fieldNames.diff(toDrop))
      .map(genOutputCol(_, source))
      .map(df.withColumn(source, _))
      .getOrElse(df)
  }
}
Example usage:
scala> case class features(feat1: String, feat2: String, feat3: String)
defined class features

scala> case class record(label: String, features: features)
defined class record

scala> val df = sc.parallelize(Seq(record("a_label", features("f1", "f2", "f3")))).toDF
df: org.apache.spark.sql.DataFrame = [label: string, features: struct<feat1:string,feat2:string,feat3:string>]

scala> DFWithDropFrom(df).dropFrom("features", Array("feat1")).show
+-------+--------+
|  label|features|
+-------+--------+
|a_label| [f2,f3]|
+-------+--------+

scala> DFWithDropFrom(df).dropFrom("foobar", Array("feat1")).show
+-------+----------+
|  label|  features|
+-------+----------+
|a_label|[f1,f2,f3]|
+-------+----------+

scala> DFWithDropFrom(df).dropFrom("features", Array("foobar")).show
+-------+----------+
|  label|  features|
+-------+----------+
|a_label|[f1,f2,f3]|
+-------+----------+
Add an implicit conversion and you're good to go.
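For example, a minimal sketch of such a conversion (the object name here is illustrative, not part of the original answer):

import scala.language.implicitConversions

object DFWithDropFromSyntax {
  // Lets any DataFrame be used where a DFWithDropFrom is expected.
  implicit def toDFWithDropFrom(df: DataFrame): DFWithDropFrom = DFWithDropFrom(df)
}

// After `import DFWithDropFromSyntax._` you can call
// df.dropFrom("features", Array("feat1")) directly on a DataFrame.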
This version allows you to remove nested columns at any level:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructType, DataType}

/**
 * Various Spark utilities and extensions of DataFrame
 */
object DataFrameUtils {

  private def dropSubColumn(col: Column, colType: DataType, fullColName: String, dropColName: String): Option[Column] = {
    if (fullColName.equals(dropColName)) {
      None
    } else {
      colType match {
        case colType: StructType =>
          if (dropColName.startsWith(s"${fullColName}.")) {
            Some(struct(
              colType.fields
                .flatMap(f =>
                  dropSubColumn(col.getField(f.name), f.dataType, s"${fullColName}.${f.name}", dropColName) match {
                    case Some(x) => Some(x.alias(f.name))
                    case None => None
                  }) : _*))
          } else {
            Some(col)
          }
        case other => Some(col)
      }
    }
  }

  protected def dropColumn(df: DataFrame, colName: String): DataFrame = {
    df.schema.fields
      .flatMap(f => {
        if (colName.startsWith(s"${f.name}.")) {
          dropSubColumn(col(f.name), f.dataType, f.name, colName) match {
            case Some(x) => Some((f.name, x))
            case None => None
          }
        } else {
          None
        }
      })
      .foldLeft(df.drop(colName)) {
        case (df, (colName, column)) => df.withColumn(colName, column)
      }
  }

  /**
   * Extended version of DataFrame that allows to operate on nested fields
   */
  implicit class ExtendedDataFrame(df: DataFrame) extends Serializable {
    /**
     * Drops nested field from DataFrame
     *
     * @param colName Dot-separated nested field name
     */
    def dropNestedColumn(colName: String): DataFrame = {
      DataFrameUtils.dropColumn(df, colName)
    }
  }
}
Usage:
import DataFrameUtils._

df.dropNestedColumn("a.b.c.d")
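Applied to the schema from the question, that would look like this (a sketch, reusing the rawData name from above):

import DataFrameUtils._

val data = rawData.dropNestedColumn("features.feat1")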