
Dropping columns by data type in Scala Spark

df1.printSchema() prints out the column names and their data types.

df1.drop($"colName") will drop columns by their name.

Is there a way to adapt this command to drop columns by their data type instead?

asked Jan 29 '17 by Leothorn

People also ask

How do you drop a column in Scala Spark?

The Spark DataFrame provides the drop() method to drop a column or field from a DataFrame or Dataset. The drop() method can also be used to remove multiple columns from a DataFrame or Dataset. A Dataset is a distributed collection of data.
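For instance, assuming a DataFrame df with placeholder column names colA and colB, a minimal sketch looks like this:

// drop a single column by name
val withoutA = df.drop("colA")

// drop several columns at once; drop() accepts a varargs list of names
val trimmed = df.drop("colA", "colB")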

How do I select multiple columns in a Spark DataFrame?

You can select single or multiple columns of a Spark DataFrame by passing the column names you want to select to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame's contents.
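A minimal sketch, again assuming placeholder column names colA and colB:

// select() returns a new DataFrame containing only the listed columns
val subset = df.select("colA", "colB")

// show() displays the contents of the resulting DataFrame
subset.show()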


1 Answer

If you are looking to drop specific columns from a DataFrame based on their types, the snippet below should help. In this example, I have a DataFrame with two columns, of type String and Int respectively. I drop the String column (all fields of type String would be dropped) based on its data type.

// assumes a spark-shell session where sc and sqlContext are already defined
import sqlContext.implicits._

// sample DataFrame: c1 is a String column, c2 is an Int column
val df = sc.parallelize(('a' to 'l').map(_.toString) zip (1 to 10)).toDF("c1", "c2")

// collect the names of all string-typed fields, then drop them one by one
val newDf = df.schema.fields
    .collect({ case x if x.dataType.typeName == "string" => x.name })
    .foldLeft(df)({ case (dframe, field) => dframe.drop(field) })

The schema of the resulting newDf is org.apache.spark.sql.DataFrame = [c2: int]
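An alternative sketch (assuming Spark 2.x, where drop() accepts multiple column names): df.dtypes returns (columnName, typeName) pairs, so the string-typed columns can be collected and dropped in a single call.

// dtypes pairs each column name with the string form of its DataType, e.g. "StringType"
val stringCols = df.dtypes.collect { case (name, "StringType") => name }
val dropped = df.drop(stringCols: _*)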

answered Sep 26 '22 by rogue-one