df1.printSchema()
prints out the column names and the data type that they possess.
df1.drop($"colName")
will drop columns by their name.
Is there a way to adapt this command to drop by the data-type instead?
The Spark DataFrame provides the drop() method to drop the column or the field from the DataFrame or the Dataset. The drop() method is also used to remove the multiple columns from the Spark DataFrame or the Database. The Dataset is the distributed collection of the data.
You can select the single or multiple columns of the Spark DataFrame by passing the column names you wanted to select to the select() function. Since DataFrame is immutable, this creates a new DataFrame with a selected columns. show() function is used to show the DataFrame contents.
If you are looking to drop specific columns in the dataframe based on the types, then the below snippet would help. In this example, I have a dataframe with two columns of type String and Int respectivly. I am dropping my String (all fields of type String would be dropped) field from the schema based on its type.
import sqlContext.implicits._
val df = sc.parallelize(('a' to 'l').map(_.toString) zip (1 to 10)).toDF("c1","c2")
df.schema.fields
.collect({case x if x.dataType.typeName == "string" => x.name})
.foldLeft(df)({case(dframe,field) => dframe.drop(field)})
The schema of the newDf
is org.apache.spark.sql.DataFrame = [c2: int]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With