We have a specific need where I have to drop every column of a dataframe that contains only one unique value. The following is what we are doing:
val rawdata = spark.read.format("csv").option("header","true").option("inferSchema","true").load(filename)
Subsequently, to find the number of unique values in each column, we use the HyperLogLog++ algorithm supported in Spark:
import org.apache.spark.sql.functions.{approxCountDistinct, col}
val cd_cols = rawdata.select(rawdata.columns.map(column => approxCountDistinct(col(column)).alias(column)): _*)
The output is
scala> cd_cols.show
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
| ID|First Name|Last Name|Age|Attrition|BusinessTravel|DailyRate|Department|DistanceFromHome|Education|EducationField|EmployeeCount|
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
|1491| 172| 154| 43| 2| 3| 913| 3| 30| 1| 6| 1|
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
Notice that two columns have 1 as their unique-value count. I want to create another dataframe that has all columns except those two (Education and EmployeeCount).
I tried using a for loop, but was not very happy with it, and also tried
cd_cols.columns.filter(colName => cd_cols.filter(colName) <= 1)
but that does not work either.
Is there a smarter way to do this, please?
Thanks
Bala
You can try the following command:
df.selectExpr(df.first().getValuesMap[Long](df.columns).filter(_._2 != 1).keys.toSeq: _*).show
Here we first take the first row of the dataframe and convert it into a map of column name to value using getValuesMap with the column names, then keep only the columns whose value is not 1.
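For completeness, here is a minimal sketch of the same idea applied back to the original rawdata rather than to the counts dataframe; it reuses cd_cols from the question, and the names keepCols and trimmed are just illustrative. Since approxCountDistinct returns a Long, the values map is typed as Long here:

import org.apache.spark.sql.functions.col

// cd_cols has a single row holding the approximate distinct count of every column.
// Keep only the columns whose count is greater than 1.
val keepCols = cd_cols.first()
  .getValuesMap[Long](cd_cols.columns)
  .filter { case (_, distinctCount) => distinctCount > 1 }
  .keys
  .toSeq

// Select the surviving columns from the original dataframe.
val trimmed = rawdata.select(keepCols.map(col): _*)
trimmed.printSchema()

Using select with col(...) instead of selectExpr also sidesteps any SQL-parsing issues with column names that contain spaces, such as First Name.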