We have a specific need where I have to drop every column of a dataframe that contains only one unique value. The following is what we are doing:
val rawdata = spark.read.format("csv").option("header","true").option("inferSchema","true").load(filename)
Subsequently, to find the number of unique values in each column, we use the HyperLogLog++ algorithm supported in Spark:
import org.apache.spark.sql.functions.{approxCountDistinct, col}
val cd_cols = rawdata.select(rawdata.columns.map(column => approxCountDistinct(col(column)).alias(column)): _*)
The output is
scala> cd_cols.show
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
| ID|First Name|Last Name|Age|Attrition|BusinessTravel|DailyRate|Department|DistanceFromHome|Education|EducationField|EmployeeCount|
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
|1491| 172| 154| 43| 2| 3| 913| 3| 30| 1| 6| 1|
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
Notice that two columns have 1 as their unique-value count. I want to create another dataframe that has all columns except those two (Education and EmployeeCount).
I tried using a for loop, but was not very happy with it, and also tried
cd_cols.columns.filter(colName => cd_cols.filter(colName) <= 1)
but that does not work either.
Is there a smarter way to do this, please?
Thanks
Bala
You can try the following command:
df.selectExpr(df.first().getValuesMap[Long](df.columns).filter(_._2 != 1).keys.toSeq: _*).show
Here we first take the first row of the dataframe and convert it into a map of column name to value using getValuesMap with the column names, then keep only the columns whose value is not 1.
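For completeness, here is a minimal sketch of the same idea applied back to the original rawdata rather than to the counts dataframe; it reuses cd_cols from the question, and the names keepCols and trimmed are just illustrative. Since approxCountDistinct returns a Long, the values map is typed as Long here:

import org.apache.spark.sql.functions.col

// cd_cols has a single row holding the approximate distinct count of every column.
// Keep only the columns whose count is greater than 1.
val keepCols = cd_cols.first()
  .getValuesMap[Long](cd_cols.columns)
  .filter { case (_, distinctCount) => distinctCount > 1 }
  .keys
  .toSeq

// Select the surviving columns from the original dataframe.
val trimmed = rawdata.select(keepCols.map(col): _*)
trimmed.printSchema()

Using select with col(...) instead of selectExpr also sidesteps any SQL-parsing issues with column names that contain spaces, such as First Name.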