Is there a way to remove the columns of a spark dataFrame that contain only null values ? (I am using scala and Spark 1.6.2)
At the moment I am doing this:
var validCols: List[String] = List()
for (col <- df_filtered.columns){
val count = df_filtered
.select(col)
.distinct
.count
println(col, count)
if (count >= 2){
validCols ++= List(col)
}
}
to build the list of column containing at least two distinct values, and then use it in a select().
Thank you !
I had the same problem and i came up with a similar solution in Java. In my opinion there is no other way of doing it at the moment.
for (String column:df.columns()){
long count = df.select(column).distinct().count();
if(count == 1 && df.select(column).first().isNullAt(0)){
df = df.drop(column);
}
}
I'm dropping all columns containing exactly one distinct value and which first value is null. This way I can be sure that i don't drop columns where all values are the same but not null.
Here's a scala example to remove null columns that only queries that data once (faster):
def removeNullColumns(df:DataFrame): DataFrame = {
var dfNoNulls = df
val exprs = df.columns.map((_ -> "count")).toMap
val cnts = df.agg(exprs).first
for(c <- df.columns) {
val uses = cnts.getAs[Long]("count("+c+")")
if ( uses == 0 ) {
dfNoNulls = dfNoNulls.drop(c)
}
}
return dfNoNulls
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With