Spark scala - find non-zero rows in a df

I have more than 100 columns in a dataframe, 90 of which are metric columns. I need to find the rows in which at least one metric is not 0. Currently I filter with something like metric1 <> 0 or metric2 <> 0 and so on. Is there a better way to handle this?

Ram asked Dec 07 '25 05:12



2 Answers

Here are some more options, all presuming that the target columns have names such as metric1, metric2, metric3 ... metricN.

First let's identify the target columns:

val targetColumns = df.columns.filter(_.matches("metric\\d+"))

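Option1: Filter using greatest, which returns the largest value among the given columns. This assumes the metrics are never negative, since a row such as (-1, 0) has a greatest value of 0 and would be dropped: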

import org.apache.spark.sql.functions.{col, greatest}

df.filter(greatest(targetColumns.map(col): _*) =!= 0)

Option2: Apply a bitwise OR across the columns (this requires integral column types); the result is 0 only when every column is 0:

import org.apache.spark.sql.functions.col

val bitwiseORCols = targetColumns.map(col).reduce(_ bitwiseOR _)

df.filter(bitwiseORCols =!= 0)
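
A further sketch along the same lines: the manual chain of metric1 <> 0 or metric2 <> 0 from the question can be generated by mapping every target column to a `=!= 0` predicate and combining the predicates with `||`. This should not depend on the sign of the values or on integral types. The local SparkSession, the sample data, and the names `anyNonZero` and `nonZeroIds` are assumptions made for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session and sample data are assumptions for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("nonZeroRows").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "name1", 3, 0, 0),
  (2, "name2", 0, 0, 0),
  (3, "name3", 0, -3, 3),
  (4, "name4", 0, 0, 0)
).toDF("id", "name", "metric1", "metric2", "metric3")

// Select the metric columns by name, as in the answer above.
val targetColumns = df.columns.filter(_.matches("metric\\d+"))

// Build (metric1 =!= 0) || (metric2 =!= 0) || ... over all target columns.
val anyNonZero = targetColumns.map(c => col(c) =!= 0).reduce(_ || _)

// Keep only rows where at least one metric is non-zero and collect their ids.
val nonZeroIds = df.filter(anyNonZero).select("id").as[Int].collect().sorted.toSeq
```

Row 3 contains a negative metric here on purpose, to show that the predicate does not rely on the values being non-negative.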
abiratsis answered Dec 08 '25 18:12



You can build an array column from your metric columns and use a UDF to check whether that array contains any non-zero value.

scala> df.show
+---+-----+-------+-------+-------+
| id| name|metric1|metric2|metric3|
+---+-----+-------+-------+-------+
|  1|name1|      3|      0|      0|
|  2|name2|      0|      0|      0|
|  3|name3|      0|      3|      3|
|  4|name4|      0|      0|      0|
+---+-----+-------+-------+-------+


scala> def arrayNotAllZeros[T](a: Seq[T]):Boolean = {
     |   a.exists(_ != 0)
     | } 
arrayNotAllZeros: [T](a: Seq[T])Boolean


scala> val myUdf = udf { arrayNotAllZeros[Int] _ }
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(ArrayType(IntegerType,false))))


scala> val metricCols = df.columns.takeRight(3)
metricCols: Array[String] = Array(metric1, metric2, metric3)

scala> df.withColumn("nonZeroRow", myUdf(array(metricCols.head, metricCols.tail:_*))).show
+---+-----+-------+-------+-------+----------+
| id| name|metric1|metric2|metric3|nonZeroRow|
+---+-----+-------+-------+-------+----------+
|  1|name1|      3|      0|      0|      true|
|  2|name2|      0|      0|      0|     false|
|  3|name3|      0|      3|      3|      true|
|  4|name4|      0|      0|      0|     false|
+---+-----+-------+-------+-------+----------+
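
On Spark 3.0 or later, the same array check can probably be written without a UDF by using the built-in higher-order function `exists`, which keeps the predicate visible to the query optimizer. The following is a minimal sketch under those assumptions; the local SparkSession and the names `withFlag` and `flags` are illustrative, and the sample data mirrors the table above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, exists}

// Local session is an assumption for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("existsNonZero").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "name1", 3, 0, 0),
  (2, "name2", 0, 0, 0),
  (3, "name3", 0, 3, 3),
  (4, "name4", 0, 0, 0)
).toDF("id", "name", "metric1", "metric2", "metric3")

// Pick the metric columns by name prefix instead of by position.
val metricCols = df.columns.filter(_.startsWith("metric"))

// Pack the metrics into one array column and test whether any element is non-zero.
val withFlag = df.withColumn(
  "nonZeroRow",
  exists(array(metricCols.map(col): _*), m => m =!= 0)
)

// Collect the flags in id order for inspection.
val flags = withFlag.orderBy("id").select("nonZeroRow").as[Boolean].collect().toSeq
```

Replacing `withColumn` with `df.filter(exists(...))` would return only the non-zero rows instead of flagging them.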
C.S.Reddy Gadipally answered Dec 08 '25 18:12
