Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scala Spark - Count occurrences of a specific string in Dataframe column

How can I count the occurrences of a String in a df Column using Spark partitioned by id?

e.g. Find the value "test" in column "name" of a df

In SQL would be:

 SELECT
    SUM(CASE WHEN name = 'test' THEN 1 else 0 END) over window AS cnt_test
  FROM
    mytable
 WINDOW window AS (PARTITION BY id)

I've tried using map( v => match { case "test" -> 1.. })

and things like:

def getCount(df: DataFrame): DataFrame = {
    val dfCnt = df.agg(
          .withColumn("cnt_test", 
            count(col("name")==lit('test'))
)

Is this an expensive operation? What could be the best approach to check for occurrences of a specific string and then perform an action (sum, max, min, etc)?

thanks

like image 783
Lou_Ds Avatar asked Jan 29 '23 08:01

Lou_Ds


1 Answers

You can use groupBy + agg in spark; Here when($"name" == "test", 1) transforms name column to 1 if name == 'test', null otherwise, and count gives count of non null values:

df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test"))

Example:

val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")
df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
|  b|       1|
|  a|       0|
+---+--------+

Or similar to your sql queries:

df.groupBy("id").agg(sum(when($"name" === "test", 1).otherwise(0)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
|  b|       1|
|  a|       0|
+---+--------+
like image 148
Psidom Avatar answered Feb 02 '23 08:02

Psidom