How can I count the occurrences of a String in a df
Column using Spark partitioned by id
?
e.g. Find the value "test"
in column "name"
of a df
In SQL would be:
SELECT
SUM(CASE WHEN name = 'test' THEN 1 else 0 END) over window AS cnt_test
FROM
mytable
WINDOW window AS (PARTITION BY id)
I've tried using map( v => match { case "test" -> 1.. })
and things like:
def getCount(df: DataFrame): DataFrame = {
val dfCnt = df.agg(
.withColumn("cnt_test",
count(col("name")==lit('test'))
)
Is this an expensive operation? What could be the best approach to check for occurrences of a specific string and then perform an action (sum, max, min, etc)
?
thanks
You can use groupBy
+ agg
in spark; Here when($"name" == "test", 1)
transforms name
column to 1
if name == 'test'
, null
otherwise, and count
gives count of non null values:
df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test"))
Example:
val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")
df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
| b| 1|
| a| 0|
+---+--------+
Or similar to your sql queries:
df.groupBy("id").agg(sum(when($"name" === "test", 1).otherwise(0)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
| b| 1|
| a| 0|
+---+--------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With