I have a Spark DataFrame in Scala called df with two columns, say a and b. Column a contains letters and column b contains numbers, giving the below:
a b
----------
g 0
f 0
g 0
f 1
I can get the distinct rows using
val dfDistinct = df.select("a", "b").distinct
which gives the following:
a b
----------
g 0
f 0
f 1
I want to add another column with the number of times each of these distinct combinations occurs in the first DataFrame, so I'd end up with:
a b count
----------
g 0 2
f 0 1
f 1 1
I don't mind whether that modifies the original command or whether it's a separate operation on dfDistinct giving another DataFrame.
Any advice would be greatly appreciated. I apologise for the trivial nature of this question, but I'm not the most experienced with this kind of operation in Scala or Spark.
Thanks
Dean
You can simply aggregate and count:
df.groupBy($"a", $"b").count
or, a little more verbosely:
import org.apache.spark.sql.functions.{count, lit}
df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt"))
Both are equivalent to a raw SQL aggregation:
df.registerTempTable("df")
sqlContext.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")
Also see Cross Tabulation:
val g="g"
val f = "f"
val df = Seq(
(g, "0"),
(f, "0"),
(g, "0"),
(f, "1")
).toDF("a", "b")
val res = df.stat.crosstab("a","b")
res.show
+---+---+---+
|a_b|  0|  1|
+---+---+---+
|  g|  2|  0|
|  f|  1|  1|
+---+---+---+
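Note that crosstab gives you a wide contingency table (one row per value of a, one column per value of b, with zeros for combinations that never occur) rather than the long a/b/count layout asked for, so groupBy(...).count is the more direct answer to the original question.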