Sorry for the horrible title. Here's what I mean. Here's the starting dataset:
C1 C2
AA H
AB M
AC M
AA H
AA L
AC L
I want to turn it into a new dataset with 4 columns, one count column per distinct C2 value:
C1 CH CM CL
AA 2 0 1
AB 0 1 0
AC 0 1 1
You can use the pivot API together with groupBy and agg, as follows:
from pyspark.sql import functions as F
finaldf = df.groupBy("C1").pivot("C2").agg(F.count("C2").alias("count")).na.fill(0)
and finaldf should look like this, with one column per distinct C2 value:
+---+---+---+---+
| C1| H| L| M|
+---+---+---+---+
| AA| 2| 1| 0|
| AB| 0| 0| 1|
| AC| 0| 1| 1|
+---+---+---+---+
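To make the logic of the group-then-pivot count easier to follow, here is a minimal plain-Python sketch of the same computation, without Spark. The function name pivot_count, its levels parameter, and the "C" prefix used to build the CH/CM/CL headers are illustrative assumptions, not part of the PySpark API.

```python
from collections import Counter

# Rows from the question's starting dataset, as (C1, C2) pairs.
rows = [("AA", "H"), ("AB", "M"), ("AC", "M"),
        ("AA", "H"), ("AA", "L"), ("AC", "L")]

def pivot_count(rows, levels=("H", "M", "L")):
    """Hypothetical helper: count each C2 level per C1 key.

    Combinations that never occur get 0, which is the role
    na.fill(0) plays in the PySpark version.
    """
    counts = Counter(rows)                  # (C1, C2) -> occurrences
    keys = sorted({c1 for c1, _ in rows})   # distinct C1 values
    header = ["C1"] + ["C" + lvl for lvl in levels]
    body = [[k] + [counts[(k, lvl)] for lvl in levels] for k in keys]
    return header, body

header, body = pivot_count(rows)
print(" ".join(header))
for line in body:
    print(" ".join(str(v) for v in line))
```

This prints the four-column layout asked for in the question (C1, CH, CM, CL). In PySpark itself, pivot also accepts an optional list of values as a second argument, which fixes the order of the generated columns; they can then be renamed with withColumnRenamed if the CH/CM/CL names are needed.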