
How to count frequency of each categorical variable in a column in pyspark dataframe?

Say I have a pyspark dataframe:

df.show()
+-----+---+
|  x  |  y|
+-----+---+
|alpha|  1|
|beta |  2|
|gamma|  1|
|alpha|  2|
+-----+---+

I want to count how many occurrences of alpha, beta, and gamma there are in column x. How do I do this in PySpark?

asked Jan 29 '23 by versatile parsley

1 Answer

Use pyspark.sql.DataFrame.cube():

df.cube("x").count().show()
answered Jan 30 '23 by versatile parsley