I'm using the following code to aggregate students per year. The goal is to get the total number of students for each year.
from pyspark.sql.functions import col
import pyspark.sql.functions as fn

# Counts every row per year, so repeated Student_IDs get counted again.
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))
The problem I discovered is that many IDs are repeated, so the result is wrong and inflated.
I want to aggregate the students by year and count the total number of students per year, counting each ID only once.
In PySpark, there are two ways to get a distinct count. One is to chain the DataFrame's distinct() and count() functions. The other is the SQL-style countDistinct() function, which returns the count of distinct values across the selected columns. Both are sketched below.
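As a minimal sketch of both approaches (the session setup and toy data here are assumptions for illustration, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()

# Toy data with a duplicated Student_ID (illustrative values only).
df = spark.createDataFrame(
    [("2001", "s1"), ("2001", "s1"), ("2002", "s2")], ["Year", "Student_ID"]
)

# Way 1: drop duplicates first, then count the remaining rows.
print(df.select("Student_ID").distinct().count())  # 2

# Way 2: the countDistinct() aggregate function.
df.agg(countDistinct("Student_ID").alias("distinct_students")).show()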
When we call groupBy() on a PySpark DataFrame, it returns a GroupedData object that exposes aggregate functions such as count(), which returns the number of rows in each group; mean(), which returns the mean of the values in each group; and max(), which returns the maximum value in each group.
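A quick illustrative sketch of those GroupedData aggregates (the toy data and column names are made up for demonstration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as fn

spark = SparkSession.builder.getOrCreate()

# Toy (year, score) rows, purely illustrative.
df = spark.createDataFrame(
    [("2001", 10), ("2001", 20), ("2002", 30)], ["year", "score"]
)

grouped = df.groupBy("year")
grouped.count().show()                                 # number of rows per year
grouped.agg(fn.mean("score"), fn.max("score")).show()  # mean and max per year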
(For comparison, in pandas you would use the nunique() function to count the number of unique values in a DataFrame.)
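For comparison only, a minimal pandas sketch (this is not part of the PySpark solution; the toy frame below is made up):

import pandas as pd

# Illustrative pandas equivalent: distinct Student_IDs per Year.
pdf = pd.DataFrame({
    "Year": [2001, 2001, 2002, 2002],
    "Student_ID": ["s1", "s1", "s2", "s3"],
})
print(pdf.groupby("Year")["Student_ID"].nunique())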
Use the countDistinct function:
from pyspark.sql.functions import countDistinct

# Sample data with deliberately repeated (year, id) pairs.
x = [("2001", "id1"), ("2002", "id1"), ("2002", "id1"), ("2001", "id1"),
     ("2001", "id2"), ("2001", "id2"), ("2002", "id2")]
y = spark.createDataFrame(x, ["year", "id"])

gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()
Output:
+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+
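Applied back to the code from the question, the fix might look like the sketch below (Df2 and the column names come from the question; the alias simply restores the original output column name):

from pyspark.sql import functions as fn

# Count each Student_ID at most once per Year.
df_grouped = Df2.groupBy("Year").agg(
    fn.countDistinct("Student_ID").alias("total_student_by_year")
)
df_grouped.show()

Note that countDistinct() ignores nulls, so if Student_ID can be null, those rows will not contribute to the count.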