
How to count unique ID after groupBy in pyspark

I'm using the following code to aggregate students per year. The purpose is to get the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn

gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The problem I discovered is that many IDs are repeated, so the result is inflated and wrong.

I want to aggregate the students by year, counting the total number of students per year while avoiding duplicate IDs.

Lizou asked Sep 26 '17


People also ask

How do you count unique values in PySpark?

In PySpark, there are two ways to get the count of distinct values. We can use the distinct() and count() functions of DataFrame to get the distinct count of a PySpark DataFrame. Another way is to use the SQL countDistinct() function, which returns the distinct value count of all the selected columns.

How do you use group by and count in PySpark?

When we perform groupBy() on a PySpark DataFrame, it returns a GroupedData object that provides the aggregate functions below. count() returns the number of rows for each group. mean() returns the mean of values for each group. max() returns the maximum of values for each group.

How do you count unique records of a DataFrame?

You can use the nunique() function to count the number of unique values in a pandas DataFrame.
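For comparison, the pandas equivalent of the question's task; the DataFrame contents here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"year": ["2001", "2001", "2002"],
                   "id":   ["id1", "id1", "id2"]})

# unique count per column
per_col = df.nunique()

# unique IDs per year: the grouped analogue of the PySpark question
per_year = df.groupby("year")["id"].nunique()
```

Here per_year reports 1 unique ID for each year, even though 2001 has two rows.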


1 Answer

Use the countDistinct function:

from pyspark.sql.functions import countDistinct

x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),
     ("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x, ["year","id"])

gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()

Output:

+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+
pauli answered Oct 08 '22