I'm using the following code to aggregate students per year. The goal is to get the total number of students for each year.
from pyspark.sql.functions import col
import pyspark.sql.functions as fn

# Counts every row per year, so repeated Student_IDs get counted again.
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))
The problem I discovered is that many IDs are repeated, so the result is wrong and inflated.
I want to aggregate the students by year and count the total number of students per year, counting each ID only once.
In PySpark, there are two ways to get a distinct count. One is to chain the DataFrame's distinct() and count() functions. The other is the SQL-style countDistinct() function, which returns the count of distinct values across the selected columns. Both are sketched below.
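As a minimal sketch of both approaches (the session setup and toy data here are assumptions for illustration, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()

# Toy data with a duplicated Student_ID (illustrative values only).
df = spark.createDataFrame(
    [("2001", "s1"), ("2001", "s1"), ("2002", "s2")], ["Year", "Student_ID"]
)

# Way 1: drop duplicates first, then count the remaining rows.
print(df.select("Student_ID").distinct().count())  # 2

# Way 2: the countDistinct() aggregate function.
df.agg(countDistinct("Student_ID").alias("distinct_students")).show()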
When we call groupBy() on a PySpark DataFrame, it returns a GroupedData object that exposes aggregate functions such as count(), which returns the number of rows in each group; mean(), which returns the mean of the values in each group; and max(), which returns the maximum value in each group.
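A quick illustrative sketch of those GroupedData aggregates (the toy data and column names are made up for demonstration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as fn

spark = SparkSession.builder.getOrCreate()

# Toy (year, score) rows, purely illustrative.
df = spark.createDataFrame(
    [("2001", 10), ("2001", 20), ("2002", 30)], ["year", "score"]
)

grouped = df.groupBy("year")
grouped.count().show()                                 # number of rows per year
grouped.agg(fn.mean("score"), fn.max("score")).show()  # mean and max per year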
(For comparison, in pandas you would use the nunique() function to count the number of unique values in a DataFrame.)
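For comparison only, a minimal pandas sketch (this is not part of the PySpark solution; the toy frame below is made up):

import pandas as pd

# Illustrative pandas equivalent: distinct Student_IDs per Year.
pdf = pd.DataFrame({
    "Year": [2001, 2001, 2002, 2002],
    "Student_ID": ["s1", "s1", "s2", "s3"],
})
print(pdf.groupby("Year")["Student_ID"].nunique())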
Use the countDistinct function:
from pyspark.sql.functions import countDistinct

# Sample data with deliberately repeated (year, id) pairs.
x = [("2001", "id1"), ("2002", "id1"), ("2002", "id1"), ("2001", "id1"),
     ("2001", "id2"), ("2001", "id2"), ("2002", "id2")]
y = spark.createDataFrame(x, ["year", "id"])

gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()
Output:
+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+
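Applied back to the code from the question, the fix might look like the sketch below (Df2 and the column names come from the question; the alias simply restores the original output column name):

from pyspark.sql import functions as fn

# Count each Student_ID at most once per Year.
df_grouped = Df2.groupBy("Year").agg(
    fn.countDistinct("Student_ID").alias("total_student_by_year")
)
df_grouped.show()

Note that countDistinct() ignores nulls, so if Student_ID can be null, those rows will not contribute to the count.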