
How to calculate the counts of each distinct value in a pyspark dataframe?

I have a column filled with states' initials as strings. My goal is to get the count of each state in that column.

For example: (("TX":3),("NJ":2)) should be the output when there are two occurrences of "TX" and "NJ".

I'm fairly new to pyspark so I'm stumped with this problem. Any help would be much appreciated.

asked Feb 25 '17 by madsthaks

People also ask

How do you count distinct in a DataFrame?

You can use the nunique() function to count the number of unique values in a pandas DataFrame.
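For instance, a minimal pandas sketch (the column name state is just for illustration):

import pandas as pd

pdf = pd.DataFrame({'state': ['TX', 'NJ', 'TX', 'CA', 'NJ']})
print(pdf['state'].nunique())       # 3 distinct values
print(pdf['state'].value_counts())  # per-value counts: TX 2, NJ 2, CA 1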

How does count work in PySpark?

count() counts the rows of a DataFrame and returns that number to the driver, which makes it an action in PySpark. It is commonly used to check how many rows are present before or after an analysis step.
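A minimal sketch, assuming an existing SparkSession named spark:

df = spark.createDataFrame([('TX',), ('NJ',), ('TX',)], ['state'])
print(df.count())  # 3 -- count() is an action that returns the row count to the driver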

How do you use distinct in PySpark?

distinct() in PySpark removes duplicate rows and returns only the unique rows from the DataFrame.
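A minimal sketch, again assuming an existing SparkSession named spark:

df = spark.createDataFrame([('TX',), ('NJ',), ('TX',)], ['state'])
df.distinct().show()  # keeps one row each for TX and NJ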


1 Answer

I think you're looking to use the DataFrame idiom of groupBy and count.

For example, given the following dataframe, one state per row:

df = sqlContext.createDataFrame([('TX',), ('NJ',), ('TX',), ('CA',), ('NJ',)], ('state',))
df.show()

+-----+
|state|
+-----+
|   TX|
|   NJ|
|   TX|
|   CA|
|   NJ|
+-----+

Grouping by state and counting then yields:

df.groupBy('state').count().show()

+-----+-----+
|state|count|
+-----+-----+
|   TX|    2|
|   NJ|    2|
|   CA|    1|
+-----+-----+
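If you want the result in the key/value shape the question describes, one option (a sketch, not part of the original answer) is to collect the grouped counts to the driver; each Row behaves like a (state, count) tuple, so dict() works directly:

counts = dict(df.groupBy('state').count().collect())
print(counts)  # e.g. {'TX': 2, 'NJ': 2, 'CA': 1}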
answered Sep 19 '22 by eddies