I have a column filled with a bunch of states' initials as strings. My goal is to show the count of each state in that column.
For example, (("TX": 3), ("NJ": 2)) should be the output when there are three occurrences of "TX" and two occurrences of "NJ".
I'm fairly new to PySpark, so I'm stumped by this problem. Any help would be much appreciated.
A couple of building blocks first: in PySpark, count() is an action; it counts the rows of a DataFrame and returns that number to the driver. distinct() removes duplicate rows and returns only the unique rows of the DataFrame. (pandas has nunique() for counting distinct values, but that won't work on a PySpark DataFrame.)
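To make that concrete, here's a minimal sketch. The sample data is made up, and it uses the modern SparkSession entry point; the answer below uses the older sqlContext, which works the same way:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Made-up sample column of state abbreviations
df = spark.createDataFrame([('TX',), ('NJ',), ('TX',)], ['state'])

print(df.count())             # 3 -- counts every row, duplicates included
print(df.distinct().count())  # 2 -- drops the duplicate ('TX',) row first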
I think you're looking to use the DataFrame idiom of groupBy and count.
For example, given the following dataframe, one state per row:
df = sqlContext.createDataFrame([('TX',), ('NJ',), ('TX',), ('CA',), ('NJ',)], ('state',))
df.show()

+-----+
|state|
+-----+
|   TX|
|   NJ|
|   TX|
|   CA|
|   NJ|
+-----+
The following yields:
df.groupBy('state').count().show()

+-----+-----+
|state|count|
+-----+-----+
|   TX|    2|
|   NJ|    2|
|   CA|    1|
+-----+-----+
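If you want the result back on the driver as (state, count) pairs, matching the shape in your question, you can collect the grouped result. A small sketch continuing from the df above (only do this once the grouped result is small enough to fit in driver memory):

counts = df.groupBy('state').count().collect()  # list of Row objects on the driver
pairs = [(row['state'], row['count']) for row in counts]
# pairs == [('TX', 2), ('NJ', 2), ('CA', 1)] in some order; row order is not guaranteed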