I have a column filled with a bunch of states' initials as strings. My goal is to show the count of each state in that column.
For example, (("TX": 3), ("NJ": 2)) should be the output when there are three occurrences of "TX" and two occurrences of "NJ".
I'm fairly new to PySpark, so I'm stumped by this problem. Any help would be much appreciated.
A couple of building blocks first: in PySpark, count() is an action; it counts the rows of a DataFrame and returns that number to the driver. distinct() removes duplicate rows and returns only the unique rows of the DataFrame. (pandas has nunique() for counting distinct values, but that won't work on a PySpark DataFrame.)
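To make that concrete, here's a minimal sketch. The sample data is made up, and it uses the modern SparkSession entry point; the answer below uses the older sqlContext, which works the same way:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Made-up sample column of state abbreviations
df = spark.createDataFrame([('TX',), ('NJ',), ('TX',)], ['state'])

print(df.count())             # 3 -- counts every row, duplicates included
print(df.distinct().count())  # 2 -- drops the duplicate ('TX',) row first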
I think you're looking to use the DataFrame idiom of groupBy and count.
For example, given the following dataframe, one state per row:
df = sqlContext.createDataFrame([('TX',), ('NJ',), ('TX',), ('CA',), ('NJ',)], ('state',))
df.show()

+-----+
|state|
+-----+
|   TX|
|   NJ|
|   TX|
|   CA|
|   NJ|
+-----+
The following yields:
df.groupBy('state').count().show()

+-----+-----+
|state|count|
+-----+-----+
|   TX|    2|
|   NJ|    2|
|   CA|    1|
+-----+-----+
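If you want the result back on the driver as (state, count) pairs, matching the shape in your question, you can collect the grouped result. A small sketch continuing from the df above (only do this once the grouped result is small enough to fit in driver memory):

counts = df.groupBy('state').count().collect()  # list of Row objects on the driver
pairs = [(row['state'], row['count']) for row in counts]
# pairs == [('TX', 2), ('NJ', 2), ('CA', 1)] in some order; row order is not guaranteed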