Conditional mean over a Pandas DataFrame

I have a dataset and want the averages of several variables I created.

I started off with:

data2['socialIdeology2'].mean()

data2['econIdeology'].mean()

^ that works perfectly, and gives me the averages I'm looking for.

Now I'm trying to compute a conditional mean, that is, the mean for only a select group within the dataset. (I want the ideologies broken down by whom the respondent voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'

Looking into it, I came to the conclusion a conditional mean just isn't a thing (although hopefully I am wrong?), so I was writing my own function for it.

This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:

def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count

print(mean())

Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.

So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?

And how can I generate means by category?

asked Jun 27 '17 19:06

Oliver G

2 Answers

Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():

means = data2.groupby('voteChoice').mean()

or maybe, in your case, the following would be more efficient:

means = data2.groupby('voteChoice')['socialIdeology2'].mean()

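to drill down to the mean you're looking for. (The first case calculates means for all columns; in pandas 2.0 and later, that call can raise a TypeError on a non-numeric column unless you pass numeric_only=True.) This assumes that voteChoice is the name of the column you want to condition on.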
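As a minimal sketch of both calls, here is a toy stand-in for data2. The column names come from the question, but the values are invented purely for illustration:

```python
import pandas as pd

# Toy stand-in for data2; the values are made up for illustration.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump", "Other"],
    "socialIdeology2": [2.0, 6.0, 4.0, 5.0, 3.0],
    "econIdeology": [1.0, 7.0, 3.0, 5.0, 4.0],
})

# Mean of every numeric column, one row per voteChoice value.
all_means = data2.groupby("voteChoice").mean(numeric_only=True)

# Mean of a single column, one entry per voteChoice value.
social_means = data2.groupby("voteChoice")["socialIdeology2"].mean()

print(all_means)
print(social_means)
```

The first result is a DataFrame indexed by voteChoice; the second is a Series, so a single group's mean can be read with social_means['Clinton'].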

answered Sep 21 '22 08:09

Brad Solomon

If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:

voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
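To make the intermediate step visible, the sketch below runs the same two lines on a small invented frame (the values are assumptions, not the asker's data) and prints the boolean mask before using it:

```python
import pandas as pd

# Toy stand-in for data2; the values are made up for illustration.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump"],
    "socialIdeology2": [2.0, 6.0, 4.0, 5.0],
})

# One True/False entry per row of the DataFrame.
voted_for_clinton = data2["voteChoice"] == "Clinton"
print(voted_for_clinton.tolist())  # [True, False, True, False]

# .loc keeps only the rows where the mask is True, then selects the column.
mean_for_clinton_voters = data2.loc[voted_for_clinton, "socialIdeology2"].mean()
print(mean_for_clinton_voters)  # 3.0
```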

If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:

means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()

Placing the ['socialIdeology2'] index before .mean() computes the mean only over the column you're interested in. Placing the indexing expression after .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']) computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
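A quick way to check that the two orderings differ only in how much work they do, not in the result, is to compare them on a small frame. The data here is invented, and every non-grouping column is numeric so that the unrestricted .mean() runs on any pandas version:

```python
import pandas as pd

# Toy stand-in for data2; the values are made up for illustration.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump"],
    "socialIdeology2": [2.0, 6.0, 4.0, 5.0],
    "econIdeology": [1.0, 7.0, 3.0, 5.0],
})

# Select the column first: only one column is aggregated.
select_then_mean = data2.groupby("voteChoice")["socialIdeology2"].mean()

# Aggregate first: every column is averaged, then one is kept.
mean_then_select = data2.groupby("voteChoice").mean()["socialIdeology2"]

# Both routes produce the same Series.
assert select_then_mean.equals(mean_then_select)
```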

See the pandas documentation on indexing DataFrames with .loc and on groupby for more info.
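On the nan part of the question, one possible explanation, assuming the column contains at least some missing values, is that Series.mean() skips NaN by default (skipna=True), whereas a manual running sum turns into NaN as soon as a single NaN is added. The sketch below uses an invented three-element Series, not the asker's data:

```python
import math
import pandas as pd

# Invented example column with one missing value.
scores = pd.Series([2.0, float("nan"), 4.0])

# Manual loop: adding NaN to a float yields NaN, so the total is lost.
total = 0.0
for value in scores:
    total += value
manual_mean = total / len(scores)

# pandas skips NaN by default, so this averages only 2.0 and 4.0.
pandas_mean = scores.mean()

# Turning skipna off reproduces the behaviour of the manual loop.
strict_mean = scores.mean(skipna=False)

print(manual_mean, pandas_mean, strict_mean)
```

Calling data2['socialIdeology2'].isna().sum() would show how many values in the real column are missing.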

answered Sep 21 '22 08:09

ali_m

Tags: python, pandas, nan, mean
