I have a dataset from which I want a few averages of multiple variables I created.
I started off with:
data2['socialIdeology2'].mean()
data2['econIdeology'].mean()
^ that works perfectly, and gives me the averages I'm looking for.
Now I'm trying to compute a conditional mean, i.e. the mean for only a select group within the dataset. (I want the ideologies broken down by who each respondent voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'
Looking into it, I came to the conclusion a conditional mean just isn't a thing (although hopefully I am wrong?), so I was writing my own function for it.
This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:
def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count

print(mean())
Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.
So my question is: if the data stored within the socialIdeology2 variable really is nan (which I don't understand how it could be), why is it that the .mean() function works with it?
And how can I generate means by category?
To get a column's average from a pandas DataFrame, use either the mean() or the describe() method. DataFrame.mean() returns the mean of the values along the requested axis.
To calculate mean values grouped on another column in pandas, use groupby and then apply the mean() method, which computes the average within each group.
Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():
means = data2.groupby('voteChoice').mean()
or maybe, in your case, the following would be more efficient:
means = data2.groupby('voteChoice')['socialIdeology2'].mean()
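to drill down to the mean you're looking for. (The first case will calculate means for all columns; in pandas 2.0 and later it raises a TypeError if any of the other columns are non-numeric, unless you pass numeric_only=True.) This is assuming that voteChoice is the name of the column you want to condition on.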
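As a rough illustration, here is a tiny made-up frame showing what the grouped mean returns. Only the column names come from the question; the values are invented for the sketch.

```python
import pandas as pd

# Toy data: column names follow the question, the values are invented.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump"],
    "socialIdeology2": [2.0, 6.0, 4.0, 8.0],
    "econIdeology": [1.0, 5.0, 3.0, 7.0],
})

# One mean per voteChoice group for a single column (returns a Series).
means = data2.groupby("voteChoice")["socialIdeology2"].mean()
print(means)

# Means of several columns at once (returns a DataFrame indexed by group).
both = data2.groupby("voteChoice")[["socialIdeology2", "econIdeology"]].mean()
print(both)
```

The result is indexed by the group label, so means['Clinton'] picks out the value for a single group.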
If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:
voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
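Here is a small sketch of the boolean-mask approach on invented data. It also shows one possible reason a hand-written loop gives nan while .mean() does not: .mean() skips missing values by default (skipna=True), whereas adding a NaN into a running sum turns the whole sum into NaN. I can't tell from the question whether that is exactly what happened there, so treat it as an assumption.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value; the values are invented for illustration.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Clinton"],
    "socialIdeology2": [2.0, 6.0, np.nan, 4.0],
})

# Boolean Series that is True for the rows we want to keep.
voted_for_clinton = data2["voteChoice"] == "Clinton"
clinton_values = data2.loc[voted_for_clinton, "socialIdeology2"]

# pandas ignores NaN by default, so this averages only 2.0 and 4.0.
skipping_nan = clinton_values.mean()

# Asking pandas not to skip NaN makes the result NaN.
keeping_nan = clinton_values.mean(skipna=False)

# A hand-rolled loop behaves like skipna=False: one NaN poisons the sum.
total = 0.0
for value in clinton_values:
    total += value
manual_mean = total / len(clinton_values)

print(skipping_nan, keeping_nan, manual_mean)
```

Iterating over the Series values directly, rather than indexing with [index], also avoids surprises when the index labels are not 0, 1, 2, and so on.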
If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:
means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()
Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in, whereas if you place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']) this computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
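A quick check on invented data suggests the two orderings give the same numbers, so the difference is only in how much work is done. The frame below is hypothetical and has only numeric columns besides the grouping key, which keeps the "mean over all columns" form from failing on non-numeric data.

```python
import pandas as pd

# Toy data: column names follow the question, the values are invented.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump"],
    "socialIdeology2": [2.0, 6.0, 4.0, 8.0],
    "econIdeology": [1.0, 5.0, 3.0, 7.0],
})

# Select the column first, then average: only one column is aggregated.
select_then_mean = data2.groupby("voteChoice")["socialIdeology2"].mean()

# Average every column first, then select: extra columns are aggregated and discarded.
mean_then_select = data2.groupby("voteChoice").mean()["socialIdeology2"]

# Both routes produce the same Series.
same_result = select_then_mean.equals(mean_then_select)
print(same_result)
```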
See the pandas documentation for more info on indexing DataFrames using .loc and for more info on groupby.