
Conditional mean over a Pandas DataFrame

I have a dataset from which I want a few averages of multiple variables I created.

I started off with:

data2['socialIdeology2'].mean()

data2['econIdeology'].mean()

^ that works perfectly, and gives me the averages I'm looking for.

Now I'm trying to compute a conditional mean, that is, the mean for only a select group within the data set. (I want the ideologies broken down by whom people voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'

Looking into it, I came to the conclusion a conditional mean just isn't a thing (although hopefully I am wrong?), so I was writing my own function for it.

This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:

def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count

print(mean())

Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.

So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?

And how can I generate means by category?

Oliver G asked Jun 27 '17 19:06




2 Answers

Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():

means = data2.groupby('voteChoice').mean()

or maybe, in your case, the following would be more efficient:

means = data2.groupby('voteChoice')['socialIdeology2'].mean()

to drill down to the mean you're looking for. (The first case will calculate means for all columns.) This is assuming that voteChoice is the name of the column you want to condition on.
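As a minimal sketch of the two forms above, here they are run on a small made-up frame. The values in data2 are invented for illustration, and the names all_means and social_means are hypothetical; only the column names come from the question.

```python
import pandas as pd

# Made-up data standing in for data2; the values are purely illustrative.
data2 = pd.DataFrame({
    'voteChoice': ['Clinton', 'Trump', 'Clinton', 'Trump', 'Other'],
    'socialIdeology2': [2.0, 6.0, 3.0, 5.0, 4.0],
    'econIdeology': [1.0, 7.0, 3.0, 5.0, 4.0],
})

# Means of every remaining (numeric) column, one row per voteChoice value.
all_means = data2.groupby('voteChoice').mean()

# Mean of a single column per group; the result is a Series indexed by voteChoice.
social_means = data2.groupby('voteChoice')['socialIdeology2'].mean()

print(all_means)
print(social_means)
```

If the real frame also holds non-numeric columns, the all-columns form may need numeric_only=True in recent pandas versions; the single-column form avoids that question entirely.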

Brad Solomon answered Sep 21 '22 08:09


If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:

voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
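The sketch below runs the boolean-mask approach on made-up data that deliberately includes one missing value. It also touches on the nan part of the question: Series.mean() skips missing values by default (skipna=True), whereas a hand-written running sum turns into nan as soon as a single nan is added. Whether that is what happened in the original data is an assumption, since the question does not show the data itself.

```python
import math
import pandas as pd

# Made-up data; one Clinton row has a missing ideology score on purpose.
data2 = pd.DataFrame({
    'voteChoice': ['Clinton', 'Trump', 'Clinton', 'Clinton'],
    'socialIdeology2': [2.0, 6.0, float('nan'), 4.0],
})

# Boolean Series that is True for the rows we want to keep.
voted_for_clinton = data2['voteChoice'] == 'Clinton'
clinton_values = data2.loc[voted_for_clinton, 'socialIdeology2']

# Default behaviour: missing values are skipped before averaging.
mean_skipping_nan = clinton_values.mean()

# With skipna=False the missing value propagates into the result.
mean_with_nan = clinton_values.mean(skipna=False)

# A manual running sum behaves like skipna=False: nan + anything is nan.
manual_total = 0.0
for value in clinton_values:
    manual_total += value

print(mean_skipping_nan, mean_with_nan, manual_total)
print(math.isnan(manual_total))
```

Iterating over the Series values directly, as done here, also sidesteps label-based lookups such as data2['socialIdeology2'][index], which depend on the frame's index labels rather than on row positions.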

If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:

means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()

Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in, whereas if you place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']) this computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
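To illustrate that the two orderings differ only in how much work is done, not in the result, here is a small sketch on made-up data. The names select_first and select_last are hypothetical.

```python
import pandas as pd

# Made-up data; only the column names come from the question.
data2 = pd.DataFrame({
    'voteChoice': ['Clinton', 'Trump', 'Clinton', 'Trump'],
    'socialIdeology2': [2.0, 6.0, 3.0, 5.0],
    'econIdeology': [1.0, 7.0, 3.0, 5.0],
})

# Select the column first, then aggregate: only one column is averaged.
select_first = data2.groupby('voteChoice')['socialIdeology2'].mean()

# Aggregate every column first, then pick one out of the result.
select_last = data2.groupby('voteChoice').mean()['socialIdeology2']

# Both orderings yield the same per-group means.
print(select_first.equals(select_last))
```

On a frame with only a couple of columns the difference in cost is negligible; it becomes noticeable mainly when there are many columns to average unnecessarily.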

See the pandas documentation for more info on indexing DataFrames using .loc and on groupby.

ali_m answered Sep 21 '22 08:09