Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Mean of columns with the same names

Tags:

python

pandas

I have a dataframe with columns like:

['id','name','foo1', 'foo1', 'foo1', 'foo2','foo2', 'foo3']

I would like to get a new dataframe where columns sharing the same name are averaged:

['id','name','foo1', 'foo2','foo3']

Here column foo1 would be the average of the three columns named foo1 in the original dataframe, foo2 would be the average of the two columns named foo2 and foo3 would be just foo3

Note: id and name are not numeric and I have to keep them.

like image 617
user3635284 Avatar asked Oct 28 '16 19:10

user3635284


People also ask

How do you calculate mean of multiple columns in pandas?

To calculate the mean of whole columns in the DataFrame, use pandas.Series.mean() with a list of DataFrame columns. You can also get the mean for all numeric columns using DataFrame.mean(), use axis=0 argument to calculate the column-wise mean of the DataFrame.

How do you find the mean of a specific column in pandas?

mean() function return the mean of the values for the requested axis. If the method is applied on a pandas series object, then the method returns a scalar value which is the mean value of all the observations in the dataframe.

Can we have two columns with same name in pandas?

Other statistical languages more stringently guard against duplicate column names. Pandas, however, can be tricked into allowing duplicate column names. Duplicate column names are a problem if you plan to transfer your data set to another statistical language.

How can you tell if two columns are identical with pandas?

Method 2: Using equals() methods. This method Test whether two-column contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.


1 Answers

The basic idea is that you can group by your columns names and do mean operations for each group.

I saw some comments for your question and tried to give you different ways to achieve the goal. (Solution (3) is the best I found!)

(1) Quick solution. If you have very limited columns that are non-numeric, and own unique names, e.g., columns id and name. What you can do is:

First set index ['id', 'name'] to preserve them,

df = df.set_index(['id', 'name']) 

then use DataFrame.groupby function on columns, set axis=1 (iterate over each column), apply mean function for each group.

df.groupby(by=df.columns, axis=1).mean()

And finally, reset index to recover ['id', 'name'] columns

df = df.reset_index()

Here is a sample code:

In [35]: df = pd.DataFrame([['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]], columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'], index=list('AB'))

In [36]: df = df.set_index(['id', 'name'])

In [37]: df = df.groupby(by=df.columns, axis=1).mean()

In [38]: df = df.reset_index()

In [39]: df
Out[39]: 
    id name  c1   c2    c3
0  001    a   1   55  1000
1  002    b   2  110  2000

(2) Complete solution. If you have lots of columns that are non-numeric and unique named, what you can do is:

First transpose you dataframe,

df2 = df.transpose()

Then you do group by operations (on its index and axis=0), but carefully handle each groups: for these numeric groups, return their mean value; and for these non-numeric groups, return their first row:

df2 = df2.groupby(by=df2.index, axis=0).apply(lambda g: g.mean() if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[0])

And finally, transpose back:

df = df2.transpose()

Here is sample of code:

In [98]: df = pd.DataFrame([['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]], columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'], index=list('AB'))

In [99]: df2 = df.transpose()

In [100]: df2 = df2.groupby(by=df2.index, axis=0).apply(lambda g: g.mean() if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[0])

In [101]: df3 = df2.transpose()

In [102]: df3
Out[102]: 
  c1   c2    c3   id name
A  1   55  1000  001    a
B  2  110  2000  002    b

In [103]: df
Out[103]: 
    id name  c1  c2   c2    c3
A  001    a   1  10  100  1000
B  002    b   2  20  200  2000

You need to import numbers

More notes:

(3) All in one! This solution is the best I found:

df.groupby(by=df.columns, axis=1).apply(lambda g: g.mean(axis=1) if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[:,0])

I tried to handle each group for the un-transposed groups, that is,

df.groupby(by=df.columns, axis=1).apply(gf)

And

gf = lambda g: g.mean(axis=1) if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[:,0]

I failed before, because I do not carefully hand the axis. You must set axis=1 for mean function, and return columns for non-numeric groups.

Thanks!

like image 173
rojeeer Avatar answered Oct 25 '22 14:10

rojeeer