I have a dataframe with columns like:
['id','name','foo1', 'foo1', 'foo1', 'foo2','foo2', 'foo3']
I would like to get a new dataframe where columns sharing the same name are averaged:
['id','name','foo1', 'foo2','foo3']
Here column foo1 would be the average of the three columns named foo1 in the original dataframe, foo2 would be the average of the two columns named foo2 and foo3 would be just foo3
Note: id and name are not numeric and I have to keep them.
To calculate the mean of whole columns in the DataFrame, use pandas.Series.mean() with a list of DataFrame columns. You can also get the mean for all numeric columns using DataFrame.mean(), use axis=0 argument to calculate the column-wise mean of the DataFrame.
mean() function return the mean of the values for the requested axis. If the method is applied on a pandas series object, then the method returns a scalar value which is the mean value of all the observations in the dataframe.
Other statistical languages more stringently guard against duplicate column names. Pandas, however, can be tricked into allowing duplicate column names. Duplicate column names are a problem if you plan to transfer your data set to another statistical language.
Method 2: Using equals() methods. This method Test whether two-column contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
The basic idea is that you can group by your columns names and do mean operations for each group.
I saw some comments for your question and tried to give you different ways to achieve the goal. (Solution (3) is the best I found!)
(1) Quick solution. If you have very limited columns that are non-numeric, and own unique names, e.g., columns id
and name
. What you can do is:
First set index ['id', 'name']
to preserve them,
df = df.set_index(['id', 'name'])
then use DataFrame.groupby
function on columns
, set axis=1
(iterate over each column), apply mean
function for each group.
df.groupby(by=df.columns, axis=1).mean()
And finally, reset index to recover ['id', 'name']
columns
df = df.reset_index()
Here is a sample code:
In [35]: df = pd.DataFrame([['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]], columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'], index=list('AB'))
In [36]: df = df.set_index(['id', 'name'])
In [37]: df = df.groupby(by=df.columns, axis=1).mean()
In [38]: df = df.reset_index()
In [39]: df
Out[39]:
id name c1 c2 c3
0 001 a 1 55 1000
1 002 b 2 110 2000
(2) Complete solution. If you have lots of columns that are non-numeric and unique named, what you can do is:
First transpose you dataframe,
df2 = df.transpose()
Then you do group by operations (on its index and axis=0
), but carefully handle each groups: for these numeric groups, return their mean value; and for these non-numeric groups, return their first row:
df2 = df2.groupby(by=df2.index, axis=0).apply(lambda g: g.mean() if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[0])
And finally, transpose back:
df = df2.transpose()
Here is sample of code:
In [98]: df = pd.DataFrame([['001', 'a', 1, 10, 100, 1000], ['002', 'b', 2, 20, 200, 2000]], columns=['id', 'name', 'c1', 'c2', 'c2', 'c3'], index=list('AB'))
In [99]: df2 = df.transpose()
In [100]: df2 = df2.groupby(by=df2.index, axis=0).apply(lambda g: g.mean() if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[0])
In [101]: df3 = df2.transpose()
In [102]: df3
Out[102]:
c1 c2 c3 id name
A 1 55 1000 001 a
B 2 110 2000 002 b
In [103]: df
Out[103]:
id name c1 c2 c2 c3
A 001 a 1 10 100 1000
B 002 b 2 20 200 2000
You need to import numbers
More notes:
(3) All in one! This solution is the best I found:
df.groupby(by=df.columns, axis=1).apply(lambda g: g.mean(axis=1) if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[:,0])
I tried to handle each group for the un-transposed groups, that is,
df.groupby(by=df.columns, axis=1).apply(gf)
And
gf = lambda g: g.mean(axis=1) if isinstance(g.iloc[0,0], numbers.Number) else g.iloc[:,0]
I failed before, because I do not carefully hand the axis. You must set axis=1
for mean
function, and return columns for non-numeric groups.
Thanks!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With