Pandas: Sum values in two different columns using loc[] as assign as a new column. We selected the columns 'Jan' & 'Feb' using loc[] and got a mini dataframe which contains only these two columns. Then called the sum() with axis=1, which added the values in all the columns and returned a Series object.
Pandas DataFrame sum() MethodThe sum() method adds all values in each column and returns the sum for each column. By specifying the column axis ( axis='columns' ), the sum() method searches column-wise and returns the sum of each row.
I think you've misunderstood some python syntax, the following does two assignments:
In [11]: a = b = 1
In [12]: a
Out[12]: 1
In [13]: b
Out[13]: 1
So in your code it was as if you were doing:
sum = df['budget'] + df['actual'] # a Series
# and
df['variance'] = df['budget'] + df['actual'] # assigned to a column
The latter creates a new column for df:
In [21]: df
Out[21]:
cluster date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
In [22]: df['variance'] = df['budget'] + df['actual']
In [23]: df
Out[23]:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
As an aside, you shouldn't use sum
as a variable name as the overrides the built-in sum function.
df['variance'] = df.loc[:,['budget','actual']].sum(axis=1)
Same thing can be done using lambda function. Here I am reading the data from a xlsx file.
import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name = 4)
print df
Output:
cluster Unnamed: 1 date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
Sum two columns into 3rd new one.
df['variance'] = df.apply(lambda x: x['budget'] + x['actual'], axis=1)
print df
Output:
cluster Unnamed: 1 date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
You could also use the .add()
function:
df.loc[:,'variance'] = df.loc[:,'budget'].add(df.loc[:,'actual'])
This is the most elegant solution which follows DRY and work absolutely great.
dataframe_name['col1', 'col2', 'col3'].sum(axis = 1, skipna = True)
Thank you.
If "budget" has any NaN values but you don't want it to sum to NaN then try:
def fun (b, a):
if math.isnan(b):
return a
else:
return b + a
f = np.vectorize(fun, otypes=[float])
df['variance'] = f(df['budget'], df_Lp['actual'])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With