I want to add a column to a df. The values of this new column will depend on the values of the other columns, e.g.
import pandas as pd
dc = {'A': [0, 9, 4, 5], 'B': [6, 0, 10, 12], 'C': [1, 3, 15, 18]}
df = pd.DataFrame(dc)
   A   B   C
0  0   6   1
1  9   0   3
2  4  10  15
3  5  12  18
Now I want to add another column D whose values will depend on the values of A, B and C. So, for example, if I were iterating through the df I would just do:
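# iterrows() yields (index, row) pairs; write back through df.loc,
# because assigning to the row copy does not change df
for idx, row in df.iterrows():
    if row['A'] != 0 and row['B'] != 0:
        df.loc[idx, 'D'] = (float(row['A']) / float(row['B'])) * row['C']
    elif row['C'] == 0 and row['A'] != 0 and row['B'] == 0:
        df.loc[idx, 'D'] = 250.0
    else:
        df.loc[idx, 'D'] = 20.0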
Is there a way to do this without the for loop, for example using the where() or apply() functions?
Thanks
apply should work well for you:
In [20]: def func(row):
   ....:     if (row == 0).all():
   ....:         return 250.0
   ....:     elif (row[['A', 'B']] != 0).all():
   ....:         return (float(row['A']) / row['B']) * row['C']
   ....:     else:
   ....:         return 20
   ....:
In [21]: df['D'] = df.apply(func, axis=1)
In [22]: df
Out[22]:
   A   B   C     D
0  0   6   1  20.0
1  9   0   3  20.0
2  4  10  15   6.0
3  5  12  18   7.5
[4 rows x 4 columns]
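Note that func above returns 250.0 only when every column is zero, whereas the loop in the question assigns 250.0 when C == 0, A != 0 and B == 0. The two agree on the sample data, but if the question's conditions are what you want, a minimal sketch that mirrors them branch for branch could look like this (func_strict is a hypothetical name, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 9, 4, 5], 'B': [6, 0, 10, 12], 'C': [1, 3, 15, 18]})

def func_strict(row):
    # Hypothetical helper: same three branches as the loop in the question.
    if row['A'] != 0 and row['B'] != 0:
        return (float(row['A']) / float(row['B'])) * row['C']
    elif row['C'] == 0 and row['A'] != 0 and row['B'] == 0:
        return 250.0
    else:
        return 20.0

# axis=1 passes each row to func_strict as a Series
df['D'] = df.apply(func_strict, axis=1)
print(df)
```

On the four sample rows this should give the same D column as shown above; the two functions only differ on rows such as A=9, B=0, C=0.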
.where can be much faster than .apply, so if all you're doing is if/elses then I'd aim for .where. As you're returning scalars in some cases, np.where will be easier to use than pandas' own .where.
import pandas as pd
import numpy as np
df['D'] = np.where((df.A != 0) & (df.B != 0), (df.A / df.B) * df.C,
          np.where((df.C == 0) & (df.A != 0) & (df.B == 0), 250,
          20))
   A   B   C     D
0  0   6   1  20.0
1  9   0   3  20.0
2  4  10  15   6.0
3  5  12  18   7.5
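If the nested np.where calls get hard to read as more branches are added, one alternative (not used in the original answer) is np.select, which takes a list of conditions, a matching list of values, and a default. A sketch under the same conditions as above, with conditions and choices as illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 9, 4, 5], 'B': [6, 0, 10, 12], 'C': [1, 3, 15, 18]})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    (df.A != 0) & (df.B != 0),
    (df.C == 0) & (df.A != 0) & (df.B == 0),
]
# One value (array or scalar) per condition, in the same order.
choices = [
    (df.A / df.B) * df.C,
    250.0,
]
# Rows matching no condition get the default.
df['D'] = np.select(conditions, choices, default=20.0)
print(df)
```

Like np.where, this computes (df.A / df.B) * df.C for every row and only then picks values, so rows with B == 0 are evaluated but their result is discarded.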
For a tiny df like this, you wouldn't need to worry about speed. However, on a 10,000-row df of randn values, this is almost 2000 times faster than the .apply solution above: 3 ms vs 5850 ms. That said, if speed isn't a concern, .apply can often be easier to read.
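To check the difference on your own machine, a rough timing sketch along these lines may help. The random generator, seed and use of time.perf_counter are assumptions here, not the setup used for the numbers above, and the timings will vary with hardware and pandas version:

```python
import time
import numpy as np
import pandas as pd

# 10,000 rows of standard-normal values (assumed stand-in for "a df of randn")
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.standard_normal((10000, 3)), columns=['A', 'B', 'C'])

def func(row):
    # Same function as in the .apply answer
    if (row == 0).all():
        return 250.0
    elif (row[['A', 'B']] != 0).all():
        return (float(row['A']) / row['B']) * row['C']
    else:
        return 20

start = time.perf_counter()
d_apply = big.apply(func, axis=1)
t_apply = time.perf_counter() - start

start = time.perf_counter()
d_where = np.where((big.A != 0) & (big.B != 0), (big.A / big.B) * big.C,
                   np.where((big.C == 0) & (big.A != 0) & (big.B == 0), 250, 20))
t_where = time.perf_counter() - start

print(f"apply: {t_apply * 1000:.1f} ms, np.where: {t_where * 1000:.1f} ms")
```

A single run like this is noisy; for more reliable numbers, %timeit in IPython or the timeit module repeats the measurement.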