Here's a simple example of the sort of thing I'm wrestling with:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: test = pd.DataFrame(np.random.randn(4,4),columns=list('ABCD'))
In [4]: for i in range(4):
....: test.iloc[i,i] = np.nan
In [5]: test
Out[5]:
A B C D
0 NaN 0.136841 -0.854138 -1.890888
1 -1.261724 NaN 0.875647 1.312823
2 1.130999 -0.208402 NaN 0.256644
3 -0.158458 -0.305250 0.902756 NaN
Now, if I use sum
to sum the rows, all the NaN
values are treated as zeros:
In [6]: test['Sum'] = test.loc[:,'A':'D'].sum(axis=1)
In [7]: test
Out[7]:
A B C D Sum
0 NaN 0.136841 -0.854138 -1.890888 -2.608185
1 -1.261724 NaN 0.875647 1.312823 0.926745
2 1.130999 -0.208402 NaN 0.256644 1.179241
3 -0.158458 -0.305250 0.902756 NaN 0.439048
But in my case, I may need to do a bit of work on the values first; for example scaling them:
In [8]: test['Sum2'] = test.A + test.B/2 - test.C/3 + test.D
In [9]: test
Out[9]:
A B C D Sum Sum2
0 NaN 0.136841 -0.854138 -1.890888 -2.608185 NaN
1 -1.261724 NaN 0.875647 1.312823 0.926745 NaN
2 1.130999 -0.208402 NaN 0.256644 1.179241 NaN
3 -0.158458 -0.305250 0.902756 NaN 0.439048 NaN
As you see, the NaN
values carry across into the arithmetic to produce NaN
output, which is what you'd expect.
Now, I don't want to replace all NaN
values in my dataframe with zeros: it is helpful to me to distinguish between zero and NaN
. I could replace NaN
with something else: I'm dealing with large volumes of student grades, and i need to distinguish between a grade of zero, and a NaN
which at the moment I'm using to indicate that the particular assessment task was not attempted. (It takes the place of what would be a blank cell in a traditional spreadsheet.) But whatever I replace the NaN
values with, it needs to be something that can be treated as zero in the operations I may perform. What are my options here?
If a column in your data frame has NaN or None values, you can use the “fillna()” or “replace()” functions to fill them by zero (0).
nan_to_num() function is used when we want to replace nan(Not A Number) with zero and inf with finite numbers in an array. It returns (positive) infinity with a very large number and negative infinity with a very small (or negative) number.
Use the fillna function
test['Sum2'] = test.A.fillna(0) + test.B.fillna(0)/2 - test.C.fillna(0)/3 + test.D.fillna(0)
If the dataframe is not huge you can try:
test["Sum"] = test.sum(axis=1)
test2 = test.fillna(0)
test["Sum2"] = test2.A + test2.B/2 - test2.C/3 + test2.D
del test2
It will be interesting to know if there is a way to do the second sum in one line only.
Update
if you have 1e5
rows or less the method I suggested is slightly faster than the one suggested by kmcodes, then things changes.
n = int(1e5)
test = pd.DataFrame(np.random.randn(n,4),columns=list('ABCD'))
for i in range(4):
test.iloc[i,i] = np.nan
%%timeit
test2 = test.fillna(0)
test["Sum2"] = test2.A + test2.B/2 - test2.C/3 + test2.D
del test2
3.95 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
test['Sum2'] = test.A.fillna(0) + test.B.fillna(0)/2 - test.C.fillna(0)/3 + test.D.fillna(0)
4.12 ms ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Update 2
I found this
In your case you can just
weights = [1, 1/2, -1/3, 1]
test["Sum2"] = test.fillna(0).mul(weights).sum(axis=1)
keep in mind that this seems to be consistently slower than the other two.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With