Given that I have a pandas Series, I want to fill the NaNs with zero if either all the values are NaN or if all the values are either zero or NaN.
For example, I would want to fill the NaNs in the following Series with zeroes.
0 0
1 0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
But, I would not want to fillna(0) the following Series:
0 0
1 0
2 2
3 0
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
I was looking at the documentation and it seems like I could use pandas.Series.value_counts to ensure the values are only 0 and NaN, and then simply call fillna(0).In other words, I am looking to check if set(s.unique().astype(str)).issubset(['0.0','nan']), THEN fillna(0), otherwise do not.
Considering how powerful pandas is, it seemed like a there may be a better way to do this. Does anyone have any suggestions to do this cleanly and efficiently?
Potential solution thanks to cᴏʟᴅsᴘᴇᴇᴅ
if s.dropna().eq(0).all():
s = s.fillna(0)
You can compare by 0
and isna
if only NaN
s and 0
and then fillna
:
if ((s == 0) | (s.isna())).all():
s = pd.Series(0, index=s.index)
Or compare unique values:
if pd.Series(s.unique()).fillna(0).eq(0).all():
s = pd.Series(0, index=s.index)
@cᴏʟᴅsᴘᴇᴇᴅ solution, thank you - compare Series without NaN
s with dropna
:
if s.dropna().eq(0).all():
s = pd.Series(0, index=s.index)
Solution from question - need convert to string
s, because problem with compare with NaN
s:
if set(s.unique().astype(str)).issubset(['0.0','nan']):
s = pd.Series(0, index=s.index)
Timings:
s = pd.Series(np.random.choice([0,np.nan], size=10000))
In [68]: %timeit ((s == 0) | (s.isna())).all()
The slowest run took 4.85 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 574 µs per loop
In [69]: %timeit pd.Series(s.unique()).fillna(0).eq(0).all()
1000 loops, best of 3: 587 µs per loop
In [70]: %timeit s.dropna().eq(0).all()
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 774 µs per loop
In [71]: %timeit set(s.unique().astype(str)).issubset(['0.0','nan'])
The slowest run took 5.78 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 157 µs per loop
Create a mask for the null values. Check if the length of the mask equals the length of the series (in which case the series are either all null values or empty) or if the non-nulls are all equal to zero. If so, create a new series of zero values using the original index from the series.
nulls = s.isnull()
if len(nulls) == len(s) or s[~nulls].eq(0).all():
s = pd.Series(0, index=s.index)
TIMINGS
%%timeit s_ = pd.concat([s] * 100000)
nulls = s_.isnull()
if len(nulls) == len(s_) or s_[~nulls].eq(0).all():
s_ = pd.Series(0, index=s_.index)
# 100 loops, best of 3: 2.33 ms per loop
# OP's solution:
%%timeit s_ = pd.concat([s] * 100000)
if s_.dropna().eq(0).all():
s_ = s_.fillna(0)
# 10 loops, best of 3: 19.7 ms per loop
# @Jezrael's fastest solution:
%%timeit s_ = pd.concat([s] * 100000)
if set(s_.unique().astype(str)).issubset(['0.0','nan']):
s_ = pd.Series(0, index=s_.index)
# 1000 loops, best of 3: 4.58 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With