Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to efficiently fillna(0) if series is all-nan, or else remaining non-nan entries are zero?

Given that I have a pandas Series, I want to fill the NaNs with zero if either all the values are NaN or if all the values are either zero or NaN.

For example, I would want to fill the NaNs in the following Series with zeroes.

0       0
1       0
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

But, I would not want to fillna(0) the following Series:

0       0
1       0
2       2
3       0
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

I was looking at the documentation and it seems like I could use pandas.Series.value_counts to ensure the values are only 0 and NaN, and then simply call fillna(0).In other words, I am looking to check if set(s.unique().astype(str)).issubset(['0.0','nan']), THEN fillna(0), otherwise do not.

Considering how powerful pandas is, it seemed like a there may be a better way to do this. Does anyone have any suggestions to do this cleanly and efficiently?

Potential solution thanks to cᴏʟᴅsᴘᴇᴇᴅ

if s.dropna().eq(0).all():
    s = s.fillna(0)
like image 216
Lien Avatar asked Apr 23 '18 05:04

Lien


2 Answers

You can compare by 0 and isna if only NaNs and 0 and then fillna:

if ((s == 0) | (s.isna())).all():
    s = pd.Series(0, index=s.index)

Or compare unique values:

if pd.Series(s.unique()).fillna(0).eq(0).all():
    s = pd.Series(0, index=s.index)

@cᴏʟᴅsᴘᴇᴇᴅ solution, thank you - compare Series without NaNs with dropna:

 if s.dropna().eq(0).all():
    s = pd.Series(0, index=s.index)

Solution from question - need convert to strings, because problem with compare with NaNs:

if set(s.unique().astype(str)).issubset(['0.0','nan']):

    s = pd.Series(0, index=s.index)

Timings:

s = pd.Series(np.random.choice([0,np.nan], size=10000))

In [68]: %timeit ((s == 0) | (s.isna())).all()
The slowest run took 4.85 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 574 µs per loop

In [69]: %timeit pd.Series(s.unique()).fillna(0).eq(0).all()
1000 loops, best of 3: 587 µs per loop

In [70]: %timeit s.dropna().eq(0).all()
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 774 µs per loop

In [71]: %timeit set(s.unique().astype(str)).issubset(['0.0','nan'])
The slowest run took 5.78 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 157 µs per loop
like image 119
jezrael Avatar answered Oct 01 '22 15:10

jezrael


Create a mask for the null values. Check if the length of the mask equals the length of the series (in which case the series are either all null values or empty) or if the non-nulls are all equal to zero. If so, create a new series of zero values using the original index from the series.

nulls = s.isnull()
if len(nulls) == len(s) or s[~nulls].eq(0).all():
    s = pd.Series(0, index=s.index)

TIMINGS

%%timeit s_ = pd.concat([s] * 100000)
nulls = s_.isnull()
if len(nulls) == len(s_) or s_[~nulls].eq(0).all():
    s_ = pd.Series(0, index=s_.index)
# 100 loops, best of 3: 2.33 ms per loop

# OP's solution:
%%timeit s_ = pd.concat([s] * 100000)
if s_.dropna().eq(0).all():
    s_ = s_.fillna(0)
# 10 loops, best of 3: 19.7 ms per loop

# @Jezrael's fastest solution:
%%timeit s_ = pd.concat([s] * 100000)
if set(s_.unique().astype(str)).issubset(['0.0','nan']):
    s_ = pd.Series(0, index=s_.index)
# 1000 loops, best of 3: 4.58 ms per loop
like image 24
Alexander Avatar answered Oct 01 '22 13:10

Alexander