I have a df like so:
Year ID Count
1997 1 0
1998 2 0
1999 3 1
2000 4 0
2001 5 1
and I want to remove all rows before the first occurrence of 1
in Count
which would give me:
Year ID Count
1999 3 1
2000 4 0
2001 5 1
I can remove all rows AFTER the first occurrence like this:
df=df.loc[: df[(df['Count'] == 1)].index[0], :]
but I can't seem to follow the slicing logic to make it do the opposite.
I'd do:
df[(df.Count == 1).idxmax():]
df.Count == 1
returns a boolean array. idxmax()
will identify the index of the maximum value. I know the max value will be True
and when there are more than one True
s it will return the position of the first one found. That's exactly what you want. By the way, that value is 2
. Finally, I slice the dataframe for everything from 2
onward with df[2:]
. I put all that in one line in the answer above.
you can use cumsum() method:
In [13]: df[(df.Count == 1).cumsum() > 0]
Out[13]:
Year ID Count
2 1999 3 1
3 2000 4 0
4 2001 5 1
Explanation:
In [14]: (df.Count == 1).cumsum()
Out[14]:
0 0
1 0
2 1
3 1
4 2
Name: Count, dtype: int32
Timing against 500K rows DF:
In [18]: df = pd.concat([df] * 10**5, ignore_index=True)
In [19]: df.shape
Out[19]: (500000, 3)
In [20]: %timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 3.7 ms per loop
In [21]: %timeit df[(df.Count == 1).cumsum() > 0]
100 loops, best of 3: 16.4 ms per loop
In [22]: %timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
The slowest run took 4.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 7.02 ms per loop
Conclusion: @piRSquared's idxmax()
solution is a clear winner...
Using np.where
:
df[np.where(df['Count']==1)[0][0]:]
Timings
Timings were performed on a larger version of the DataFrame:
df = pd.concat([df]*10**5, ignore_index=True)
Results:
%timeit df[np.where(df['Count']==1)[0][0]:]
100 loops, best of 3: 2.74 ms per loop
%timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 6.18 ms per loop
%timeit df[(df.Count == 1).cumsum() > 0]
10 loops, best of 3: 26.6 ms per loop
%timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
100 loops, best of 3: 11.2 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With