I have a df like so:
Year ID Count
1997 1 0
1998 2 0
1999 3 1
2000 4 0
2001 5 1
and I want to remove all rows before the first occurrence of 1 in Count which would give me:
Year ID Count
1999 3 1
2000 4 0
2001 5 1
I can remove all rows AFTER the first occurrence like this:
df=df.loc[: df[(df['Count'] == 1)].index[0], :]
but I can't seem to follow the slicing logic to make it do the opposite.
I'd do:
df[(df.Count == 1).idxmax():]

df.Count == 1 returns a boolean array. idxmax() will identify the index of the maximum value. I know the max value will be True and when there are more than one Trues it will return the position of the first one found. That's exactly what you want. By the way, that value is 2. Finally, I slice the dataframe for everything from 2 onward with df[2:]. I put all that in one line in the answer above.
you can use cumsum() method:
In [13]: df[(df.Count == 1).cumsum() > 0]
Out[13]:
Year ID Count
2 1999 3 1
3 2000 4 0
4 2001 5 1
Explanation:
In [14]: (df.Count == 1).cumsum()
Out[14]:
0 0
1 0
2 1
3 1
4 2
Name: Count, dtype: int32
Timing against 500K rows DF:
In [18]: df = pd.concat([df] * 10**5, ignore_index=True)
In [19]: df.shape
Out[19]: (500000, 3)
In [20]: %timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 3.7 ms per loop
In [21]: %timeit df[(df.Count == 1).cumsum() > 0]
100 loops, best of 3: 16.4 ms per loop
In [22]: %timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
The slowest run took 4.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 7.02 ms per loop
Conclusion: @piRSquared's idxmax() solution is a clear winner...
Using np.where:
df[np.where(df['Count']==1)[0][0]:]
Timings
Timings were performed on a larger version of the DataFrame:
df = pd.concat([df]*10**5, ignore_index=True)
Results:
%timeit df[np.where(df['Count']==1)[0][0]:]
100 loops, best of 3: 2.74 ms per loop
%timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 6.18 ms per loop
%timeit df[(df.Count == 1).cumsum() > 0]
10 loops, best of 3: 26.6 ms per loop
%timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
100 loops, best of 3: 11.2 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With