Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Drop all rows before first occurrence of a value

I have a df like so:

Year  ID Count
1997  1  0
1998  2  0
1999  3  1
2000  4  0
2001  5  1

and I want to remove all rows before the first occurrence of 1 in Count which would give me:

Year  ID Count
1999  3  1
2000  4  0
2001  5  1

I can remove all rows AFTER the first occurrence like this:

df=df.loc[: df[(df['Count'] == 1)].index[0], :]

but I can't seem to follow the slicing logic to make it do the opposite.

like image 859
Stefano Potter Avatar asked Aug 01 '16 20:08

Stefano Potter


3 Answers

I'd do:

df[(df.Count == 1).idxmax():]

enter image description here


df.Count == 1 returns a boolean array. idxmax() will identify the index of the maximum value. I know the max value will be True and when there are more than one Trues it will return the position of the first one found. That's exactly what you want. By the way, that value is 2. Finally, I slice the dataframe for everything from 2 onward with df[2:]. I put all that in one line in the answer above.

like image 148
piRSquared Avatar answered Sep 28 '22 20:09

piRSquared


you can use cumsum() method:

In [13]: df[(df.Count == 1).cumsum() > 0]
Out[13]:
   Year  ID  Count
2  1999   3      1
3  2000   4      0
4  2001   5      1

Explanation:

In [14]: (df.Count == 1).cumsum()
Out[14]:
0    0
1    0
2    1
3    1
4    2
Name: Count, dtype: int32

Timing against 500K rows DF:

In [18]: df = pd.concat([df] * 10**5, ignore_index=True)

In [19]: df.shape
Out[19]: (500000, 3)

In [20]: %timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 3.7 ms per loop

In [21]: %timeit df[(df.Count == 1).cumsum() > 0]
100 loops, best of 3: 16.4 ms per loop

In [22]: %timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
The slowest run took 4.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 7.02 ms per loop

Conclusion: @piRSquared's idxmax() solution is a clear winner...

like image 34
MaxU - stop WAR against UA Avatar answered Sep 28 '22 19:09

MaxU - stop WAR against UA


Using np.where:

df[np.where(df['Count']==1)[0][0]:]

Timings

Timings were performed on a larger version of the DataFrame:

df = pd.concat([df]*10**5, ignore_index=True)

Results:

%timeit df[np.where(df['Count']==1)[0][0]:]  
100 loops, best of 3: 2.74 ms per loop

%timeit df[(df.Count == 1).idxmax():] 
100 loops, best of 3: 6.18 ms per loop

%timeit df[(df.Count == 1).cumsum() > 0] 
10 loops, best of 3: 26.6 ms per loop

%timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
100 loops, best of 3: 11.2 ms per loop
like image 34
root Avatar answered Sep 28 '22 19:09

root