Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Drop Rows of an id after a particular column value in Pandas

I have a dataset like:

Id   Status

1     0
1     0
1     0
1     0
1     1
2     0
1     0
2     0
3     0
3     0

I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:

Id   Status

1     0
1     0
1     0
1     0
1     1
2     0
2     0
3     0
3     0

i.e.

1     0   --> gets removed since this row appears after id 1 already had a status of 1

How to implement it efficiently since I have a very large (200 GB+) dataset.

Thanks for your help.

like image 289
tstseby Avatar asked Apr 06 '19 03:04

tstseby


2 Answers

Here's an idea;

You can create a dict with the first index where the status is 1 for each ID (assuming the DataFrame is sorted by ID):

d = df.loc[df["Status"]==1].drop_duplicates()
d = dict(zip(d["Id"], d.index))

Then you create a column with the first status=1 for each Id:

df["first"] = df["Id"].map(d)

Finally you drop every row where the index is less than than the first column:

df = df.loc[df.index<df["first"]]
like image 174
Toby Petty Avatar answered Nov 14 '22 00:11

Toby Petty


EDIT: Revisiting this question a month later, there is actually a much simpler way with groupby and cumsum: Just group by Id and take the cumsum of Status, then drop the values where the cumsum is more than 0:

df[df.groupby('Id')['Status'].cumsum() < 1]

The best way I have found is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:

def remove(series):
    indexless = series.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        return indexless.iloc[:ones.index[0] + 1]

    else:
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)

Output:

   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
6   2       0
7   3       0
8   3       0
like image 30
gmds Avatar answered Nov 14 '22 01:11

gmds