I have a pandas dataframe df which contains a column amount. For many rows, the amount is zero. I want to randomly remove 50% of the rows where the amount is zero, keeping all rows where amount is nonzero. How can I do this?
pandas
Using query + sample
df.drop(df.query('amount == 0').sample(frac=.5).index)
Consider the dataframe df:
import pandas as pd
df = pd.DataFrame(dict(amount=[0, 1] * 10))
df.drop(df.query('amount == 0').sample(frac=.5).index)
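To check that the one-liner does what the question asks, here is a minimal sketch on the same toy dataframe. The random_state argument and the names to_drop and result are additions for illustration, not part of the original answer; the seed only makes the draw repeatable.

```python
import pandas as pd

# Toy data from the example above: 10 zeros (even index) and 10 ones (odd index).
df = pd.DataFrame(dict(amount=[0, 1] * 10))

# Sample half of the zero rows; random_state is an assumption added for repeatability.
to_drop = df.query('amount == 0').sample(frac=.5, random_state=0).index
result = df.drop(to_drop)

print(len(result))                   # 15 rows remain
print((result.amount == 0).sum())    # 5 zero rows kept
print((result.amount != 0).sum())    # all 10 nonzero rows kept
```

Which five zero rows survive depends on the seed, but the counts do not.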
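numpy
import numpy as np
iszero = df.amount.values == 0
count_zeros = iszero.sum()
idx = np.arange(iszero.shape[0])
# pick half of the zero-row positions to keep, without replacement
keep_these = np.random.choice(idx[iszero], int(count_zeros * .5), replace=False)
# keep every nonzero row plus the sampled zero rows, in original order
df.iloc[np.sort(np.concatenate([idx[~iszero], keep_these]))]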
amount
1 1
2 0
3 1
5 1
6 0
7 1
8 0
9 1
10 0
11 1
12 0
13 1
15 1
17 1
19 1
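The numpy approach can also be written with a boolean keep-mask instead of concatenating and sorting positions. This is a sketch of a variant, not the original author's code; the seeded generator rng and the names zero_pos, drop_pos and keep are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))

rng = np.random.default_rng(0)            # seeded generator, assumed here for repeatability
iszero = df.amount.to_numpy() == 0
zero_pos = np.flatnonzero(iszero)         # integer positions of the zero rows

# Choose half of the zero rows to drop, without replacement.
drop_pos = rng.choice(zero_pos, size=len(zero_pos) // 2, replace=False)

keep = np.ones(len(df), dtype=bool)       # start by keeping every row
keep[drop_pos] = False                    # then unmark the sampled zero rows
result = df[keep]                         # boolean indexing preserves the original order

print(len(result))                        # 15
```

Because the mask is applied in row order, no explicit sort is needed.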
Per the comment from @tomcy, you can pass inplace=True to remove the rows from df directly, without having to reassign df:
df.drop(df.query('amount == 0').sample(frac=.5).index, inplace=True)
df
amount
1 1
2 0
3 1
5 1
6 0
7 1
8 0
9 1
10 0
11 1
12 0
13 1
15 1
17 1
19 1
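One thing to keep in mind with inplace=True is that drop then returns None, so its result should not be assigned back to df. A small sketch on the same toy data (the names to_drop and returned are illustrative only):

```python
import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))
to_drop = df.query('amount == 0').sample(frac=.5).index

# With inplace=True the dataframe is modified and the call itself returns None.
returned = df.drop(to_drop, inplace=True)

print(returned is None)   # True
print(len(df))            # 15
```

Writing df = df.drop(..., inplace=True) would therefore replace df with None.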