
How to randomly delete 10% of attribute values from a DataFrame in pandas

Tags:

python

pandas

I have an example dataset with 2000 rows and 15 columns. The last column will be used as the decision class in classification.

I need to randomly delete 10% of the attribute values, so 10% of the values in columns 0-13 should be NA.

I wrote a for loop that picks a random colNumber (0-13) and rowNumber (0-2000) and replaces that value with NA. But I think (and can see) that this is not a fast solution. I tried to find something in pandas rather than core Python, but couldn't find anything.

Maybe someone has a better idea? A more pandas-style solution? Or maybe something completely different?

martin asked Oct 23 '25 15:10


1 Answer

You can make use of pandas' sample method.

Imports and set up data

import string
import numpy as np
import pandas as pd

n = 100
data = {
    'a': np.random.random(size=n),
    'b': np.random.choice(list(string.ascii_lowercase), size=n),
    'c': np.random.random(size=n),
}

df = pd.DataFrame(data)

Solution

for col in df.columns:
    df.loc[df.sample(frac=0.1).index, col] = np.nan
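
The question asks that the last column (the decision class) stay intact, whereas the loop above touches every column. Below is a minimal sketch of the same idea restricted to the feature columns, assuming the class column is the last one. The `demo` frame, its column names, and the per-column `random_state` seeding are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: three feature columns plus a trailing class column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "f0": rng.random(100),
    "f1": rng.random(100),
    "f2": rng.random(100),
    "decision": rng.integers(0, 2, size=100),
})

feature_cols = demo.columns[:-1]  # every column except the last one
for i, col in enumerate(feature_cols):
    # A different seed per column, so each column draws its own set of rows
    rows = demo[col].sample(frac=0.1, random_state=i).index
    demo.loc[rows, col] = np.nan

# Fraction of missing values per column
nan_share = demo.isna().mean()
```

Here `nan_share` should show 0.1 for each feature column and 0.0 for `decision`. Drop the `random_state` argument if reproducibility is not needed.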

Alternative without an explicit for loop (use this instead of the loop above, not in addition to it):

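def delete_10(col):
    # Work on a copy so the caller's column is not mutated in place
    col = col.copy()
    col.loc[col.sample(frac=0.1).index] = np.nan
    return col

# apply returns a new DataFrame, so assign the result back
df = df.apply(delete_10, axis=0)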
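
If speed is the main concern, another option is to build one boolean mask with NumPy and apply it in a single step with `DataFrame.mask`. The sketch below is one possible approach, not the answer's method. It assumes a hypothetical all-numeric 2000 x 15 frame whose last column is the decision class. Note the difference in semantics: it removes exactly 10% of the feature cells overall, so the count per column can vary slightly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical stand-in for the 2000 x 15 dataset from the question
frame = pd.DataFrame(rng.random((2000, 15)))

n_rows = frame.shape[0]
n_feat = frame.shape[1] - 1          # all columns except the decision class
n_missing = int(round(0.1 * n_rows * n_feat))

# Choose n_missing distinct positions in the flattened feature block
flat_idx = rng.choice(n_rows * n_feat, size=n_missing, replace=False)
mask = np.zeros(n_rows * n_feat, dtype=bool)
mask[flat_idx] = True
mask = mask.reshape(n_rows, n_feat)

# Cells where the mask is True become NaN; the class column is re-attached untouched
features = frame.iloc[:, :-1].mask(mask)
result = pd.concat([features, frame.iloc[:, -1]], axis=1)
```

To get exactly 10% per column instead, the per-column `sample` approach above is the simpler choice.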

Check

Check to see proportion of NaN values:

df.isnull().sum() / len(df)

Output:

a    0.1
b    0.1
c    0.1
dtype: float64
Chris answered Oct 25 '25 04:10