I have an example dataset with 2000 rows and 15 columns. The last column will be used as the decision class in classification.
I need to randomly delete 10% of the attribute values, so 10% of the values in columns 0-13 should be NA.
I wrote a for loop that picks a random colNumber (0-13) and rowNumber (0-2000) and replaces that value with NA. But I think (and I can see) that it's not a fast solution. I tried to find something else in pandas, rather than core Python, but couldn't find anything.
Maybe someone has a better idea? A more pandas-style solution? Or maybe something completely different?
You can make use of pandas' sample method.
import string
import numpy as np
import pandas as pd
n = 100
data = {
    'a': np.random.random(size=n),
    'b': np.random.choice(list(string.ascii_lowercase), size=n),
    'c': np.random.random(size=n),
}
df = pd.DataFrame(data)
# Option 1: for each column, set a random 10% of its rows to NaN
for col in df.columns:
    df.loc[df.sample(frac=0.1).index, col] = np.nan
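# Option 2: alternative to the loop above (use one or the other)
def delete_10(col):
    col = col.copy()  # work on a copy so the result does not depend on in-place mutation
    col.loc[col.sample(frac=0.1).index] = np.nan
    return col
df = df.apply(delete_10, axis=0)  # apply returns a new DataFrame, so assign it back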
Check the proportion of NaN values in each column:
df.isnull().sum() / len(df)
Output:
a    0.1
b    0.1
c    0.1
dtype: float64
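Note that sampling per column gives 10% missing values in every column. If the goal is instead 10% of all cells across columns 0-13, with the decision column left untouched, one possible approach is to build a random boolean mask with NumPy and apply it with DataFrame.mask. This is only a sketch: the helper name mask_random_cells, its parameters, and the demo column names are made up for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd


def mask_random_cells(df, feature_cols, frac=0.1, seed=None):
    """Return a copy of df with `frac` of the cells in feature_cols set to NaN.

    Hypothetical helper, not part of pandas.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_rows, n_cols = len(out), len(feature_cols)
    n_missing = int(round(frac * n_rows * n_cols))

    # Pick distinct flat cell positions, then turn them into a 2-D boolean mask
    flat = rng.choice(n_rows * n_cols, size=n_missing, replace=False)
    mask = np.zeros(n_rows * n_cols, dtype=bool)
    mask[flat] = True
    mask = mask.reshape(n_rows, n_cols)

    # DataFrame.mask replaces values with NaN wherever the condition is True
    out[feature_cols] = out[feature_cols].mask(mask)
    return out


# Demo data shaped like the question: 14 feature columns plus a decision column
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((2000, 14)), columns=[f"f{i}" for i in range(14)])
demo["decision"] = rng.integers(0, 2, size=2000)

result = mask_random_cells(demo, demo.columns[:-1], frac=0.1, seed=0)
```

Because the positions are drawn without replacement over the whole feature block, the total number of missing cells is fixed, while the count in any single column varies slightly around 10%.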