
Randomly introduce NaN values in pandas dataframe

How can I randomly introduce NaN values into each column of my dataset, taking into account the null values already present in my starting data?

I want, for example, 20% NaN values per column.

For example:
If my dataset has 3 columns, "A", "B" and "C", each with its own existing NaN rate, how do I randomly introduce NaN values per column so that each column reaches 20%?

A: 10% nan
B: 15% nan
C: 8% nan

So far I have tried this code, but it degrades my dataset too much and I don't think it is the right approach:

df = df.mask(np.random.choice([True, False], size=df.shape, p=[.20,.80]))
Ib D asked Jan 23 '19 15:01



2 Answers

I am not sure what you mean by the last part ("degrades too much"), but here is a rough way to do it.

import numpy as np
import pandas as pd

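A = pd.Series(np.arange(99), dtype="float64")  # float dtype so NaN can be stored without an upcast

# Original missing rate (for illustration)
nanidx = A.sample(frac=0.1).index
A[nanidx] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere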

###
# Complementing to 20%
# Original ratio
ori_rat = A.isna().mean()

# Adjusting for the dataframe without missing values
add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)

nanidx2 = A.dropna().sample(frac=add_miss_rat).index
A[nanidx2] = np.nan

A.isna().mean()

Obviously, it will not always be exactly 20%, because sample(frac=...) rounds to a whole number of rows.
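If the small deviation matters, one option is to work with counts instead of fractions. The following is only a sketch under a few assumptions: the helper name `top_up_nans` is made up for illustration, the index is assumed to be unique, and "20%" is taken to mean `round(target * len(s))` missing values.

```python
import numpy as np
import pandas as pd


def top_up_nans(s, target=0.2, seed=None):
    """Return a copy of `s` with NaNs added until round(target * len(s)) values are missing.

    Hypothetical helper, not part of pandas. Assumes a unique index.
    """
    s = s.copy()
    n_missing = int(s.isna().sum())
    n_target = int(round(target * len(s)))
    n_extra = n_target - n_missing
    if n_extra <= 0:
        # Already at or above the target: leave the data untouched.
        return s
    # Draw the extra positions only from values that are not missing yet.
    idx = s.dropna().sample(n=n_extra, random_state=seed).index
    s.loc[idx] = np.nan
    return s


A = pd.Series(np.arange(100), dtype="float64")
A[A.sample(frac=0.1, random_state=0).index] = np.nan  # start with 10 missing values
B = top_up_nans(A, target=0.2, seed=1)
print(A.isna().sum(), B.isna().sum())
```

Sampling with `n=` rather than `frac=` fixes the number of added NaNs directly, so the only rounding left is in turning the target rate into a whole number of rows.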

Update: Applying it to the whole dataframe

for col in df:
    ori_rat = df[col].isna().mean()

    if ori_rat >= 0.2: continue

    add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
    vals_to_nan = df[col].dropna().sample(frac=add_miss_rat).index
    df.loc[vals_to_nan, col] = np.nan

Update 2: I corrected the ratio so that it accounts for the sample being drawn only from the non-missing values (that is, after dropna()).
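To see why the adjustment is needed, here is a small arithmetic check. It is a sketch only: the function name `added_fraction` is invented for illustration, and the rates are the ones from the question. Blanking 20% of all cells at random, as in the question's mask, only hits the cells that still hold data, so a column that starts at 10% ends up near 28% rather than 20%.

```python
def added_fraction(current, target=0.2):
    """Fraction of the non-missing values to blank so the column reaches `target`.

    Hypothetical helper; mirrors (0.2 - ori_rat) / (1 - ori_rat) from the answer.
    """
    if current >= target:
        return 0.0
    return (target - current) / (1 - current)


for name, current in {"A": 0.10, "B": 0.15, "C": 0.08}.items():
    frac = added_fraction(current)
    # Missing rate after blanking `frac` of the values that were still present.
    adjusted = current + frac * (1 - current)
    # Missing rate after an unadjusted 20% mask over every cell.
    naive = current + 0.20 * (1 - current)
    print(f"{name}: blank {frac:.3f} of non-missing -> {adjusted:.3f}; naive mask -> {naive:.3f}")
```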

nocibambi answered Sep 30 '22 20:09



Unless you have a giant DataFrame and speed is a concern, the easy-peasy way to do it is by iteration.

import pandas as pd
import numpy as np
import random

df = pd.DataFrame({'A': list(range(100)), 'B': list(range(100)), 'C': list(range(100))}, dtype='float64')
# before adding NaN
print(df.head(10))

nan_percent = {'A': 0.10, 'B': 0.15, 'C': 0.08}

for col in df:
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for i, row_value in df[col].items():
        if random.random() <= nan_percent[col]:
            # .loc avoids the chained assignment df[col][i] = ...
            df.loc[i, col] = np.nan
# after adding NaN
print(df.head(10))
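If speed does become a concern, a loop-free variant is possible. The sketch below is one way to do it, not the method used above: the function name `top_up_frame` and the 20% target are assumptions taken from the question, and cells are picked by ranking random scores within each column so that cells that are already NaN are never chosen.

```python
import numpy as np
import pandas as pd


def top_up_frame(df, target=0.2, seed=None):
    """Return a copy of `df` where each column has round(target * len(df)) NaNs.

    Hypothetical helper. Columns already at or above the target are left as they are.
    """
    rng = np.random.default_rng(seed)
    missing = df.isna().to_numpy()
    # Extra NaNs needed per column, never negative.
    need = np.clip(int(round(target * len(df))) - missing.sum(axis=0), 0, None)
    # One random score per cell; existing NaNs get +inf so they rank last.
    scores = rng.random(df.shape)
    scores[missing] = np.inf
    # Rank of each cell within its column (0 = smallest score).
    ranks = scores.argsort(axis=0).argsort(axis=0)
    # Blank the `need` lowest-ranked cells of every column.
    return df.mask(ranks < need)


df = pd.DataFrame({"A": range(100), "B": range(100), "C": range(100)}, dtype="float64")
df.loc[:9, "A"] = np.nan    # 10% missing
df.loc[:14, "B"] = np.nan   # 15% missing
df.loc[:7, "C"] = np.nan    # 8% missing

out = top_up_frame(df, target=0.2, seed=0)
print(out.isna().mean())
```

The double `argsort` turns the random scores into per-column ranks, so comparing against `need` broadcasts one threshold per column and selects exactly that many non-missing cells.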
alec_djinn answered Sep 30 '22 19:09
