I have a DataFrame, df
, containing several columns. Some of the values in df
are NaN
. I want to replace each NaN
with a valid value, chosen by randomly sampling from other values in the given column.
For instance, if:
df[work] = [4, 7, NaN, 4]
I'd like to replace df[work][2]
with 4 2/3 of the time and 7 1/3 of the time.
Here's my attempt:
def resample_fillna(df):
for col in df.columns:
# get series consisting of non-NaN values
valid_series = df[col].dropna()
nan_indices = np.argwhere(np.isnan(df[col]))
for nan_index in nan_indices:
df[col][nan_index] = valid_series.sample(n=1)
I'm thinking there's a much better, more Pythonic way. Any thoughts?
Thanks!
Let's create some fake data and then fill the missing values with random other values from the same column.
np.random.seed(123)
data = np.random.randint(0, 10, (10,5))
df = pd.DataFrame(data, columns=list('abcde'))
df = df.where(df > 2)
df
a b c d e
0 NaN NaN 6.0 NaN 3.0
1 9.0 6.0 NaN NaN NaN
2 9.0 NaN NaN 9.0 3.0
3 4.0 NaN NaN 4.0 NaN
4 7.0 3.0 NaN 4.0 7.0
5 NaN 4.0 8.0 NaN 7.0
6 9.0 3.0 4.0 6.0 NaN
7 5.0 6.0 NaN NaN 8.0
8 3.0 5.0 NaN NaN 6.0
9 NaN 4.0 4.0 6.0 3.0
Now we can loop through each column with apply
and sample with replacement from the non-missing values.
df.apply(lambda x: np.where(x.isnull(), x.dropna().sample(len(x), replace=True), x))
a b c d e
0 5.0 3.0 6.0 6.0 3.0
1 9.0 6.0 4.0 9.0 7.0
2 9.0 5.0 8.0 9.0 3.0
3 4.0 3.0 8.0 4.0 6.0
4 7.0 3.0 4.0 4.0 7.0
5 9.0 4.0 8.0 6.0 7.0
6 9.0 3.0 4.0 6.0 3.0
7 5.0 6.0 4.0 4.0 8.0
8 3.0 5.0 4.0 4.0 6.0
9 9.0 4.0 4.0 6.0 3.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With