How can I randomly make some values missing in a pandas DataFrame, as in "Randomly insert NA's values in a pandas dataframe", but make sure that no row ends up consisting entirely of missing values?
Edit: Sorry for not stating this explicitly again (it was in the question I referenced, though): I need to be able to specify what percentage of the cells, for example 10%, is supposed to be NaN
(or rather, as close to 10% as can be achieved with the existing data frame's size), as opposed to, say, clearing cells independently with a marginal per-cell probability of 10%.
You can use DataFrame.mask with a NumPy boolean mask. The mask is built as in the answer to an earlier question of mine:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})
print (df)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
np.random.seed(100)
# each cell is independently True (to be masked) with probability 0.5
mask = np.random.choice([True, False], size=df.shape)
print (mask)
[[ True True False]
[False False False]
[ True True True]] -> problematic values - all True
# rows that are entirely True would become all-NaN, so un-mask their last column
mask[mask.all(axis=1), -1] = False
print (mask)
[[ True True False]
[False False False]
[ True True False]]
print (df.mask(mask))
A B C
0 NaN NaN 7
1 2.0 5.0 8
2 NaN NaN 9
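The mask above sets each cell independently, so it does not hit a chosen percentage. One possible way to combine DataFrame.mask with an exact cell count is sketched below. It is only an illustration under stated assumptions: the helper name random_nan_mask and the "protect one random cell per row" strategy are not part of the answer above, and the resulting masks are not uniformly distributed over all valid masks.

```python
import numpy as np
import pandas as pd


def random_nan_mask(shape, frac, rng):
    """Boolean mask with round(frac * size) True cells and no all-True row.

    Hypothetical helper, not a pandas or NumPy function.
    """
    n_rows, n_cols = shape
    n_target = int(round(frac * n_rows * n_cols))
    # Protect one randomly chosen cell per row so no row can be fully masked.
    protected = rng.integers(n_cols, size=n_rows)
    allowed = np.ones(shape, dtype=bool)
    allowed[np.arange(n_rows), protected] = False
    candidates = np.flatnonzero(allowed)  # flat indices of maskable cells
    if n_target > candidates.size:
        raise ValueError("frac is too large to keep one value per row")
    # Draw exactly n_target distinct cells from the unprotected ones.
    chosen = rng.choice(candidates, size=n_target, replace=False)
    mask = np.zeros(shape, dtype=bool)
    mask.flat[chosen] = True
    return mask


df = pd.DataFrame(np.arange(50).reshape(10, 5), columns=list("ABCDE"))
rng = np.random.default_rng(0)
mask = random_nan_mask(df.shape, 0.1, rng)
result = df.mask(mask)
print(result)
```

With 50 cells and frac=0.1, exactly 5 cells are masked, and each row keeps at least its protected value.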
Here is an answer based on "Randomly insert NA's values in a pandas dataframe":
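import collections
import random

import numpy as np

# assumes an existing DataFrame named df
replaced = collections.defaultdict(set)  # row -> columns already set to NaN
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
random.shuffle(ix)
to_replace = int(round(.1*len(ix)))  # target: 10% of all cells
for row, col in ix:
    if to_replace <= 0:  # also covers a target that rounds to zero
        break
    # leave at least one original value in every row
    if len(replaced[row]) < df.shape[1] - 1:
        df.iloc[row, col] = np.nan
        to_replace -= 1
        replaced[row].add(col)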
Shuffling puts the cell indexes in random order, and the if clause prevents any row from being replaced entirely.
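The shuffle-and-skip idea can also be wrapped in a function that takes the fraction as a parameter and leaves the input untouched. This is a minimal sketch, not the answerer's code: the name insert_nans, the seed parameter, and the conversion to float (so that every column can hold NaN) are assumptions made for illustration.

```python
import random

import numpy as np
import pandas as pd


def insert_nans(df, frac, seed=None):
    """Return a copy of df with about frac of its cells set to NaN.

    Hypothetical helper: every row keeps at least one original value.
    """
    rng = random.Random(seed)
    out = df.astype(float)  # float columns can hold NaN; astype returns a new frame
    n_rows, n_cols = out.shape
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    rng.shuffle(cells)  # visit the cells in random order
    remaining = int(round(frac * len(cells)))
    nan_per_row = [0] * n_rows
    for r, c in cells:
        if remaining <= 0:
            break
        # Skip the cell if it would be the last surviving value in its row.
        if nan_per_row[r] < n_cols - 1:
            out.iloc[r, c] = np.nan
            nan_per_row[r] += 1
            remaining -= 1
    return out


df = pd.DataFrame(np.arange(40).reshape(8, 5), columns=list("ABCDE"))
result = insert_nans(df, 0.1, seed=1)
n_missing = int(result.isna().sum().sum())
has_empty_row = bool(result.isna().all(axis=1).any())
print(n_missing, has_empty_row)
```

For this 8 x 5 frame the target is 4 cells, so n_missing is 4 and has_empty_row is False whatever the seed. If frac is so large that the target cannot be met without emptying a row, the loop simply stops short of the target.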