
Most efficient way to randomly null out values in a DataFrame

Consider the DataFrame df:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))
df


How can I nullify ~20% of these values at random?


piRSquared asked Dec 11 '22 14:12



1 Answer

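You could use numpy.random.choice to generate a boolean mask in which each cell is True with probability 0.2, then pass it to df.mask, which replaces the True cells with NaN: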

import numpy as np

mask = np.random.choice([True, False], size=df.shape, p=[.2,.8])

df.mask(mask)

In one line:

df.mask(np.random.choice([True, False], size=df.shape, p=[.2,.8]))
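Note that the mask above nulls each cell independently, so the share of NaN values is only about 20% and varies from run to run. If exactly 20% of the cells must be nulled, one possible approach (a sketch, not part of the original answer) is to build a mask with a fixed number of True entries and shuffle it. The names rng, n_null, flat, exact_mask and nulled, as well as the seed, are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))

# Seed chosen arbitrarily so the sketch is reproducible (assumption).
rng = np.random.default_rng(0)

# Number of cells to null out: 20% of all cells, rounded to an integer.
n_null = int(round(0.2 * df.size))

# Flat boolean array with exactly n_null True entries, shuffled in place.
flat = np.zeros(df.size, dtype=bool)
flat[:n_null] = True
rng.shuffle(flat)

# Reshape to the frame's shape and null out the True positions.
exact_mask = flat.reshape(df.shape)
nulled = df.mask(exact_mask)
```

Here nulled.isna().sum().sum() equals n_null on every run, whereas with the probabilistic mask the count fluctuates around 20.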

Timed with timeit, this takes about 770 μs per call on the 10x10 frame:

$ python -m timeit -n 10000 \
        -s "import pandas as pd;import numpy as np;df=pd.DataFrame(np.ones((10,10))*2)" \
        "df.mask(np.random.choice([True,False], size=df.shape, p=[.2,.8]))"
10000 loops, best of 3: 770 usec per loop
ASGM answered Feb 23 '23 08:02
