
Filling missing data by randomly choosing from non-missing values in a pandas DataFrame

I have a pandas DataFrame with several missing values. I noticed that the non-missing values are close to each other, so I would like to impute the missing values by randomly choosing from the non-missing values.

For instance:

import pandas as pd
import random
import numpy as np

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B':[np.nan, 4, 2, np.nan, 5]})
foo
    A   B
0   2 NaN
1   3   4
2 NaN   2   
3   5 NaN
4 NaN   5

For instance, I would like foo['A'][2]=2 and foo['A'][4]=3. The shape of my pandas DataFrame is (6940,154). I tried this:

foo['A'] = foo['A'].fillna(random.choice(foo['A'].values.tolist()))

But it is not working. Could you help me achieve that? Best regards.

Donald Gedeon asked Apr 04 '16 21:04


4 Answers

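This works well for me on a pandas DataFrame:

import random

def randomiseMissingData(df2):
    "randomise missing data for DataFrame (within a column)"
    df = df2.copy()
    for col in df.columns:
        mask = df[col].isnull()
        if not mask.any():
            continue  # nothing to fill in this column
        # draw one random non-missing value (with replacement) per missing cell
        samples = random.choices(df.loc[~mask, col].tolist(), k=int(mask.sum()))
        df.loc[mask, col] = samples
    return df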
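
A minimal, self-contained sketch of the same per-column idea, written with NumPy's Generator API so the result can be made reproducible. The function name fill_missing_with_random_sample and its seed parameter are hypothetical, introduced only for illustration; the sample frame is the foo from the question.

```python
import numpy as np
import pandas as pd


def fill_missing_with_random_sample(df, seed=None):
    """Return a copy of df in which each NaN is replaced by a value drawn
    at random (with replacement) from the non-missing values of its column."""
    rng = np.random.default_rng(seed)  # seed is a hypothetical, optional parameter
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        observed = out.loc[~mask, col].to_numpy()
        # skip columns with nothing to fill or nothing to sample from
        if mask.any() and observed.size > 0:
            out.loc[mask, col] = rng.choice(observed, size=int(mask.sum()))
    return out


foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan],
                    'B': [np.nan, 4, 2, np.nan, 5]})
filled = fill_missing_with_random_sample(foo, seed=0)
```

Each missing cell gets its own independent draw, the original frame is left untouched, and a column that is entirely NaN is left as it is because there is nothing to sample from.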
Karolis answered Nov 10 '22 17:11


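I did this to fill the NaN values with a random non-NaN value (note that every NaN in the column gets the same randomly chosen value):

import random

# convert to a list first: random.choice indexes by position, which can fail on a Series with gaps in its index
df['column'] = df['column'].fillna(random.choice(df['column'].dropna().tolist()))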
mohannatd answered Nov 10 '22 17:11


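You can combine the pandas fillna method with NumPy's random.choice to fill the missing values with a random selection from the same column. fillna does not accept a function, so draw the random values first and pass them as a Series:

import numpy as np
import pandas as pd

# one random draw per row, taken from the non-NaN values of the column
draws = np.random.choice(df["column"].dropna().to_numpy(), size=len(df))
# fillna aligns on the index and only replaces the NaN entries
df["column"] = df["column"].fillna(pd.Series(draws, index=df.index))

Here "column" is the name of the column whose NaN values you want to fill at random with its non-NaN values.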
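
The fillna idea can also be applied to every column at once, since DataFrame.fillna accepts a DataFrame and only uses its values where the original has NaN. The following is a sketch under the assumption that each column has at least one non-missing value; the variable names draws and filled and the seed are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan],
                    'B': [np.nan, 4, 2, np.nan, 5]})

# For every column, draw len(foo) values from that column's non-missing entries.
# Assumption: no column is entirely NaN (rng.choice fails on an empty array).
draws = pd.DataFrame(
    {col: rng.choice(foo[col].dropna().to_numpy(), size=len(foo))
     for col in foo.columns},
    index=foo.index,
)

# fillna aligns on index and columns and only touches the NaN cells,
# so the draws for rows that already hold a value are simply ignored.
filled = foo.fillna(draws)
```

This draws more values than are strictly needed, which keeps the code short; for a frame of shape (6940, 154) the extra draws are cheap.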

bamdan answered Nov 10 '22 17:11


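Here is another approach, an improvement on the first answer, which uses np.isnan to check whether a value is NaN, as described in the NumPy documentation. Note that it draws a random integer between the column's minimum and maximum rather than one of the existing values:

lo, hi = int(foo['A'].min()), int(foo['A'].max())  # Series.min/max skip NaN
foo['A'] = foo['A'].apply(lambda x: np.random.randint(lo, hi + 1) if np.isnan(x) else x)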
Espoir Murhabazi answered Nov 10 '22 18:11