Filling missing data by randomly choosing from non-missing values in a pandas DataFrame

I have a pandas DataFrame with several missing values. I noticed that the non-missing values are close to each other, so I would like to impute the missing values by randomly choosing from the non-missing values.

For instance:

import pandas as pd
import random
import numpy as np

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B':[np.nan, 4, 2, np.nan, 5]})
foo
    A   B
0   2 NaN
1   3   4
2 NaN   2   
3   5 NaN
4 NaN   5

I would like, for instance, foo['A'][2]=2 and foo['A'][4]=3. The shape of my pandas DataFrame is (6940, 154). I tried this:

foo['A'] = foo['A'].fillna(random.choice(foo['A'].values.tolist()))

But it is not working. Could you help me achieve that? Best regards.
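A minimal sketch of two likely reasons the attempt misbehaves, using the same foo as in the question: the candidate list still contains the NaN entries, and fillna with a scalar writes one and the same value into every missing cell. The names pool, nan_in_pool and distinct_fill_values are only illustrative.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B': [np.nan, 4, 2, np.nan, 5]})

# The candidate pool from the attempt is the whole column, NaNs included,
# so random.choice can return NaN and leave the column unchanged.
pool = foo['A'].values.tolist()
nan_in_pool = sum(np.isnan(v) for v in pool)

# random.choice(...) is evaluated once, before fillna runs, so fillna only
# ever sees a single scalar. Any scalar behaves like this stand-in value:
filled = foo['A'].fillna(3.0)
distinct_fill_values = filled[foo['A'].isna()].nunique()

print(nan_in_pool, distinct_fill_values)
```

So even when the draw happens to be a real number, every missing cell in the column receives that same number.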

asked Apr 04 '16 21:04

Donald Gedeon

4 Answers

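This works well for me on a pandas DataFrame:

import random

def randomiseMissingData(df2):
    "randomise missing data for DataFrame (within a column)"
    df = df2.copy()
    for col in df.columns:
        mask = df[col].isnull()
        # draw one observed value (with replacement) per missing cell
        samples = random.choices(df.loc[~mask, col].tolist(), k=int(mask.sum()))
        # write through .loc so the copy itself is modified, not a temporary view
        df.loc[mask, col] = samples
    return df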
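A possible variation on the same idea, sketched with NumPy's Generator so the draws can be made reproducible with a seed. The function name fill_na_with_random_observed and the seed parameter are hypothetical, not part of the answer. Columns with no observed values are left untouched here, which is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

def fill_na_with_random_observed(df, seed=None):
    """Return a copy of df where each NaN is replaced by a randomly chosen
    observed value from the same column (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        observed = out.loc[~mask, col].to_numpy()
        # Skip columns with nothing to fill or nothing to draw from.
        if mask.any() and observed.size > 0:
            out.loc[mask, col] = rng.choice(observed, size=int(mask.sum()))
    return out

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B': [np.nan, 4, 2, np.nan, 5]})
filled = fill_na_with_random_observed(foo, seed=0)
print(filled)
```

The original foo is not modified, and each missing cell gets its own independent draw.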

answered Nov 10 '22 17:11

Karolis

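I did this to fill NaN values with a random non-NaN value (note that a single value is drawn and used for every NaN in the column):

import random

# convert to a list first: random.choice indexes by position, which can raise KeyError on a filtered Series
df['column'] = df['column'].fillna(random.choice(df['column'].dropna().tolist()))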
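If a separate draw per missing cell is wanted instead of one shared value, one option is Series.sample with replace=True. This is only a sketch on a made-up single-column frame. The data and the random_state value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; 'column' stands for the column being filled.
df = pd.DataFrame({'column': [1.0, np.nan, 4.0, np.nan, 7.0]})

mask = df['column'].isna()
# Draw as many observed values as there are gaps, with replacement.
draws = df['column'].dropna().sample(n=int(mask.sum()), replace=True, random_state=0)
# Assign by position; the sampled index labels are irrelevant here.
df.loc[mask, 'column'] = draws.to_numpy()
print(df)
```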

answered Nov 10 '22 17:11

mohannatd

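You can combine the pandas fillna method with a random choice to fill the missing values with a random selection from a particular column. fillna does not accept a function, but it does accept a Series aligned on the index, so draw one candidate per row and let fillna use only the ones it needs:

import numpy as np
import pandas as pd

# select candidates with notna(); `!= np.nan` is always True and filters nothing
pool = df["column"][df["column"].notna()].to_numpy()
fill = pd.Series(np.random.choice(pool, size=len(df)), index=df.index)
df["column"] = df["column"].fillna(fill)

Where column is the column whose NaN values you want to fill with randomly chosen non-NaN values.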
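A small check of why notna() is the safer way to select the observed values than comparing against np.nan: NaN is not equal to anything, itself included, so an inequality test against it is True for every row. The Series s is made-up example data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# NaN != NaN evaluates to True, so this mask keeps the missing row as well.
neq_mask = s != np.nan

# notna() marks only the rows that actually hold a value.
notna_mask = s.notna()

print(neq_mask.tolist())
print(notna_mask.tolist())
```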

answered Nov 10 '22 17:11

bamdan

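This is another approach to this question, an improvement on the first answer, based on how to check whether a NumPy value is NaN as described in the NumPy documentation. Note that it draws any integer between the column's minimum and maximum (inclusive), not only values that actually occur in the column:

lo, hi = int(foo['A'].min()), int(foo['A'].max())
foo['A'] = foo['A'].apply(lambda x: np.random.choice(range(lo, hi + 1)) if np.isnan(x) else x)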
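To make the difference between drawing from a min-to-max range and drawing from the observed values concrete, here is a short comparison of the two candidate pools for column A of the question's foo. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B': [np.nan, 4, 2, np.nan, 5]})

# Values that actually occur in the column.
observed = sorted(foo['A'].dropna().unique())

# Every integer from the column minimum to the column maximum, inclusive.
lo, hi = int(foo['A'].min()), int(foo['A'].max())
range_pool = list(range(lo, hi + 1))

# Candidates a range-based draw could return that were never observed.
only_in_range = sorted(set(range_pool) - set(observed))
print(only_in_range)
```

Whether such unobserved values are acceptable depends on the data. For non-integer or categorical columns a range-based pool would not apply at all.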

answered Nov 10 '22 18:11

Espoir Murhabazi

Tags: python, pandas, missing-data
