Pandas: Filling NA values based on the distribution of existing values

Tags: python-3.x, pandas, numpy, python-2.7

I have a pandas DataFrame in which one column, signup, has many null values. The signup column is categorical and holds OS values such as iOS, Android, web, etc. I would like to fill the NA values from the existing OS values, in proportion to the existing distribution of those values.

Example: Let's say the dataset has the following count distribution of OS values:

signup
android web    14
ios web        16
mac            5
other          3
windows        6
Name: id, dtype: int64

I would like to fill the NA values based on the above distribution of the distinct OS values. The reason is to maintain the current distribution, since filling with the mode would likely skew the results. Can someone help with how to achieve this?


asked Jul 02 '17 03:07

user4943236

2 Answers

You could use something like NumPy's random.choice.

Starting with a frame fitting your description:

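import numpy as np
import pandas as pd

# Build a frame matching the printed output below
df = pd.DataFrame({
    'id': range(1, 15),
    'signup': ['mac'] * 3 + ['other'] * 2 + ['windows'] * 4 + [np.nan] * 5,
})
print(df)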
    id   signup
0    1      mac
1    2      mac
2    3      mac
3    4    other
4    5    other
5    6  windows
6    7  windows
7    8  windows
8    9  windows
9   10      NaN
10  11      NaN
11  12      NaN
12  13      NaN
13  14      NaN

Updated using piRSquared's tip in the comments. First, figure out the current distribution:

s = df.signup.value_counts(normalize=True)
print(s)
windows    0.444444
mac        0.333333
other      0.222222
Name: signup, dtype: float64
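
As a small aside, here is a sketch of what normalize=True is doing, assuming a toy column shaped like the frame above (the variable names signup, proportions and manual are just for illustration). value_counts drops NaN by default, so the proportions are computed over the non-null values only and sum to 1.

```python
import numpy as np
import pandas as pd

# Toy column mirroring the example frame: 9 known values, 5 missing
signup = pd.Series(['mac'] * 3 + ['other'] * 2 + ['windows'] * 4 + [np.nan] * 5)

# dropna=True is the default, so NaN is excluded before normalizing
proportions = signup.value_counts(normalize=True)

# Same thing by hand: raw counts divided by the number of non-null values
counts = signup.value_counts()
manual = counts / signup.notna().sum()
```

Because the NaN rows are left out, these proportions can be passed straight to a sampler as probabilities.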

Next we use boolean indexing to select the NaNs we want to update. This is also where random.choice comes in: we pass it the index of s (windows, mac, other), the number of values needed (size), and the proportion of each signup value as the probabilities (p) parameter.

missing = df['signup'].isnull()
df.loc[missing, 'signup'] = np.random.choice(s.index, size=missing.sum(), p=s.values)
print(df)

    id   signup
0    1      mac
1    2      mac
2    3      mac
3    4    other
4    5    other
5    6  windows
6    7  windows
7    8  windows
8    9  windows
9   10  windows
10  11  windows
11  12      mac
12  13  windows
13  14    other
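
If you want the fill to be repeatable, the steps above could be wrapped up roughly like this. It is only a sketch: the helper name fill_by_distribution and its seed parameter are made up for illustration, and it uses NumPy's Generator API (np.random.default_rng) in place of the np.random.choice call shown above.

```python
import numpy as np
import pandas as pd

def fill_by_distribution(series, seed=None):
    """Return a copy of `series` with NaN replaced by random draws
    weighted by the observed frequencies of the non-null values.

    Hypothetical helper; assumes the series has at least one non-null value.
    """
    rng = np.random.default_rng(seed)
    # Proportions of the existing values (NaN is excluded by default)
    probs = series.value_counts(normalize=True)
    missing = series.isna()
    draws = rng.choice(
        probs.index.to_numpy(),      # candidate values
        size=int(missing.sum()),     # one draw per missing row
        p=probs.to_numpy(),          # weighted by observed proportions
    )
    filled = series.copy()
    filled.loc[missing] = draws
    return filled

df = pd.DataFrame({
    'id': range(1, 15),
    'signup': ['mac'] * 3 + ['other'] * 2 + ['windows'] * 4 + [np.nan] * 5,
})
df['signup'] = fill_by_distribution(df['signup'], seed=0)
```

The filled values will differ from run to run unless a seed is given, and with only a few missing rows the result will only roughly follow the original proportions.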


answered Oct 01 '22 19:10

Bob Haffner

Find the nulls.
Sample from the non-null values, drawing as many values as there are nulls. Make sure to set replace=True.
Assign the sampled values to the null positions.

isnull = df.signup.isnull()
sample = df.signup.dropna().sample(isnull.sum(), replace=True).values
df.loc[isnull, 'signup'] = sample
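
For completeness, a self-contained version of the three lines above might look like the following. The example frame is assumed from the other answer, and random_state is added only to make the draw repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 15),
    'signup': ['mac'] * 3 + ['other'] * 2 + ['windows'] * 4 + [np.nan] * 5,
})

isnull = df['signup'].isnull()

# Draw one existing value per missing row; replace=True allows more
# draws than there are non-null rows
sample = df['signup'].dropna().sample(
    n=int(isnull.sum()), replace=True, random_state=0
)

# Strip the index so the values are assigned by position; the sampled
# labels come from the non-null rows and would not line up otherwise
df.loc[isnull, 'signup'] = sample.to_numpy()
```

Sampling uniformly with replacement from the non-null values picks each category with probability proportional to its count, so this follows the same distribution as the weighted random.choice approach without computing the proportions explicitly.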

answered Oct 01 '22 18:10

piRSquared
