
Pandas: Filling NA values based on the distribution of existing values

I have a pandas DataFrame in which one column, signup, has many null values. The signup column is categorical and holds OS values such as iOS, android, web, etc. I would like to fill the NA values from the existing OS values, in proportion to the existing distribution of those values.

Example: Let's say the dataset has the following count distribution of OS values:

signup
android web    14
ios web        16
mac            5
other          3
windows        6
Name: id, dtype: int64

I would like to fill the NA values based on the above distribution of the distinct OS values. The reason is to preserve the current distribution, since filling with the mode would likely skew the results. Can someone help me achieve this?

user4943236 asked Jul 02 '17 03:07




2 Answers

You could use something like NumPy's random.choice.

Starting with a frame fitting your description:

import numpy as np
import pandas as pd

print(df)
    id   signup
0    1      mac
1    2      mac
2    3      mac
3    4    other
4    5    other
5    6  windows
6    7  windows
7    8  windows
8    9  windows
9   10      NaN
10  11      NaN
11  12      NaN
12  13      NaN
13  14      NaN
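
The answer prints the frame without showing how it was built. A minimal sketch that reconstructs it, assuming only the values visible in the printout above, might look like this:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame shown above: 9 known signups followed by 5 missing ones.
known = ["mac"] * 3 + ["other"] * 2 + ["windows"] * 4
df = pd.DataFrame({
    "id": range(1, 15),
    "signup": known + [np.nan] * 5,
})
print(df)
```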

Updated using piRSquared's tip in the comments. First, figure out the current distribution:

s = df.signup.value_counts(normalize=True)
print(s)
windows    0.444444
mac        0.333333
other      0.222222
Name: signup, dtype: float64

Next, we use boolean indexing to select the NaNs we want to update. This is also where random.choice comes in: we pass the index (windows, mac, other) as the candidate values, the number of missing rows as size, and the relative frequency of each signup as the probabilities parameter p.

missing = df['signup'].isnull()
df.loc[missing, 'signup'] = np.random.choice(s.index, size=len(df[missing]), p=s.values)
print(df)

    id   signup
0    1      mac
1    2      mac
2    3      mac
3    4    other
4    5    other
5    6  windows
6    7  windows
7    8  windows
8    9  windows
9   10  windows
10  11  windows
11  12      mac
12  13  windows
13  14    other
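
Because the draws are random, the filled values will differ from run to run. If reproducibility matters, the same idea can be wrapped in a small helper that takes a seed. This is only a sketch: the function name fill_by_distribution and the seed parameter are made up for illustration, and it uses NumPy's Generator API (np.random.default_rng) rather than the np.random.choice call shown above.

```python
import numpy as np
import pandas as pd

def fill_by_distribution(series, seed=None):
    """Return a copy of `series` with NaNs drawn from the observed value frequencies.

    Hypothetical helper for illustration; not part of pandas or NumPy.
    """
    filled = series.copy()
    missing = series.isnull()
    if not missing.any():
        return filled  # nothing to fill
    # value_counts ignores NaN by default, so these are frequencies of known values only.
    probs = series.value_counts(normalize=True)
    rng = np.random.default_rng(seed)
    filled.loc[missing] = rng.choice(
        probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy()
    )
    return filled

signup = pd.Series(["mac"] * 3 + ["other"] * 2 + ["windows"] * 4 + [np.nan] * 5)
filled = fill_by_distribution(signup, seed=0)
print(filled.value_counts())
```

With only a handful of missing rows the filled column will match the original proportions only approximately; the match tightens as the number of filled values grows.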
Bob Haffner answered Oct 01 '22 19:10


  • Find the nulls.
  • Sample from the non-null values, drawing as many values as there are nulls. Make sure to set replace=True.
  • Assign the sampled values to the null positions.

isnull = df.signup.isnull()
sample = df.signup.dropna().sample(isnull.sum(), replace=True).values
df.loc[isnull, 'signup'] = sample
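
For a self-contained run of the three steps above, here is a sketch on a small made-up frame. The data and the random_state value are assumptions added for reproducibility. Converting the sample to a plain array matters: assigning a Series through .loc aligns on index labels, and the sampled rows carry the labels of the non-null rows rather than the null ones.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up data): 4 known signups, 3 missing.
df = pd.DataFrame({
    "id": range(1, 8),
    "signup": ["mac", "mac", "windows", "other", np.nan, np.nan, np.nan],
})

isnull = df["signup"].isnull()

# Draw one replacement per null from the observed values, with replacement,
# so frequent values are proportionally more likely to be picked.
sample = (
    df["signup"]
    .dropna()
    .sample(n=int(isnull.sum()), replace=True, random_state=42)
    .to_numpy()  # drop the index so assignment is positional, not label-aligned
)

df.loc[isnull, "signup"] = sample
print(df)
```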
piRSquared answered Oct 01 '22 18:10