 

Pandas: Fill missing values by mean in each group faster than transform

I need to fill missing values in a pandas DataFrame with the mean value of each group. According to this question, transform can achieve this.

However, transform is too slow for my purposes.

For example, take the following setup: a large DataFrame with 100 different groups and 70% NaN values:

import pandas as pd
import numpy as np

size = 10000000  # DataFrame length
ngroups = 100  # Number of Groups

randgroups = np.random.randint(ngroups, size=size)  # Creation of groups
randvals = np.random.rand(size) * randgroups * 2    # Random values with mean like group number
nan_indices = np.random.permutation(range(size))    # NaN indices
nanfrac = 0.7                                       # Fraction of NaN values
nan_indices = nan_indices[:int(nanfrac*size)]       # Take fraction of NaN indices
randvals[nan_indices] = np.NaN                      # Set NaN values

df = pd.DataFrame({'value': randvals, 'group': randgroups})  # Create data frame

Using transform via

df.groupby("group").transform(lambda x: x.fillna(x.mean())) # Takes too long

already takes more than 3 seconds on my computer. I need something at least an order of magnitude faster (buying a bigger machine is not an option :-D).

So how can I fill the missing values any faster?

asked Nov 18 '16 by SmCaterpillar




3 Answers

Here's a NumPy approach using np.bincount that's pretty efficient for such bin-based summing/averaging operations:

ids = df.group.values                              # Extract the two columns as arrays
vals = df.value.values

m = np.isnan(vals)                                 # Mask of NaNs
grp_sums = np.bincount(ids, np.where(m, 0, vals))  # Per-group sums, counting NaNs as 0
avg_vals = grp_sums / np.bincount(ids, ~m)         # Per-group means (sum / non-NaN count)
vals[m] = avg_vals[ids[m]]                         # Write each group's mean into its NaN slots

Note that this would update the value column.
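
The timings below call a function named bincount_based. The answer doesn't show the wrapper itself, so here is a minimal sketch of it, reusing the imports from the question's setup (the explicit write-back at the end is an assumption, guarding against .values returning a copy):

def bincount_based(df):
    ids = df['group'].values
    vals = df['value'].values
    m = np.isnan(vals)                                 # Mask of NaNs
    grp_sums = np.bincount(ids, np.where(m, 0, vals))  # Per-group sums, NaNs as 0
    avg_vals = grp_sums / np.bincount(ids, ~m)         # Per-group means
    vals[m] = avg_vals[ids[m]]                         # Fill NaN slots
    df['value'] = vals                                 # Write back (assumption)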

Runtime test

Data sizes:

size = 1000000  # DataFrame length
ngroups = 10  # Number of Groups

Timings:

In [17]: %timeit df.groupby("group").transform(lambda x: x.fillna(x.mean()))
1 loops, best of 3: 276 ms per loop

In [18]: %timeit bincount_based(df)
100 loops, best of 3: 13.6 ms per loop

In [19]: 276.0/13.6  # Speedup
Out[19]: 20.294117647058822

20x+ speedup there!

answered Oct 18 '22 by Divakar


You're doing it wrong. It's slow because you're using a lambda; pass the built-in 'mean' to transform instead:

df[['value']].fillna(df.groupby('group').transform('mean'))
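
Note that this expression returns a new DataFrame rather than modifying df in place. A minimal sketch of writing the filled values back (the assignment itself is an addition, not part of the original answer):

# Fill NaNs with each row's group mean, then assign the column back
df['value'] = df['value'].fillna(df.groupby('group')['value'].transform('mean'))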
answered Oct 18 '22 by piRSquared


Using a Sorted Index + fillna()

You are right - your code takes 3.18s to run. The code provided by @piRSquared takes 2.78s to run.

  1. Original code:

     %%timeit
     df2 = df.groupby("group").transform(lambda x: x.fillna(x.mean()))
     Output: 1 loop, best of 3: 3.18 s per loop

  2. piRSquared's improvement:

     %%timeit
     df[['value']].fillna(df.groupby('group').transform('mean'))
     Output: 1 loop, best of 3: 2.78 s per loop

  3. Slightly more efficient way (using a sorted index and fillna()):

You can set the group column as the index of the dataframe, and sort it.

df = df.set_index('group').sort_index()

Now that you have a sorted index, it's very cheap to access the subset of the DataFrame belonging to any group number, using df.loc[x, :].
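
For illustration (group 5 is an arbitrary example id from the setup above):

df.loc[5, 'value']   # all 'value' entries of group 5, as one contiguous slice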

Since you need to impute the mean for every group, you need all the unique group ids. For this example you could use range(ngroups) (since the groups run from 0 to 99), but more generally you can use:

groups = np.unique(df.index)  # unique group ids

After this, you can iterate over the groups and use fillna() for imputation:

%%timeit
for x in groups:
    df.loc[x, 'value'] = df.loc[x, 'value'].fillna(np.mean(df.loc[x, 'value']))

Output: 1 loop, best of 3: 231 ms per loop

Note: the set_index, sort_index and np.unique operations are a one-time cost. To be fair to everyone, the total time (including these operations) was 2.26 s on my machine, but the imputation piece took only 231 ms.

answered Oct 18 '22 by Shivam Gaur