I need to fill missing values in a pandas DataFrame by the mean value in each group. According to this question, transform can achieve this. However, transform is too slow for my purposes.
For example, take the following setup with a large DataFrame, 100 different groups and 70% NaN values:
import pandas as pd
import numpy as np
size = 10000000 # DataFrame length
ngroups = 100 # Number of Groups
randgroups = np.random.randint(ngroups, size=size) # Creation of groups
randvals = np.random.rand(size) * randgroups * 2 # Random values with mean like group number
nan_indices = np.random.permutation(range(size)) # NaN indices
nanfrac = 0.7 # Fraction of NaN values
nan_indices = nan_indices[:int(nanfrac*size)] # Take fraction of NaN indices
randvals[nan_indices] = np.nan # Set NaN values
df = pd.DataFrame({'value': randvals, 'group': randgroups}) # Create data frame
Using transform via
df.groupby("group").transform(lambda x: x.fillna(x.mean())) # Takes too long
already takes more than 3 seconds on my computer. I need something an order of magnitude faster (buying a bigger machine is not an option :-D).
So how can I fill the missing values any faster?
Here's a NumPy approach using np.bincount that's pretty efficient for such bin-based summing/averaging operations:
ids = df.group.values                               # Extract the two columns as arrays
vals = df.value.values
m = np.isnan(vals)                                  # Mask of NaNs
grp_sums = np.bincount(ids, np.where(m, 0, vals))   # Group sums with NaNs counted as 0s
avg_vals = grp_sums * (1.0 / np.bincount(ids, ~m))  # Group averages over non-NaN counts
vals[m] = avg_vals[ids[m]]                          # Write group averages into NaN positions
Note that this would update the value column in place.
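For reference, here's the same logic wrapped into the bincount_based function used in the timings below (a minimal sketch; the function name is just what the benchmark calls):
def bincount_based(df):
    # Same steps as above; fills df['value'] in place through the .values view
    ids = df.group.values
    vals = df.value.values
    m = np.isnan(vals)                                  # Mask of NaNs
    grp_sums = np.bincount(ids, np.where(m, 0, vals))   # Group sums with NaNs counted as 0s
    avg_vals = grp_sums * (1.0 / np.bincount(ids, ~m))  # Group averages over non-NaN counts
    vals[m] = avg_vals[ids[m]]                          # Write group averages into NaN positions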
Runtime test
Data sizes:
size = 1000000 # DataFrame length
ngroups = 10 # Number of Groups
Timings:
In [17]: %timeit df.groupby("group").transform(lambda x: x.fillna(x.mean()))
1 loops, best of 3: 276 ms per loop
In [18]: %timeit bincount_based(df)
100 loops, best of 3: 13.6 ms per loop
In [19]: 276.0/13.6 # Speedup
Out[19]: 20.294117647058822
20x+ speedup there!
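To check that the two approaches agree, a quick sanity test along these lines can be run (hypothetical snippet; it assumes .values returns a writable view into the frame, which holds for a single-dtype column in pre-copy-on-write pandas):
expected = df.groupby('group')['value'].transform(lambda x: x.fillna(x.mean()))
bincount_based(df)                         # Fills df['value'] in place
assert np.allclose(df['value'], expected)  # Both methods produce the same fills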
You're doing it wrong: it's slow because you're using a lambda.
df[['value']].fillna(df.groupby('group').transform('mean'))
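Passing the string 'mean' lets pandas dispatch to its cythonized groupby machinery instead of calling a Python lambda once per group. To write the filled values back into the original frame, something like this sketch works:
# Per-group means via the fast built-in aggregation, then fill NaNs with them
group_means = df.groupby('group')['value'].transform('mean')
df['value'] = df['value'].fillna(group_means)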
You are right: your code takes 3.18 s to run; the code provided by @piRSquared takes 2.78 s.
Example Code:
%%timeit
df2 = df.groupby("group").transform(lambda x: x.fillna(x.mean()))
Output:
1 loop, best of 3: 3.18 s per loop
piRSquared's improvement:
%%timeit
df[['value']].fillna(df.groupby('group').transform('mean'))
Output:
1 loop, best of 3: 2.78 s per loop
Slightly more efficient way (using a sorted index and fillna):
You can set the group column as the index of the dataframe and sort it.
df = df.set_index('group').sort_index()
Now that you have a sorted index, it's super cheap to access a subset of the dataframe by group number using df.loc[x, :].
Since you need to impute by the mean for every group, you need all the unique group ids. For this example, you could use range (since the groups are from 0 to 99), but more generally you can use:
groups = np.unique(df.index.values)
After this, you can iterate over the groups and use fillna() for imputation:
%%timeit
for x in groups:
    df.loc[x, 'value'] = df.loc[x, 'value'].fillna(np.mean(df.loc[x, 'value']))
Output:
1 loop, best of 3: 231 ms per loop
Note: the set_index, sort_index and np.unique operations are a one-time cost. To be fair to everyone, the total time (including these operations) was 2.26 s on my machine, but the imputation piece took only 231 ms.
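Putting the pieces together, the full flow (one-time setup plus the loop) looks roughly like this sketch:
df = df.set_index('group').sort_index()  # One-time cost: group ids become a sorted index
groups = np.unique(df.index.values)      # All group ids present in the data
for x in groups:
    # Each group's rows form a contiguous slice, so .loc access is cheap
    df.loc[x, 'value'] = df.loc[x, 'value'].fillna(df.loc[x, 'value'].mean())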