I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
1   | 1   | 10
1   | 1   | nan
1   | 1   | nan
1   | 2   | nan
1   | 2   | 20
1   | 2   | nan
1   | 3   | nan
1   | 3   | nan
I want to use columns one and two as keys: for rows that share the same keys, if column three is not entirely nan, fill the missing entries with the existing value from a row with the same keys.
Here is my desired result:
one | two | three
1   | 1   | 10
1   | 1   | 10
1   | 1   | 10
1   | 2   | 20
1   | 2   | 20
1   | 2   | 20
1   | 3   | nan
1   | 3   | nan
You can see that the rows with keys 1 and 3 stay nan, because that group has no existing value to fill from.
I have tried using groupby + fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have also tried a forward fill, which gives a rather strange result: it seems to forward fill column two instead. This is the code I am using for the forward fill:
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
If there is only one non-NaN value per group, use ffill (forward filling) and bfill (backward filling) per group, which requires apply with a lambda:
# group_keys=False keeps the original index so the result aligns with df
df['three'] = df.groupby(['one','two'], sort=False, group_keys=False)['three'].apply(lambda x: x.ffill().bfill())
print (df)
   one  two  three
0    1    1   10.0
1    1    1   10.0
2    1    1   10.0
3    1    2   20.0
4    1    2   20.0
5    1    2   20.0
6    1    3    NaN
7    1    3    NaN
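As a variant of the same idea, here is a minimal sketch using transform instead of apply. It assumes the example frame from the question, rebuilt here by hand; the name via_first is only for illustration. transform returns a result indexed like the original frame, which sidesteps the case where apply in recent pandas versions may put the group keys into the index of its result.

```python
import numpy as np
import pandas as pd

# Example frame from the question, rebuilt by hand (assumption: integer keys).
df = pd.DataFrame({
    "one": [1, 1, 1, 1, 1, 1, 1, 1],
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# Alternative: broadcast the first non-NaN value of each group, then fill gaps.
via_first = df["three"].fillna(
    df.groupby(["one", "two"])["three"].transform("first")
)

# Forward fill then backward fill inside each (one, two) group.
# transform keeps the original index, so plain assignment aligns row by row.
df["three"] = (
    df.groupby(["one", "two"], sort=False)["three"]
      .transform(lambda s: s.ffill().bfill())
)
print(df)
```

Groups where every value is NaN (keys 1 and 3 here) stay NaN in both variants, since there is nothing to fill from.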
But if there are multiple values per group and you need to replace NaN by some constant, e.g. the mean of the group:
print (df)
   one  two  three
0    1    1   10.0
1    1    1   40.0
2    1    1    NaN
3    1    2    NaN
4    1    2   20.0
5    1    2    NaN
6    1    3    NaN
7    1    3    NaN
# group_keys=False keeps the original index so the result aligns with df
df['three'] = df.groupby(['one','two'], sort=False, group_keys=False)['three'].apply(lambda x: x.fillna(x.mean()))
print (df)
   one  two  three
0    1    1   10.0
1    1    1   40.0
2    1    1   25.0
3    1    2   20.0
4    1    2   20.0
5    1    2   20.0
6    1    3    NaN
7    1    3    NaN
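The mean-by-group fill can also be sketched without a lambda, assuming the second example frame above (rebuilt by hand here). The helper name group_mean is only illustrative. transform("mean") broadcasts each group's mean back onto its rows, and fillna then uses it only where the value is missing.

```python
import numpy as np
import pandas as pd

# Second example frame from the answer, rebuilt by hand.
df = pd.DataFrame({
    "one": [1, 1, 1, 1, 1, 1, 1, 1],
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10, 40, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# Per-row mean of the row's (one, two) group; NaN values are skipped,
# and a group that is entirely NaN gets NaN as its mean.
group_mean = df.groupby(["one", "two"])["three"].transform("mean")

# Fill only the missing entries; existing values are left untouched.
df["three"] = df["three"].fillna(group_mean)
print(df)
```

Swapping "mean" for another aggregation name such as "median" follows the same pattern.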
You can sort the data by the column with missing values (NaN values are placed last by default), then group and forward fill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
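Note that sorting in place changes the row order of df. A minimal sketch of the same approach that leaves the original order intact, assuming the example frame from the question (the name filled is only illustrative): sort a temporary copy, forward fill per group, and let index alignment put the values back.

```python
import numpy as np
import pandas as pd

# Example frame from the question, rebuilt by hand.
df = pd.DataFrame({
    "one": [1, 1, 1, 1, 1, 1, 1, 1],
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# sort_values returns a sorted copy with NaN last, so within each group
# the known value comes before the gaps and ffill can reach them.
filled = (
    df.sort_values("three")
      .groupby(["one", "two"])["three"]
      .ffill()
)

# Assignment aligns on the index labels, so df keeps its original row order.
df["three"] = filled
print(df)
```

If a group holds several different non-NaN values, this fills the gaps with whichever value is last in the sorted order, so it fits best when each group has a single known value.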