I am working on bigmart dataset and I would like to substitute missing values of a column based on the values of another column, practically:
     Outlet_Size  sales_bin
0         Medium  3000-4000
1         Medium     0-1000
2         Medium  2000-3000
3            NaN     0-1000
4           High     0-1000
...          ...        ...
8518        High  2000-3000
8519         NaN     0-1000
8520       Small  1000-2000
8521      Medium  1000-2000
8522       Small     0-1000
So if train[“Outlet_Size”] value is a NaN and train[“sales_bin”] is  “0-1000”
train[“Outlet_Size”] value shoud become “Small”
else == Medium
But I really don’t know how to write it and all the information I found seems confusing to me
Is it possible to do it? How?
Many thanks
You can use pandas.Series.map instead of numpy.where.
pandas.Series.map seems to be handier for these simple cases, which makes multiple imputations easier and explicit with dictionaries (say {'0-1000': 'Small', '2000-3000': 'High'}).
numpy.where is designed to handle more logic (ex: if a < 5 then a^2) which is not very useful in the OP use case, but comes at some cost, like making multiple imputations tricky to handle (nested if-else).
Steps :
Example :
import pandas as pd
import numpy as np
fake_dataframe = pd.DataFrame({
    'Outlet_Size' : ['Medium', 'Medium', 'Medium', np.nan, 'High', 'High', np.nan, 'Small', 'Medium', 'Small', np.nan, np.nan],
    'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '0-1000', '2000-3000', '0-1000', '1000-2000', '1000-2000', '0-1000', '2000-3000', '1000-2000']
})
missing_mask = fake_dataframe['Outlet_Size'].isna()
mapping_dict = dict({'0-1000': 'Small'})
fake_dataframe.loc[missing_mask, 'Outlet_Size'] = fake_dataframe.loc[missing_mask, 'sales_bin'].map(mapping_dict)
fake_dataframe['Outlet_Size'] = fake_dataframe['Outlet_Size'].fillna('Medium')
print(fake_dataframe)
   Outlet_Size  sales_bin
0       Medium  3000-4000
1       Medium     0-1000
2       Medium  2000-3000
3        Small     0-1000
4         High     0-1000
5         High  2000-3000
6        Small     0-1000
7        Small  1000-2000
8       Medium  1000-2000
9        Small     0-1000
10      Medium  2000-3000
11      Medium  1000-2000
Example with multiple imputations :
import pandas as pd
import numpy as np
fake_dataframe = pd.DataFrame({
    'Outlet_Size' : ['Medium', 'Medium', 'Medium', np.nan, 'High', 'High', np.nan, 'Small', 'Medium', 'Small', np.nan, np.nan],
    'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '0-1000', '2000-3000', '0-1000', '1000-2000', '1000-2000', '0-1000', '2000-3000', '1000-2000']
})
missing_mask = fake_dataframe['Outlet_Size'].isna()
mapping_dict = dict({'0-1000': 'Small', '2000-3000': 'High'})
fake_dataframe.loc[missing_mask, 'Outlet_Size'] = fake_dataframe.loc[missing_mask, 'sales_bin'].map(mapping_dict)
fake_dataframe['Outlet_Size'] = fake_dataframe['Outlet_Size'].fillna('Medium')
print(fake_dataframe)
    Outlet_Size sales_bin
0   Medium  3000-4000
1   Medium  0-1000
2   Medium  2000-3000
3   Small   0-1000
4   High    0-1000
5   High    2000-3000
6   Small   0-1000
7   Small   1000-2000
8   Medium  1000-2000
9   Small   0-1000
10  High    2000-3000
11  Medium  1000-2000
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With