I am working on bigmart dataset and I would like to substitute missing values of a column based on the values of another column, practically:
Outlet_Size sales_bin
0 Medium 3000-4000
1 Medium 0-1000
2 Medium 2000-3000
3 NaN 0-1000
4 High 0-1000
... ... ...
8518 High 2000-3000
8519 NaN 0-1000
8520 Small 1000-2000
8521 Medium 1000-2000
8522 Small 0-1000
So if train[“Outlet_Size”] value is a NaN and train[“sales_bin”] is “0-1000”
train[“Outlet_Size”] value shoud become “Small”
else == Medium
But I really don’t know how to write it and all the information I found seems confusing to me
Is it possible to do it? How?
Many thanks
You can use pandas.Series.map instead of numpy.where.
pandas.Series.map seems to be handier for these simple cases, which makes multiple imputations easier and explicit with dictionaries (say {'0-1000': 'Small', '2000-3000': 'High'}).
numpy.where is designed to handle more logic (ex: if a < 5 then a^2) which is not very useful in the OP use case, but comes at some cost, like making multiple imputations tricky to handle (nested if-else).
Steps :
Example :
import pandas as pd
import numpy as np
fake_dataframe = pd.DataFrame({
'Outlet_Size' : ['Medium', 'Medium', 'Medium', np.nan, 'High', 'High', np.nan, 'Small', 'Medium', 'Small', np.nan, np.nan],
'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '0-1000', '2000-3000', '0-1000', '1000-2000', '1000-2000', '0-1000', '2000-3000', '1000-2000']
})
missing_mask = fake_dataframe['Outlet_Size'].isna()
mapping_dict = dict({'0-1000': 'Small'})
fake_dataframe.loc[missing_mask, 'Outlet_Size'] = fake_dataframe.loc[missing_mask, 'sales_bin'].map(mapping_dict)
fake_dataframe['Outlet_Size'] = fake_dataframe['Outlet_Size'].fillna('Medium')
print(fake_dataframe)
Outlet_Size sales_bin
0 Medium 3000-4000
1 Medium 0-1000
2 Medium 2000-3000
3 Small 0-1000
4 High 0-1000
5 High 2000-3000
6 Small 0-1000
7 Small 1000-2000
8 Medium 1000-2000
9 Small 0-1000
10 Medium 2000-3000
11 Medium 1000-2000
Example with multiple imputations :
import pandas as pd
import numpy as np
fake_dataframe = pd.DataFrame({
'Outlet_Size' : ['Medium', 'Medium', 'Medium', np.nan, 'High', 'High', np.nan, 'Small', 'Medium', 'Small', np.nan, np.nan],
'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '0-1000', '2000-3000', '0-1000', '1000-2000', '1000-2000', '0-1000', '2000-3000', '1000-2000']
})
missing_mask = fake_dataframe['Outlet_Size'].isna()
mapping_dict = dict({'0-1000': 'Small', '2000-3000': 'High'})
fake_dataframe.loc[missing_mask, 'Outlet_Size'] = fake_dataframe.loc[missing_mask, 'sales_bin'].map(mapping_dict)
fake_dataframe['Outlet_Size'] = fake_dataframe['Outlet_Size'].fillna('Medium')
print(fake_dataframe)
Outlet_Size sales_bin
0 Medium 3000-4000
1 Medium 0-1000
2 Medium 2000-3000
3 Small 0-1000
4 High 0-1000
5 High 2000-3000
6 Small 0-1000
7 Small 1000-2000
8 Medium 1000-2000
9 Small 0-1000
10 High 2000-3000
11 Medium 1000-2000
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With