I am taking my first steps with scikit library and found myself in need of backfilling only some columns in my data frame.
I have read carefully the documentation but I still cannot figure out how to achieve this.
To make this more specific, let's say I have:
A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
And that I would like to fill in the second column with the mean but not the third. How can I do this with SimpleImputer (or another helper class)?
An evolution from this, and the natural follow up questions is: how can I fill the second column with the mean and the last column with a constant (only for cells that had no values to begin with, obviously)?
To use SimpleImputer, first import the class, and then instantiate the class with a string argument passed to the strategy parameter. For clarity, I have included 'mean' here, which is the default and therefore not necessary to explicitly include.
SimpleImputer is a scikit-learn class which is helpful in handling the missing data in the predictive model dataset. It replaces the NaN values with a specified placeholder.
You use an Imputer to handle missing data in your dataset. Imputer gives you easy methods to replace NaNs and blanks with something like the mean of the column or even median. But before it can replace these values, it has to calculate the value that will be used to replace blanks.
I am assuming you have your data as a pandas dataframe.
In this case, all you need to do to use the SimpleImputer
from scikitlearn is to pick the specific column your looking to impute nan's using say using the 'most_frequent' values, convert it to a numpy array and reshape into a column vector.
An example of this is,
## Imputing the missing values, we fill the missing values using the 'most_frequent'
# We are using the california housing dataset in this example
housing = pd.read_csv('housing.csv')
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
#Simple imputer expects a column vector, so converting the pandas Series
housing['total_bedrooms'] = imp.fit_transform(housing['total_bedrooms'].to_numpy().reshape(-1,1))
Similarly, you can pick any column in your dataset convert into a NumPy array, reshape it and use the SimpleImputer
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With