
How to transform some columns only with SimpleImputer or equivalent

I am taking my first steps with the scikit-learn library and need to fill in missing values in only some of the columns of my data frame.

I have read the documentation carefully, but I still cannot figure out how to achieve this.

To make this more specific, let's say I have:

A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]

Suppose I would like to fill in the second column with the mean, but not the third. How can I do this with SimpleImputer (or another helper class)?

A natural follow-up question: how can I fill the second column with the mean and the last column with a constant (only for cells that had no values to begin with, obviously)?

quiet-ranger asked Aug 13 '19 10:08




1 Answer

I am assuming you have your data as a pandas DataFrame.

In that case, all you need to do to use scikit-learn's SimpleImputer is pick the specific column whose NaNs you want to impute (say, with the 'most_frequent' value), convert it to a NumPy array, and reshape it into a column vector.

For example:

# Impute the missing values of one column with its most frequent value.
# This example uses the California housing dataset.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

housing = pd.read_csv('housing.csv')
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# SimpleImputer expects 2D input, so reshape the Series into a column vector
# and flatten the result before assigning it back.
housing['total_bedrooms'] = imp.fit_transform(
    housing['total_bedrooms'].to_numpy().reshape(-1, 1)
).ravel()
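
Applied to the small array from the question, the same idea might look like the sketch below. It assumes the data is first loaded into a DataFrame; the column names "a", "b" and "c" are made up for illustration and are not part of the question. Only the second column is filled with its mean, and the third is left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Array from the question; the column names are hypothetical.
A = [[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]]
df = pd.DataFrame(A, columns=["a", "b", "c"])

# Fit the imputer on column "b" only; double brackets keep the input 2D.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
df["b"] = imp.fit_transform(df[["b"]]).ravel()

print(df)
```

Selecting with double brackets (df[["b"]]) returns a one-column DataFrame, which avoids the explicit reshape.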

Similarly, you can pick any other column in your dataset, convert it into a NumPy array, reshape it, and pass it to SimpleImputer.
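
For the follow-up in the question (mean for the second column, a constant for the last), one possible alternative to imputing column by column is scikit-learn's ColumnTransformer, which applies a different transformer to each set of columns. The sketch below is only one way to do it: the transformer names and the fill value of 0 are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Array from the question.
A = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]])

# Mean for column index 1, a constant for column index 2;
# the remaining column (index 0) is passed through unchanged.
# fill_value=0 is an arbitrary choice for this example.
ct = ColumnTransformer(
    transformers=[
        ("mean_fill", SimpleImputer(strategy="mean"), [1]),
        ("const_fill", SimpleImputer(strategy="constant", fill_value=0), [2]),
    ],
    remainder="passthrough",
)

filled = ct.fit_transform(A)
# The output columns follow the order of `transformers`, with passthrough
# columns appended last, so here the order is [col 1, col 2, col 0].
print(filled)
```

Note that the column order of the result differs from the input. To fill only the second column and leave the third as it is, drop the "const_fill" entry so that the third column is passed through as well.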

Amar answered Nov 02 '22 23:11
