How to transform some columns only with SimpleImputer or equivalent

Tags: python, pandas, imputation, scikit-learn, data-science

I am taking my first steps with the scikit-learn library and found myself needing to fill in missing values in only some columns of my data frame.

I have read the documentation carefully, but I still cannot figure out how to achieve this.

To make this more specific, let's say I have:

A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]

And that I would like to fill in the second column with the mean but not the third. How can I do this with SimpleImputer (or another helper class)?

A natural follow-up question: how can I fill the second column with the mean and the last column with a constant (only for cells that had no values to begin with, obviously)?

asked Aug 13 '19 10:08

quiet-ranger

1 Answer

I am assuming you have your data as a pandas dataframe.

In that case, all you need to do to use SimpleImputer from scikit-learn is pick the specific column whose NaNs you want to impute (here with the 'most_frequent' strategy), convert it to a NumPy array, and reshape it into a column vector. To fill with the mean instead, pass strategy='mean'.

Here is an example:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# We are using the California housing dataset in this example
housing = pd.read_csv('housing.csv')
# Fill the missing values with the most frequent value of the column
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# SimpleImputer expects 2D input, so reshape the Series into a column vector,
# then flatten the (n, 1) result back to 1D before assigning it
housing['total_bedrooms'] = imp.fit_transform(housing['total_bedrooms'].to_numpy().reshape(-1, 1)).ravel()

Similarly, you can pick any other column in your dataset, convert it into a NumPy array, reshape it, and pass it to SimpleImputer.
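For the exact case in the question (mean for the second column, and optionally a constant for the third), another option that avoids the manual reshaping is scikit-learn's ColumnTransformer, which applies a transformer to selected columns only. The following is a minimal sketch on the array A from the question, not the only way to do it. The step names and the fill value of 0.0 are assumptions chosen for illustration. Every column is listed explicitly, in order, because ColumnTransformer otherwise puts the transformed columns first and the remaining ones after them.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

A = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]])

# Case 1: impute only the second column (index 1) with its mean.
# Columns 0 and 2 are passed through unchanged, so the NaN in column 2 stays.
mean_only = ColumnTransformer([
    ("keep_first", "passthrough", [0]),
    ("mean_second", SimpleImputer(strategy="mean"), [1]),
    ("keep_third", "passthrough", [2]),
])
mean_result = mean_only.fit_transform(A)

# Case 2: mean for the second column, a constant for the third.
# fill_value=0.0 is an arbitrary placeholder; use whatever constant fits your data.
mean_and_constant = ColumnTransformer([
    ("keep_first", "passthrough", [0]),
    ("mean_second", SimpleImputer(strategy="mean"), [1]),
    ("constant_third", SimpleImputer(strategy="constant", fill_value=0.0), [2]),
])
both_result = mean_and_constant.fit_transform(A)

print(mean_result)
print(both_result)
```

In both cases only the cells that were NaN are changed; existing values are left as they are. If your data is a pandas data frame, the column lists can hold column names instead of integer positions.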

answered Nov 02 '22 23:11

Amar
