
How to normalize only certain columns in scikit-learn?

I have data similar to the following:

[
   [0, 4, 15],
   [0, 3, 7],
   [1, 5, 9],
   [2, 4, 15]
]

I used OneHotEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.fit_transform) to preprocess this data so that it is suitable for linear regression, which gives me this:

[
   [1, 0, 0, 4, 15],
   [1, 0, 0, 3, 7],
   [0, 1, 0, 5, 9],
   [0, 0, 1, 4, 15]
]

However, I then wish to normalise this data.

So far I am just normalising the data like so:

preprocessing.normalize(data)

However, this normalises all the columns, including the category ones.

My questions are the following:

  • How do I normalise only certain columns?
  • Is it desirable to normalise category data, or should I avoid it?

Thank you!

Yahya Uddin asked Jan 31 '16 00:01



2 Answers

Use NumPy indexing to pass only a slice of your data to normalize. As for whether category data should be normalized, you will probably get a better answer to that question on Cross Validated.

Example for your first question:

In [1]: import numpy as np
        from sklearn.preprocessing import normalize

        # Values as floats or normalize raises a type error
        X1 = np.array([
                      [1., 0., 0., 4., 15.],
                      [1., 0., 0., 3., 7.],
                      [0., 1., 0., 5., 9.],
                      [0., 0., 1., 4., 15.],
                      ])

In [2]: X1[:, [3,4]] # last two columns
Out[2]: array([[  4.,  15.],
               [  3.,   7.],
               [  5.,   9.],
               [  4.,  15.]])

Normalize the last two columns (with axis=0, each column is scaled to unit L2 norm) and assign the result to a new numpy array, X2.

In [3]: X2 = normalize(X1[:, [3,4]], axis=0) #axis=0 for column-wise
        X2
Out[3]: array([[ 0.49236596,  0.6228411 ],
               [ 0.36927447,  0.29065918],
               [ 0.61545745,  0.37370466],
               [ 0.49236596,  0.6228411 ]])

Now concatenate the first three (one-hot) columns of X1 with X2 to get your desired output.

In [4]: np.concatenate(( X1[:,[0,1,2]], X2), axis=1)
Out[4]: array([[ 1.        ,  0.        ,  0.        ,  0.49236596,  0.6228411 ],
               [ 1.        ,  0.        ,  0.        ,  0.36927447,  0.29065918],
               [ 0.        ,  1.        ,  0.        ,  0.61545745,  0.37370466],
               [ 0.        ,  0.        ,  1.        ,  0.49236596,  0.6228411 ]])
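
The same steps can also be wrapped in a small helper that writes the normalized columns back into a copy of the array, so no manual concatenation is needed. This is only a sketch: the function name normalize_columns is made up for illustration, and it assumes you want the same column-wise L2 scaling as above.

```python
import numpy as np
from sklearn.preprocessing import normalize


def normalize_columns(X, cols):
    """Return a float copy of X with only the columns in `cols` scaled to unit L2 norm.

    Hypothetical helper, not part of scikit-learn.
    """
    X = np.array(X, dtype=float)  # np.array copies, so the caller's data is untouched
    X[:, cols] = normalize(X[:, cols], axis=0)  # axis=0: scale each column independently
    return X


data = [
    [1, 0, 0, 4, 15],
    [1, 0, 0, 3, 7],
    [0, 1, 0, 5, 9],
    [0, 0, 1, 4, 15],
]
result = normalize_columns(data, [3, 4])
print(result)
```

The one-hot columns 0 to 2 come back unchanged, and columns 3 and 4 hold the same values as X2 above.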
Kevin answered Oct 31 '22 10:10



If you're using a pandas.DataFrame, you might want to check out sklearn-pandas.
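
If you would rather avoid the extra dependency, scikit-learn itself ships sklearn.compose.ColumnTransformer (since version 0.20, so after this question was asked), which can select DataFrame columns by name. Below is a rough sketch rather than a recommendation: the column names are invented for illustration, and it uses StandardScaler (zero mean, unit variance per column), which is a different scaling from the L2 normalize call in the other answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical column names; the original question has no headers.
df = pd.DataFrame(
    [
        [1, 0, 0, 4, 15],
        [1, 0, 0, 3, 7],
        [0, 1, 0, 5, 9],
        [0, 0, 1, 4, 15],
    ],
    columns=["cat_0", "cat_1", "cat_2", "num_a", "num_b"],
)

one_hot_cols = ["cat_0", "cat_1", "cat_2"]
numeric_cols = ["num_a", "num_b"]

# Transformers are applied in the order listed, and their outputs are
# stacked side by side, so the one-hot columns stay first.
ct = ColumnTransformer(
    [
        ("keep", "passthrough", one_hot_cols),    # leave one-hot columns untouched
        ("scale", StandardScaler(), numeric_cols),  # standardize numeric columns only
    ]
)

scaled = ct.fit_transform(df)
print(scaled)
```

Because the transformer is fitted, the same scaling can later be applied to new rows with ct.transform, which the plain normalize call does not offer.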

Dror answered Oct 31 '22 11:10
