 

How to apply Polynomial Transformation to a subset of features in scikit-learn

Scikit-learn's PolynomialFeatures facilitates polynomial feature generation.

Here is a simple example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Example data:
X = np.arange(6).reshape(3, 2)


# Works fine
poly = PolynomialFeatures(2)
pd.DataFrame(poly.fit_transform(X))

   0  1  2   3   4   5
0  1  0  1   0   0   1
1  1  2  3   4   6   9
2  1  4  5  16  20  25
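For degree 2 on two input features x0 and x1, those columns are, in order: the bias term 1, x0, x1, x0^2, x0*x1, and x1^2.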

Question: Is there any capability to only have the polynomial transformation apply to a specified list of features?

e.g.

# Use a DataFrame this time so we can mix in a string column
X2 = pd.DataFrame(X.copy())

# Categorical feature will be handled 
# by a one-hot encoder in another feature generation step
X2['animal'] = ['dog', 'dog', 'cat']

# Don't try to poly transform the animal column
poly2 = PolynomialFeatures(2, cols=[0, 1]) # <-- ("cols" is not an actual param)

# desired outcome:
pd.DataFrame(poly2.fit_transform(X2))
   0  1  2   3   4   5   'animal'
0  1  0  1   0   0   1   'dog'
1  1  2  3   4   6   9   'dog'
2  1  4  5  16  20  25   'cat'

This would be particularly useful when using the Pipeline feature to combine a long series of feature generation and model training code.

One option would be to roll your own transformer (there is a great example by Michelle Fullwood), but I figured someone else would have stumbled across this use case before.
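For reference, scikit-learn 0.20 and later ship sklearn.compose.ColumnTransformer, which covers this use case directly. A minimal sketch of the idea, assuming a recent scikit-learn:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures

# Rebuild the example frame: two numeric columns plus a categorical one
X2 = pd.DataFrame(np.arange(6).reshape(3, 2))
X2['animal'] = ['dog', 'dog', 'cat']

# Polynomial-expand only the first two columns (selected by position);
# remainder='passthrough' forwards 'animal' unchanged
ct = ColumnTransformer([('poly', PolynomialFeatures(2), [0, 1])],
                       remainder='passthrough')
pd.DataFrame(ct.fit_transform(X2))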

asked Dec 05 '17 by rmstmppr

2 Answers

Yes there is: check out sklearn-pandas.

This should work (there is probably a more elegant solution, but I can't test it right now):

from sklearn.preprocessing import PolynomialFeatures
from sklearn_pandas import DataFrameMapper

# Name the columns so the mapper can select them
X2.columns = ['col0', 'col1', 'animal']

# Polynomial-expand the numeric columns together;
# pass 'animal' through untransformed
mapper = DataFrameMapper([
    (['col0', 'col1'], PolynomialFeatures(2)),
    ('animal', None)])

X3 = mapper.fit_transform(X2)
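The detail that matters here is the list selector: ['col0', 'col1'] hands the transformer a 2-D slice, so a single PolynomialFeatures sees both columns at once and can generate the cross term as well; a bare string selector would hand it a 1-D array, and the interaction terms would be lost.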
answered Oct 04 '22 by plumbus_bouquet


In response to the answer from Peng Jun Huang: the approach is terrific, but the implementation has issues. (This should be a comment, but it's a bit long for that. Also, I don't have enough cookies for that.)

I tried to use the code and had some problems. After fooling around a bit, I found the following answer to the original question. The main issue is that the ColumnExtractor needs to inherit from BaseEstimator and TransformerMixin to turn it into an estimator that can be used with other sklearn tools.

My example data has two numerical variables and one categorical variable. I used pd.get_dummies for the one-hot encoding to keep the pipeline a bit simpler. Also, I left out the last stage of the pipeline (the estimator) because we have no y data to fit; the main point is to show how to select columns, process them separately, and join the results.

Enjoy.

M.

import pandas as pd
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

X = pd.DataFrame({'cat': ['a', 'b', 'c'], 'n1': [1, 2, 3], 'n2':[5, 7, 9] })

   cat  n1  n2
0   a   1   5
1   b   2   7
2   c   3   9

# original version had class ColumnExtractor(object)
# estimators need to inherit from these classes to play nicely with others
class ColumnExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_cols = X[self.columns]
        return X_cols
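As a quick sanity check (not in the original answer), the extractor can be exercised on its own before wiring it into the pipeline:

# Standalone check: should return just the two numeric columns
ColumnExtractor(columns=['n1', 'n2']).fit_transform(X)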

# Using pandas get_dummies to keep the pipeline a bit simpler by
# avoiding OneHotEncoder and LabelEncoder.
# Build the pipeline from a FeatureUnion that processes the
# numerical and one-hot encoded columns separately.
# FeatureUnion puts them back together when it's done.
pipe2nvars = Pipeline([
    ('features', FeatureUnion([
        ('num', Pipeline([
            ('extract', ColumnExtractor(columns=['n1', 'n2'])),
            ('poly', PolynomialFeatures())])),
        ('cat_var', ColumnExtractor(columns=['cat_b', 'cat_c']))]))
])

# now show it working...
for p in range(1, 4):
    pipe2nvars.set_params(features__num__poly__degree=p)
    res = pipe2nvars.fit_transform(pd.get_dummies(X, drop_first=True))
    print('polynomial degree: {}; shape: {}'.format(p, res.shape))
    print(res)

polynomial degree: 1; shape: (3, 5)
[[1. 1. 5. 0. 0.]
 [1. 2. 7. 1. 0.]
 [1. 3. 9. 0. 1.]]
polynomial degree: 2; shape: (3, 8)
[[ 1.  1.  5.  1.  5. 25.  0.  0.]
 [ 1.  2.  7.  4. 14. 49.  1.  0.]
 [ 1.  3.  9.  9. 27. 81.  0.  1.]]
polynomial degree: 3; shape: (3, 12)
[[  1.   1.   5.   1.   5.  25.   1.   5.  25. 125.   0.   0.]
 [  1.   2.   7.   4.  14.  49.   8.  28.  98. 343.   1.   0.]
 [  1.   3.   9.   9.  27.  81.  27.  81. 243. 729.   0.   1.]]
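A closing note on the two base classes: TransformerMixin supplies fit_transform for free (built from fit and transform), while BaseEstimator supplies get_params/set_params, which is what lets tools like GridSearchCV clone the custom step and reach nested parameters such as features__num__poly__degree through the pipeline, as the loop above does.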
answered Oct 04 '22 by leonkato