
Ignore a column while building a model with SKLearn

With R, one can ignore a variable (column) while building a model with the following syntax:

model = lm(dependent.variable ~ . - ignored.variable, data=my.training.set)

It's very handy when your data set contains indexes or IDs.

How would you do that with sklearn in Python, assuming your data are pandas DataFrames?

asked May 01 '14 by Mathieu


People also ask

What does fit() do in Sklearn?

The fit method trains the algorithm on the training data, after the model is initialized. That's really all it does: sklearn's fit method takes the training data as input and uses it to train the machine learning model.

How does fit() work in Python?

fit() is implemented by every estimator. It accepts an input for the sample data (X), and for supervised models it also accepts an argument for the labels (i.e. target data y). Optionally, it can also accept additional sample properties such as weights. fit methods are usually responsible for numerous operations.
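
To make that concrete, here is a minimal sketch of the fit signature; the choice of estimator, the toy data, and the weights are all invented for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy sample data (X) and labels (y)
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression()
# supervised fit: sample data plus labels, with optional per-sample weights
clf.fit(X, y, sample_weight=[1.0, 1.0, 2.0, 2.0])
print(clf.predict(X))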

When should I use Sklearn ColumnTransformer?

Use the scikit-learn ColumnTransformer class to apply preprocessing transformers such as MinMaxScaler and OneHotEncoder to numeric and categorical features simultaneously. ColumnTransformer bundles all of the transformations together into one object that can be used with scikit-learn pipelines.

What is sklearn's ColumnTransformer?

Applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space.
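
As a rough sketch of that (the DataFrame and column names below are invented for illustration), numeric and categorical preprocessing can be combined while an unlisted ID column is simply dropped:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({'Id':     [1, 2, 3],
                   'Age':    [23, 45, 31],
                   'Salary': [50000, 80000, 62000],
                   'City':   ['NY', 'SF', 'NY']})

ct = ColumnTransformer(
    transformers=[('num', MinMaxScaler(), ['Age', 'Salary']),
                  ('cat', OneHotEncoder(), ['City'])],
    remainder='drop')  # columns not listed, such as 'Id', are ignored

X = ct.fit_transform(df)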


1 Answer

This is from my own code that I used to do some prediction on StackOverflow last year:

from __future__ import division
from pandas import *
from sklearn import cross_validation
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier

basic_feature_names = [ 'BodyLength'
                      , 'NumTags'
                      , 'OwnerUndeletedAnswerCountAtPostTime'
                      , 'ReputationAtPostCreation'
                      , 'TitleLength'
                      , 'UserAge' ]

fea = # extract the features - removed for brevity
# construct our classifier (num_estimators was defined elsewhere)
clf = GradientBoostingClassifier(n_estimators=num_estimators, random_state=0)
# now fit
clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)
# now the test dataset
priv_fea = # this was my test dataset
# now calculate the predicted classes
pred = clf.predict(priv_fea[basic_feature_names])

So if I wanted a subset of the features for classification, I could have done this:

# want to train using fewer features so remove 'BodyLength'
basic_feature_names.remove('BodyLength')

clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)

So the idea here is that a list can be used to select a subset of the columns in the pandas DataFrame; we can construct a new list, or remove a value from an existing one, and use that for selection, as in the sketch below.
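
Applied to the original question, the same trick ignores an ID column directly; the DataFrame, column names, and choice of estimator below are all invented for illustration:

import pandas as pd
from sklearn.linear_model import LinearRegression

my_training_set = pd.DataFrame({'Id': [1, 2, 3, 4],
                                'x1': [0.5, 1.2, 3.3, 2.1],
                                'x2': [7.0, 3.5, 1.1, 9.8],
                                'y':  [1.1, 2.4, 3.0, 4.2]})

# keep every column except the index/ID and the target
feature_names = [c for c in my_training_set.columns if c not in ('Id', 'y')]

model = LinearRegression()
model.fit(my_training_set[feature_names], my_training_set['y'])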

I'm not sure how you could do this as easily using numpy arrays, as indexing is done differently.
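
For what it's worth, a rough numpy equivalent selects columns by position rather than by name; the array below is invented for illustration:

import numpy as np

X = np.arange(12).reshape(4, 3)   # 4 samples, 3 columns

keep = [0, 2]                     # keep columns 0 and 2, ignore column 1
X_subset = X[:, keep]

# or drop a column by its index
X_dropped = np.delete(X, 1, axis=1)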

answered Oct 20 '22 by EdChum