Linear Regression with sklearn using categorical variables

I am trying to run an ordinary linear regression in Python using scikit-learn, but I have some categorical data that I don't know exactly how to handle. I imported the data using pandas' read_csv(), and I have learned from previous experience and reading that pandas and scikit-learn don't get along very well (yet).

My data looks like this:

Salary  AtBat   Hits    League  EastDivision
475     315     81      1       0
480     479     130     0       0
500     496     141     1       1

I want to predict Salary using AtBat, Hits, League and EastDivision, where League and EastDivision are categorical.

If I import the data via numpy's loadtxt() I get a numpy array, which in theory I could use with sklearn, but when I use DictVectorizer I get an error. My code is:

import numpy as np
from sklearn.feature_extraction import DictVectorizer as DV

nphitters=np.loadtxt('Hitters.csv',delimiter=',', skiprows=1)
vec = DV( sparse = False )
catL=vec.fit_transform(nphitters[:,3:4])

I get the error when I run the last line, catL=vec.fit_transform(nphitters[:,3:4]). The error is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 142, in fit_transform
    self.fit(X)
  File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 107, in fit
    for f, v in six.iteritems(x):
  File "/usr/lib/python2.7/dist-packages/sklearn/externals/six.py", line 268, in iteritems
    return iter(getattr(d, _iteritems)())
AttributeError: 'numpy.ndarray' object has no attribute 'iteritems'

I don't know how to fix it. Also, once I get the categorical data working, how do I run the regression? Do I treat the categorical variable just as if it were another numeric variable?

I have found several questions similar to mine, but none of them have really worked for me.

Mario Becerra asked Oct 05 '14 02:10




2 Answers

Basically, you are passing a vector of 1s and 0s to a function that expects keys and values (like a dictionary) and creates a table from them. For example:

D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]

will become

array([[ 2.,  0.,  1.],
       [ 0.,  1.,  3.]])

or

| bar | baz | foo |
|-----|-----|-----|
| 2   | 0   | 1   |
| 0   | 1   | 3   |

See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html
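The DictVectorizer example above can be reproduced with a short sketch. It assumes a scikit-learn version that exposes the `feature_names_` attribute on a fitted DictVectorizer; the variable names are only illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

# Each dict is one sample; keys are feature names, values are feature values.
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(D)

# Columns are ordered by sorted feature name: bar, baz, foo.
print(vec.feature_names_)
print(X)
```

Keys missing from a sample (such as 'baz' in the first dict) are filled with 0, which is why a plain numpy array with no keys cannot be used as input.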

In your case, the data is ready for a linear regression, as the features League and EastDivision are already dummy variables.
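As a minimal sketch of what that could look like, assuming the 0/1 columns are passed to the model as ordinary numeric features: the DataFrame below holds only the three sample rows from the question, which is far too few for a meaningful fit, so treat it purely as an illustration of the calls. The names `features`, `model` and `predictions` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The three sample rows shown in the question (illustration only).
df = pd.DataFrame({
    "Salary": [475, 480, 500],
    "AtBat": [315, 479, 496],
    "Hits": [81, 130, 141],
    "League": [1, 0, 1],
    "EastDivision": [0, 0, 1],
})

# League and EastDivision are already 0/1 dummies, so they are used
# alongside the numeric columns without further encoding.
features = ["AtBat", "Hits", "League", "EastDivision"]
X = df[features]
y = df["Salary"]

model = LinearRegression()
model.fit(X, y)

predictions = model.predict(X)
print(dict(zip(features, model.coef_)))
```

With the full Hitters.csv file, the same calls would apply after loading it with pandas' read_csv().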

Adriano Almeida answered Oct 12 '22 08:10



It looks like .fit_transform() expects a list of dicts (one per row), but .loadtxt() creates a numpy array.

You can use .to_dict(orient='records') after reading your data with pandas to get one dict per row.
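A rough sketch of that approach follows. It reads the question's three sample rows from an inline string instead of Hitters.csv so that it is self-contained, and it assumes you want DictVectorizer to one-hot encode the two categorical columns, which it only does for string values (numeric values are passed through unchanged). The names `csv_text`, `cat_cols`, `records` and `cat_matrix` are illustrative.

```python
import io

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Inline stand-in for Hitters.csv, using the sample rows from the question.
csv_text = """Salary,AtBat,Hits,League,EastDivision
475,315,81,1,0
480,479,130,0,0
500,496,141,1,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Cast the categorical columns to strings so DictVectorizer one-hot encodes
# them, then convert each row to a dict.
cat_cols = ["League", "EastDivision"]
records = df[cat_cols].astype(str).to_dict(orient="records")

vec = DictVectorizer(sparse=False)
cat_matrix = vec.fit_transform(records)

print(vec.feature_names_)
print(cat_matrix)
```

Each category becomes its own "column=value" indicator column. For a linear regression with an intercept you would normally drop one indicator per variable to avoid perfectly collinear columns, which is effectively what the original 0/1 columns already provide.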

polku answered Oct 12 '22 09:10
