I am trying to run an ordinary linear regression in Python using scikit-learn, but I have some categorical data that I don't know exactly how to handle, especially because I imported the data using pandas' read_csv(),
and I have learned from previous experience and reading that pandas and scikit-learn don't get along quite well (yet).
My data looks like this:
Salary AtBat Hits League EastDivision
475 315 81 1 0
480 479 130 0 0
500 496 141 1 1
I want to predict Salary using AtBat, Hits, League and EastDivision, where League and EastDivision are categorical.
If I import the data via numpy's loadtxt()
I get a numpy array, which in theory I could use with sklearn, but when I use DictVectorizer I get an error. My code is:
import numpy as np
from sklearn.feature_extraction import DictVectorizer as DV

nphitters = np.loadtxt('Hitters.csv', delimiter=',', skiprows=1)
vec = DV(sparse=False)
catL = vec.fit_transform(nphitters[:, 3:4])
I get the error when I run the last line, catL = vec.fit_transform(nphitters[:, 3:4]); the error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 142, in fit_transform
self.fit(X)
File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 107, in fit
for f, v in six.iteritems(x):
File "/usr/lib/python2.7/dist-packages/sklearn/externals/six.py", line 268, in iteritems
return iter(getattr(d, _iteritems)())
AttributeError: 'numpy.ndarray' object has no attribute 'iteritems'
I don't know how to fix it, and another thing is, once I get the categorical data working, how do I run the regression? Just as if the categorical variable were another numeric variable?
I have found several questions similar to mine, but none of them have really worked for me.
Categorical variables can absolutely be used in a linear regression model.
Multiple linear regression accepts not only numerical variables, but also categorical ones. To include a categorical variable in a regression model, the variable has to be encoded as a binary variable (dummy variable). In pandas, we can easily convert a categorical variable into a dummy variable using the pandas.get_dummies() function.
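For example, a minimal sketch of the get_dummies() approach, using a made-up frame shaped like the sample data above (the string labels for League and EastDivision are assumed for illustration):

```python
import pandas as pd

# Hypothetical frame matching the sample data above,
# with League/EastDivision as string categories for illustration
df = pd.DataFrame({
    'Salary': [475, 480, 500],
    'AtBat': [315, 479, 496],
    'Hits': [81, 130, 141],
    'League': ['A', 'N', 'A'],
    'EastDivision': ['E', 'W', 'E'],
})

# get_dummies replaces each listed categorical column with 0/1 indicator columns
dummies = pd.get_dummies(df, columns=['League', 'EastDivision'])
print(dummies.columns.tolist())
```

Each category level becomes its own 0/1 column (League_A, League_N, and so on), which a linear regression can consume directly.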
You can feed categorical variables directly to a random forest using the following approach: first, convert the categories of the feature to numbers using scikit-learn's LabelEncoder; second, convert the label-encoded feature's type to string (object).
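The first step of that approach can be sketched like this (the league names are invented for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string categories for the League feature
leagues = ['National', 'American', 'National']

# LabelEncoder assigns an integer code to each distinct category
# (codes follow the sorted order of the class labels)
le = LabelEncoder()
codes = le.fit_transform(leagues)
print(list(le.classes_), list(codes))
```

Note that label encoding alone imposes an arbitrary ordering on the categories, which is acceptable for tree-based models but not for linear regression, where dummy variables are the appropriate encoding.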
In this chapter we described how categorical variables are included in a linear regression model. As regression requires numerical inputs, categorical variables need to be recoded into a set of binary variables.
Basically, what happens is that you are passing a vector of 1s and 0s to a function that expects keys and values (like a dictionary) and builds a feature table for you:
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
will become
array([[ 2., 0., 1.],
       [ 0., 1., 3.]])
or
| bar | baz | foo |
|-----|-----|-----|
| 2   | 0   | 1   |
| 0   | 1   | 3   |
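The example above can be run as-is:

```python
from sklearn.feature_extraction import DictVectorizer

# One dict per sample: keys are feature names, values are feature values
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(D)

# Feature columns come out in sorted name order: bar, baz, foo
print(vec.feature_names_)
print(X)
```

Missing keys (baz in the first dict, bar in the second) are filled with 0, which is exactly the table shown above.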
Read: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html
In your case, the data is ready for a linear regression, as the features League and EastDivision are dummies already.
It looks like .fit_transform() expects a dict, but .loadtxt() creates a numpy array. You can use .to_dict('records') after reading your data with pandas.
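A sketch of that route, with an inline frame standing in for the read_csv('Hitters.csv') call so the example is self-contained:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# In practice: df = pd.read_csv('Hitters.csv')
df = pd.DataFrame({
    'Salary': [475, 480, 500],
    'AtBat': [315, 479, 496],
    'Hits': [81, 130, 141],
    'League': [1, 0, 1],
    'EastDivision': [0, 0, 1],
})

# to_dict('records') yields one dict per row, which DictVectorizer accepts;
# casting to str makes DictVectorizer treat the values as categories
records = df[['League', 'EastDivision']].astype(str).to_dict('records')
vec = DictVectorizer(sparse=False)
X_cat = vec.fit_transform(records)
print(vec.feature_names_)
```

Each category value becomes its own indicator column (e.g. League=0, League=1), and the resulting matrix can be concatenated with the numeric predictors before fitting the regression.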