I am trying to run an ordinary linear regression in Python using scikit-learn, but I have some categorical data that I don't know exactly how to handle, especially because I imported the data using pandas' read_csv(),
and I have learned from previous experience and reading that pandas and scikit-learn don't get along quite well (yet).
My data looks like this:
Salary AtBat Hits League EastDivision
475 315 81 1 0
480 479 130 0 0
500 496 141 1 1
I want to predict Salary using AtBat, Hits, League and EastDivision, where League and EastDivision are categorical.
If I import the data via numpy's loadtxt()
I get a numpy array, which in theory I could use with sklearn, but when I use DictVectorizer I get an error. My code is:
import numpy as np
from sklearn.feature_extraction import DictVectorizer as DV

nphitters = np.loadtxt('Hitters.csv', delimiter=',', skiprows=1)
vec = DV(sparse=False)
catL = vec.fit_transform(nphitters[:, 3:4])
I get the error when I run the last line, catL = vec.fit_transform(nphitters[:, 3:4]); the error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 142, in fit_transform
self.fit(X)
File "/usr/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 107, in fit
for f, v in six.iteritems(x):
File "/usr/lib/python2.7/dist-packages/sklearn/externals/six.py", line 268, in iteritems
return iter(getattr(d, _iteritems)())
AttributeError: 'numpy.ndarray' object has no attribute 'iteritems'
I don't know how to fix it, and another thing is, once I get the categorical data working, how do I run the regression? Just as if the categorical variable were another numeric variable?
I have found several questions similar to mine, but none of them have really worked for me.
Categorical variables can absolutely be used in a linear regression model.
Multiple linear regression accepts not only numerical variables, but also categorical ones. To include a categorical variable in a regression model, the variable has to be encoded as a binary variable (dummy variable). In pandas, we can easily convert a categorical variable into a dummy variable using the pandas.get_dummies() function.
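For example, a minimal sketch of the get_dummies() approach, using a made-up frame shaped like the sample data above (the string labels for League and EastDivision are assumed for illustration):

```python
import pandas as pd

# Hypothetical frame matching the sample data above,
# with League/EastDivision as string categories for illustration
df = pd.DataFrame({
    'Salary': [475, 480, 500],
    'AtBat': [315, 479, 496],
    'Hits': [81, 130, 141],
    'League': ['A', 'N', 'A'],
    'EastDivision': ['E', 'W', 'E'],
})

# get_dummies replaces each listed categorical column with 0/1 indicator columns
dummies = pd.get_dummies(df, columns=['League', 'EastDivision'])
print(dummies.columns.tolist())
```

Each category level becomes its own 0/1 column (League_A, League_N, and so on), which a linear regression can consume directly.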
You can feed categorical variables directly to a random forest using the following approach: first, convert the categories of the feature to numbers using scikit-learn's LabelEncoder; second, convert the label-encoded feature's type to string (object).
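The first step of that approach can be sketched like this (the league names are invented for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string categories for the League feature
leagues = ['National', 'American', 'National']

# LabelEncoder assigns an integer code to each distinct category
# (codes follow the sorted order of the class labels)
le = LabelEncoder()
codes = le.fit_transform(leagues)
print(list(le.classes_), list(codes))
```

Note that label encoding alone imposes an arbitrary ordering on the categories, which is acceptable for tree-based models but not for linear regression, where dummy variables are the appropriate encoding.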
In this chapter we described how categorical variables are included in a linear regression model. As regression requires numerical inputs, categorical variables need to be recoded into a set of binary variables.
Basically, what happens is that you are passing a vector of 1s and 0s to a function that expects keys and values (like a dictionary) and builds a feature table for you:
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
will become
array([[ 2., 0., 1.],
       [ 0., 1., 3.]])
or
| bar | baz | foo |
|-----|-----|-----|
| 2   | 0   | 1   |
| 0   | 1   | 3   |
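The example above can be run as-is:

```python
from sklearn.feature_extraction import DictVectorizer

# One dict per sample: keys are feature names, values are feature values
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(D)

# Feature columns come out in sorted name order: bar, baz, foo
print(vec.feature_names_)
print(X)
```

Missing keys (baz in the first dict, bar in the second) are filled with 0, which is exactly the table shown above.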
Read: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html
In your case, the data is ready for a linear regression, as the features League and EastDivision are dummies already.
It looks like .fit_transform() expects a dict, but .loadtxt() creates a numpy array. You can use .to_dict('records') after reading your data with pandas.
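A sketch of that route, with an inline frame standing in for the read_csv('Hitters.csv') call so the example is self-contained:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# In practice: df = pd.read_csv('Hitters.csv')
df = pd.DataFrame({
    'Salary': [475, 480, 500],
    'AtBat': [315, 479, 496],
    'Hits': [81, 130, 141],
    'League': [1, 0, 1],
    'EastDivision': [0, 0, 1],
})

# to_dict('records') yields one dict per row, which DictVectorizer accepts;
# casting to str makes DictVectorizer treat the values as categories
records = df[['League', 'EastDivision']].astype(str).to_dict('records')
vec = DictVectorizer(sparse=False)
X_cat = vec.fit_transform(records)
print(vec.feature_names_)
```

Each category value becomes its own indicator column (e.g. League=0, League=1), and the resulting matrix can be concatenated with the numeric predictors before fitting the regression.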