I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.
I understand, of course, that I need to encode it.
What I don't understand is how to pass the encoded feature to the logistic regression so that it's processed as a categorical feature, rather than having the int value it got during encoding interpreted as a standard quantifiable feature.
(Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary, or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject, but not very deeply, especially regarding the first one!
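For reference, here is roughly how I understand the three options on a tiny made-up example (so I may well be misusing them):

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction import DictVectorizer

colors = ['red', 'green', 'blue', 'green']

# Option 1: LabelEncoder -- maps each category to an integer
le = LabelEncoder()
print(le.fit_transform(colors))  # [2 1 0 1]

# Option 2: DictVectorizer -- one-hot encodes dict records
dv = DictVectorizer(sparse=False)
print(dv.fit_transform([{'color': c} for c in colors]))
print(dv.vocabulary_)  # {'color=blue': 0, 'color=green': 1, 'color=red': 2}

# Option 3: a plain dict mapping applied by hand
mapping = {'red': 0, 'green': 1, 'blue': 2}
print([mapping[c] for c in colors])  # [0, 1, 2, 1]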
In recent sklearn versions you can use le.fit for categorical variables with more than two classes.
Yes, you can train a logistic regression model on categorical data. Each feature will basically be on/off, which actually simplifies things.
Logistic regression is a pretty flexible method. It can readily use categorical variables as independent variables, and most software that implements logistic regression should let you do so.
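For example, here is a minimal sketch (the column names and data are made up) of one-hot encoding a categorical column with pandas and fitting LogisticRegression on the result:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy data: 'city' is categorical, 'age' is numeric, 'bought' is the target
df = pd.DataFrame({
    'city': ['london', 'paris', 'paris', 'berlin', 'london', 'berlin'],
    'age':  [23, 35, 41, 29, 52, 33],
    'bought': [0, 1, 1, 0, 1, 0],
})

# one-hot encode 'city' so each category becomes its own 0/1 column
X = pd.get_dummies(df[['city', 'age']], columns=['city'], drop_first=True)
y = df['bought']

model = LogisticRegression()
model.fit(X, y)
print(model.predict(X))

Because each category is now a separate 0/1 column, the model never treats the categories as an ordered numeric scale.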
Suppose the type of each categorical column is "object". First, you can create a pandas Index of the categorical column names:
import pandas as pd

# names of all columns whose dtype is 'object' (the categorical ones)
catColumns = df.select_dtypes(['object']).columns
Then, you can create the indicator variables using the for-loop below. For binary categorical variables, use LabelEncoder() to convert them to 0 and 1. For categorical variables with more than two categories, use pd.get_dummies() to obtain the indicator variables and then drop one category (to avoid the multicollinearity issue).
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for col in catColumns:
    n = len(df[col].unique())
    if n > 2:
        # more than two categories: one-hot encode and drop the first level
        X = pd.get_dummies(df[col])
        X = X.drop(X.columns[0], axis=1)
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)  # drop the original categorical column (optional)
    else:
        # binary category: label-encode in place as 0/1
        le.fit(df[col])
        df[col] = le.transform(df[col])
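After that loop, every column in df is numeric, so the frame can go straight into LogisticRegression. A minimal follow-up sketch (the 'target' column name is a placeholder for whatever your label column is called):

from sklearn.linear_model import LogisticRegression

y = df['target']                 # placeholder name for the label column
X = df.drop('target', axis=1)    # remaining columns are 0/1 indicators or label-encoded values

clf = LogisticRegression()
clf.fit(X, y)
print(clf.score(X, y))           # training accuracy, just as a sanity check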