Linear regression analysis with string/categorical features (variables)?

Tags: python, machine-learning, linear-regression, regression, feature-selection

Regression algorithms seem to be working on features represented as numbers. For example:

[Image: simple data without categorical features]

This data set doesn't contain categorical features/variables. It's quite clear how to do regression on this data and predict price.

But now I want to do a regression analysis on data that contain categorical features:

[Image: data set with categorical features]

There are 5 features: District, Condition, Material, Security, Type

How can I do a regression on this data? Do I have to transform all the string/categorical data to numbers manually? That is, do I have to create some encoding rules myself and transform all the data to numeric values according to those rules?

Is there any simple way to transform string data to numbers without having to create my own encoding rules manually? Maybe there are some libraries in Python that can be used for that? Are there some risks that the regression model will be somehow incorrect due to "bad encoding"?


asked Nov 30 '15 20:11

Erba Aitbayev

1 Answer

Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.

Usually there are three possibilities:

1. One-hot encoding for categorical data without a natural order
2. Integer codes that preserve the order for ordinal data
3. Something like group means for categorical data (e.g. mean prices for city districts)

You have to be careful not to feed the model information that you will not have when the model is actually applied.

One-hot encoding

If you have categorical data, you can create dummy variables with 0/1 values for each possible value.

For example, this table:

    idx color
    0   blue
    1   green
    2   green
    3   red

becomes:

    idx blue green red
    0   1    0     0
    1   0    1     0
    2   0    1     0
    3   0    0     1

This can easily be done with pandas:

    import pandas as pd

    data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
    # dtype=int gives 0/1 columns; recent pandas versions default to True/False
    print(pd.get_dummies(data, dtype=int))

will result in:

       color_blue  color_green  color_red
    0           1            0          0
    1           0            1          0
    2           0            1          0
    3           0            0          1
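To connect the dummy columns back to the regression in the question, here is a minimal sketch on a made-up data set. The column names `area`, `material` and `price` and all values are hypothetical and not taken from the question. It assumes you want an intercept in the model, so it passes `drop_first=True` to keep the dummy columns from being collinear with the constant column, and it fits ordinary least squares with NumPy's `lstsq` rather than a dedicated regression library.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, constructed as
# price = 10 + 2 * area, plus 20 for 'panel', minus 15 for 'wood'
df = pd.DataFrame({
    'area': [50, 60, 70, 80, 90, 100],
    'material': ['brick', 'wood', 'brick', 'panel', 'wood', 'panel'],
    'price': [110, 115, 150, 190, 175, 230],
})

# One-hot encode 'material'; drop_first=True drops the first category
# ('brick'), which then acts as the baseline absorbed by the intercept.
X = pd.get_dummies(df[['area', 'material']], columns=['material'],
                   drop_first=True, dtype=float)
X.insert(0, 'intercept', 1.0)

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X.to_numpy(dtype=float),
                           df['price'].to_numpy(dtype=float),
                           rcond=None)

for name, value in zip(X.columns, coef):
    print(f'{name}: {value:.2f}')
```

Because the toy prices are exactly linear, the fit recovers the construction: each dummy coefficient is the price difference relative to the dropped baseline category `brick`.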

Numbers for ordinal data

Create a mapping from your sortable categories to increasing integers, e.g. old < renovated < new → 0, 1, 2.

This is also possible with pandas:

    import pandas as pd

    data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
    data['q'] = data['q'].astype('category')
    # Declare the order explicitly: old < ren < new
    data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True)
    data['q'] = data['q'].cat.codes  # integer codes follow the declared order
    print(data['q'])

Result:

    0    0
    1    2
    2    2
    3    1
    Name: q, dtype: int8
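One risk of "bad encoding" shows up here: if you skip the explicit ordering, pandas sorts string categories alphabetically, so the codes no longer reflect old < ren < new. The sketch below contrasts that with an explicit dictionary mapping. The names `condition_order` and `q_code` are made up for illustration, and raising an error on unknown labels is just one possible policy.

```python
import pandas as pd

data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})

# Without an explicit order the categories are sorted alphabetically
# (new, old, ren), so 'new' gets code 0 and 'old' gets code 1.
alphabetical = data['q'].astype('category').cat.codes
print(alphabetical.tolist())

# Explicit mapping that encodes the intended order old < ren < new
condition_order = {'old': 0, 'ren': 1, 'new': 2}
codes = data['q'].map(condition_order)

# Labels missing from the mapping become NaN; fail loudly instead of
# silently passing missing values on to the regression.
if codes.isna().any():
    unknown = data.loc[codes.isna(), 'q'].unique().tolist()
    raise ValueError(f'unmapped categories: {unknown}')

data['q_code'] = codes.astype(int)
print(data['q_code'].tolist())
```

Keep in mind that any integer coding of this kind tells a linear model that the step from old to ren has the same effect as the step from ren to new. If that is not plausible for your data, one-hot encoding is the safer choice.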

Using categorical data for groupby operations

You could use the mean for each category over past (known) events.

Say you have a DataFrame with the last known prices for cities and compute the mean price per city from it:

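    import pandas as pd

    # Past, known prices per city
    prices = pd.DataFrame({
        'city': ['A', 'A', 'A', 'B', 'B', 'C'],
        'price': [1, 1, 1, 2, 2, 3],
    })
    # One row per city with its mean price
    mean_price = prices.groupby('city', as_index=False)['price'].mean()

    data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})
    # Attach the mean price of its city to every row
    print(data.merge(mean_price, on='city', how='left'))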

Result:

      city  price
    0    A    1.0
    1    B    2.0
    2    C    3.0
    3    A    1.0
    4    B    2.0
    5    A    1.0
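Two practical details are worth sketching here, as assumptions on my part rather than requirements of the approach: the means should come only from data you already know (otherwise the target leaks into the feature), and a city that never appeared in that data needs some fallback value. The example below uses the overall mean as the fallback. The frame names `train` and `new` and the column `city_mean_price` are hypothetical.

```python
import pandas as pd

# Known past data: the only source for the group means
train = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})

# New rows to encode; city 'D' never appeared in the known data
new = pd.DataFrame({'city': ['A', 'B', 'D']})

city_mean = train.groupby('city')['price'].mean()  # Series indexed by city
global_mean = train['price'].mean()                # fallback for unseen cities

# Look up each city's mean; unseen cities give NaN and get the global mean
new['city_mean_price'] = new['city'].map(city_mean).fillna(global_mean)
print(new)
```

Using `map` with a fallback instead of a left merge makes the handling of unseen categories explicit. A plain left merge would leave NaN in those rows.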


answered Sep 21 '22 10:09

MaxNoe
