Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Linear regression analysis with string/categorical features (variables)?

Regression algorithms seem to be working on features represented as numbers. For example:

simple data without categorical features

This data set doesn't contain categorical features/variables. It's quite clear how to do regression on this data and predict price.


But now I want to do a regression analysis on data that contain categorical features:

data-set with categorical features

There are 5 features: District, Condition, Material, Security, Type


How can I do a regression on this data? Do I have to transform all the string/categorical data to numbers manually? I mean if I have to create some encoding rules and according to that rules transform all data to numeric values.

Is there any simple way to transform string data to numbers without having to create my own encoding rules manually? Maybe there are some libraries in Python that can be used for that? Are there some risks that the regression model will be somehow incorrect due to "bad encoding"?

like image 829
Erba Aitbayev Avatar asked Nov 30 '15 20:11

Erba Aitbayev


People also ask

Can you use linear regression with categorical variables?

All Answers (16) Categorical variables can absolutely used in a linear regression model.

How regression is performed with categorical variables?

Categorical regression is also known by the acronym CATREG, for categorical regression. Standard linear regression analysis involves minimizing the sum of squared differences between a response (dependent) variable and a weighted combination of predictor (independent) variables.

Can you use categorical variables in linear regression Python?

Multiple linear regression accepts not only numerical variables, but also categorical ones. To include a categorical variable in a regression model, the variable has to be encoded as a binary variable (dummy variable). In Pandas, we can easily convert a categorical variable into a dummy variable using the pandas.

Which regression is best for categorical data?

LOGISTIC REGRESSION MODEL This model is the most popular for binary dependent variables. It is highly recommended to start from this model setting before more sophisticated categorical modeling is carried out. Dependent variable yi can only take two possible outcomes.


1 Answers

Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.

Usually there are three possibilities:

  1. One-Hot encoding for categorical data
  2. Arbitrary numbers for ordinal data
  3. Use something like group means for categorical data (e. g. mean prices for city districts).

You have to be carefull to not infuse information you do not have in the application case.

One hot encoding

If you have categorical data, you can create dummy variables with 0/1 values for each possible value.

E. g.

idx color 0   blue 1   green 2   green 3   red 

to

idx blue green red 0   1    0     0 1   0    1     0 2   0    1     0 3   0    0     1 

This can easily be done with pandas:

import pandas as pd  data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']}) print(pd.get_dummies(data)) 

will result in:

   color_blue  color_green  color_red 0           1            0          0 1           0            1          0 2           0            1          0 3           0            0          1 

Numbers for ordinal data

Create a mapping of your sortable categories, e. g. old < renovated < new → 0, 1, 2

This is also possible with pandas:

data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']}) data['q'] = data['q'].astype('category') data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True) data['q'] = data['q'].cat.codes print(data['q']) 

Result:

0    0 1    2 2    2 3    1 Name: q, dtype: int8 

Using categorical data for groupby operations

You could use the mean for each category over past (known events).

Say you have a DataFrame with the last known mean prices for cities:

prices = pd.DataFrame({     'city': ['A', 'A', 'A', 'B', 'B', 'C'],     'price': [1, 1, 1, 2, 2, 3], }) mean_price = prices.groupby('city').mean() data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})  print(data.merge(mean_price, on='city', how='left')) 

Result:

  city  price 0    A      1 1    B      2 2    C      3 3    A      1 4    B      2 5    A      1 
like image 109
MaxNoe Avatar answered Sep 21 '22 10:09

MaxNoe