 

Categorical variables usage in pandas for ANOVA and regression?

To prepare a little toy example:

import pandas as pd
import numpy as np

high, size = 100, 20

# Toy data: two numeric predictors, two categoricals, and a numeric outcome
df = pd.DataFrame({'perception': np.random.randint(0, high, size),
                   'age': np.random.randint(0, high, size),
                   'outlook': pd.Categorical(np.tile(['positive', 'neutral', 'negative'], size//3+1)[:size]),
                   'smokes': pd.Categorical(np.tile(['lots', 'little', 'not'], size//3+1)[:size]),
                   'outcome': np.random.randint(0, high, size)
                  })
# Bucket `age` into decade-wide bins labelled "0 - 9", "10 - 19", ...
df['age_range'] = pd.Categorical(pd.cut(df.age, range(0, high+5, size//2), right=False,
                             labels=["{0} - {1}".format(i, i + 9) for i in range(0, high, size//2)]))
# np.random.shuffle can't reliably shuffle a Categorical Series in place;
# reassign a shuffled copy instead
df['smokes'] = df['smokes'].sample(frac=1).reset_index(drop=True)

Which will give you something like:

In [2]: df.head(10)
Out[2]:
   perception  age   outlook  smokes  outcome age_range
0          13   65  positive  little       22   60 - 69
1          95   21   neutral    lots       95   20 - 29
2          61   53  negative     not        4   50 - 59
3          27   98  positive     not       42   90 - 99
4          55   99   neutral  little       93   90 - 99
5          28    5  negative     not        4     0 - 9
6          84   83  positive    lots       18   80 - 89
7          66   22   neutral    lots       35   20 - 29
8          13   22  negative    lots       71   20 - 29
9          58   95  positive     not       77   90 - 99

Goal: figure out likelihood of outcome, given {perception, age, outlook, smokes}.

Secondary goal: figure out how important each column is in determining outcome.

Third goal: prove attributes about the distribution (the data here is randomly generated, so a random distribution should imply the null hypothesis is true?)


Clearly these are all questions answerable with statistical hypothesis testing. What's the right way of answering them in pandas?

asked May 23 '19 by A T

People also ask

Can you use categorical variables in ANOVA?

A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable.
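
For instance, on the toy frame above, a one-way ANOVA of outcome across the outlook levels is a one-liner with scipy (scipy.stats.f_oneway is an addition here, not something used in the original post):

from scipy import stats

# Split `outcome` by the levels of the categorical `outlook` column
groups = [g['outcome'].to_numpy() for _, g in df.groupby('outlook')]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)  # outcome is random noise, so a large p-value is expected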

Can ANOVA be used for categorical response?

ANOVA is used when the categorical variable has at least 3 groups (i.e., three different unique values). If you want to compare just two groups, use the t-test. I will cover the t-test in another article. ANOVA lets you know if your numerical variable changes according to the level of the categorical variable.
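
A sketch of the two-group case (assuming scipy, which the original post does not import), comparing outcome between two smokes levels:

from scipy import stats

# Independent-samples t-test between two groups of a categorical column
lots = df.loc[df['smokes'] == 'lots', 'outcome']
none = df.loc[df['smokes'] == 'not', 'outcome']
t_stat, p_value = stats.ttest_ind(lots, none)
print(t_stat, p_value)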

How do you treat categorical variables in regression?

Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model.
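
With the toy frame, statsmodels' formula interface does this recoding automatically; wrapping a column in C(...) expands it into dummy variables (statsmodels is an assumption here, not used in the original post):

import statsmodels.formula.api as smf

# C(...) dummy-codes a categorical column, holding out one reference level
model = smf.ols('outcome ~ perception + age + C(outlook) + C(smokes)', data=df).fit()
print(model.summary())  # one coefficient per non-reference category level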

How do pandas deal with categorical variables?

The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column. This has the benefit of not weighting a value improperly. There are many libraries out there that support one-hot encoding, but the simplest one is pandas' .get_dummies() method.
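
Applied to the toy frame, a minimal sketch:

# One new 0/1 column per category level; other columns pass through unchanged
dummies = pd.get_dummies(df, columns=['outlook', 'smokes'])
print(dummies.columns.tolist())
# ['perception', 'age', 'outcome', 'age_range', 'outlook_negative', ...]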


1 Answer

Likelihood of outcome given the columns, and feature importance (goals 1 and 2)

Categorical data

As the dataset contains categorical values, we can use scikit-learn's LabelEncoder to convert the categorical data into numeric codes.

from sklearn.preprocessing import LabelEncoder

# Map each category to an integer code in sorted order
# (e.g. negative=0, neutral=1, positive=2)
enc = LabelEncoder()
df['outlook'] = enc.fit_transform(df['outlook'])
df['smokes'] = enc.fit_transform(df['smokes'])

Result

df.head()

   perception  age  outlook  smokes  outcome age_range
0          67   43        2       1       78     0 - 9
1          77   66        1       1       13     0 - 9
2          33   10        0       1        1     0 - 9
3          74   46        2       1       22     0 - 9
4          14   26        1       2       16     0 - 9

Without creating any model, we can make use of the chi-squared test, p-values, and a correlation matrix to examine the relationships.

Correlation matrix

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations over the numeric columns (age_range, the last column, is dropped)
corr = df.iloc[:, :-1].corr()
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
plt.show()

[Figure: correlation matrix heatmap]

Chi-squared test and p-value

from sklearn.feature_selection import chi2

# chi2 returns a pair: (chi-squared statistics, p-values), one entry per feature
res = chi2(df.iloc[:, :4], df['outcome'])
features = pd.DataFrame({
    'features': df.columns[:4],
    'chi2': res[0],
    'p-value': res[1]
})

Result

features.head()

     features         chi2        p-value
0  perception  1436.012987  1.022335e-243
1         age  1416.063117  1.221377e-239
2     outlook    61.139303   9.805304e-01
3      smokes    57.147404   9.929925e-01
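
For the feature-importance goal, a model-based cross-check of these scores is also possible; the following RandomForestRegressor sketch is an addition, not part of the original answer:

from sklearn.ensemble import RandomForestRegressor

# Impurity-based importances from a forest fit on the label-encoded features
X, y = df.iloc[:, :4], df['outcome']
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, importance in zip(X.columns, rf.feature_importances_):
    print(name, round(importance, 3))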

The data is randomly generated, so the null hypothesis should hold. We can check this by trying to fit a normal curve to the outcome.

Distribution

from scipy import stats

# Overlay a fitted normal curve on the outcome histogram
# (distplot is deprecated in recent seaborn; histplot/displot replace it)
sns.distplot(df['outcome'], fit=stats.norm, kde=False)
plt.show()

[Figure: outcome histogram with fitted normal curve]

From the plot we can conclude that the data does not fit a normal distribution (as it was generated uniformly at random).
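
A formal test can back up the visual impression. The tests below are an addition: scipy.stats.normaltest checks departure from normality, and a Kolmogorov-Smirnov test against a uniform CDF matches how the data was actually generated.

from scipy import stats

print(stats.normaltest(df['outcome']))  # low p-value => reject normality
# KS test against uniform(0, 100); approximate, since randint yields discrete values
print(stats.kstest(df['outcome'], 'uniform', args=(0, 100)))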

Note: As the data is all randomly generated, your results can vary based on the size of the data set.

References

  • Hypothesis testing

  • Feature selection

answered Nov 09 '22 by skillsmuggler