To prepare a little toy example:
import pandas as pd
import numpy as np

high, size = 100, 20
df = pd.DataFrame({
    'perception': np.random.randint(0, high, size),
    'age': np.random.randint(0, high, size),
    'outlook': pd.Categorical(np.tile(['positive', 'neutral', 'negative'], size // 3 + 1)[:size]),
    'smokes': pd.Categorical(np.tile(['lots', 'little', 'not'], size // 3 + 1)[:size]),
    'outcome': np.random.randint(0, high, size)
})
df['age_range'] = pd.Categorical(pd.cut(df.age, range(0, high + 5, size // 2), right=False,
                                        labels=["{0} - {1}".format(i, i + 9) for i in range(0, high, size // 2)]))
# Shuffle 'smokes' so it is not aligned with 'outlook'; .sample avoids mutating the Categorical Series in place
df['smokes'] = df['smokes'].sample(frac=1).reset_index(drop=True)
Which will give you something like:
In [2]: df.head(10)
Out[2]:
perception age outlook smokes outcome age_range
0 13 65 positive little 22 60 - 69
1 95 21 neutral lots 95 20 - 29
2 61 53 negative not 4 50 - 59
3 27 98 positive not 42 90 - 99
4 55 99 neutral little 93 90 - 99
5 28 5 negative not 4 0 - 9
6 84 83 positive lots 18 80 - 89
7 66 22 neutral lots 35 20 - 29
8 13 22 negative lots 71 20 - 29
9 58 95 positive not 77 90 - 99
Goal: figure out the likelihood of outcome, given {perception, age, outlook, smokes}.
Secondary goal: figure out how important each column is in determining outcome.
Third goal: prove attributes about the distribution (here the data is randomly generated, so a random distribution should imply the null hypothesis is true?).
Clearly these are all questions answerable with statistical hypothesis testing. What's the right way of answering them in pandas?
A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable, and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable.
ANOVA is used when the categorical variable has at least 3 groups (i.e. three or more unique values). If you want to compare just two groups, use the t-test; I will cover the t-test in another article. ANOVA lets you know whether your numerical variable changes according to the level of the categorical variable.
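As a sketch on the toy data above (assuming the df built in the setup; scipy's f_oneway is my choice here, not something this answer prescribes), a one-way ANOVA of outcome across the levels of smokes looks like:
from scipy import stats

# One array of 'outcome' values per level of the categorical 'smokes' column
groups = [group['outcome'].values for _, group in df.groupby('smokes')]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)  # a large p-value means no evidence that the group means differ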
Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model.
The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column. This has the benefit of not weighting a value improperly. There are many libraries out there that support one-hot encoding, but the simplest is pandas' get_dummies() method, sketched below.
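A minimal sketch on the toy df (column names such as outlook_positive are whatever get_dummies derives from the category labels):
# One-hot encode the two categorical columns; each category becomes its own 0/1 column
dummies = pd.get_dummies(df[['outlook', 'smokes']])
df_encoded = pd.concat([df.drop(columns=['outlook', 'smokes']), dummies], axis=1)
print(df_encoded.columns.tolist())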
Finding out likelihood of outcome given columns, and feature importance (goals 1 and 2)
Categorical data
As the dataset contains categorical values, we can use LabelEncoder() to convert the categorical data into numeric data.
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
# Replace the string categories with integer codes (e.g. negative/neutral/positive -> 0/1/2)
df['outlook'] = enc.fit_transform(df['outlook'])
df['smokes'] = enc.fit_transform(df['smokes'])
Result
df.head()
perception age outlook smokes outcome age_range
0 67 43 2 1 78 0 - 9
1 77 66 1 1 13 0 - 9
2 33 10 0 1 1 0 - 9
3 74 46 2 1 22 0 - 9
4 14 26 1 2 16 0 - 9
Without creating any model, we can make use of the chi-squared test, its p-value, and the correlation matrix to determine the relationships.
Correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Pairwise correlations over the numeric columns (the last column, age_range, is excluded)
corr = df.iloc[:, :-1].corr()
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
plt.show()
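If you only care about how each column relates to outcome, the relevant row of the matrix can be read off directly (a small addition on top of the answer's code):
# Correlation of every numeric column with 'outcome', strongest first
print(corr['outcome'].sort_values(ascending=False))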
Chi-squared test and p-value
from sklearn.feature_selection import chi2
# Chi-squared statistic of each of the first four columns against 'outcome'
res = chi2(df.iloc[:, :4], df['outcome'])
features = pd.DataFrame({
    'features': df.columns[:4],
    'chi2': res[0],
    'p-value': res[1]
})
Result
features.head()
features chi2 p-value
0 perception 1436.012987 1.022335e-243
1 age 1416.063117 1.221377e-239
2 outlook 61.139303 9.805304e-01
3 smokes 57.147404 9.929925e-01
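To read this as a feature-importance style ranking (a sketch; the 0.05 significance threshold is a conventional choice, not something the answer fixes):
# Rank features by their chi-squared statistic and flag p-values below 0.05
features = features.sort_values('chi2', ascending=False)
features['significant'] = features['p-value'] < 0.05
print(features)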
The data is randomly generated, so the null hypothesis should be true. We can verify this by trying to fit a normal curve to outcome.
Distribution
from scipy import stats
# Histogram of 'outcome' with a fitted normal curve overlaid
# (sns.distplot is deprecated in newer seaborn; histplot plus a manual overlay is the modern equivalent)
sns.distplot(df['outcome'], fit=stats.norm, kde=False)
plt.show()
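The visual check can also be backed by a formal test; for example (a sketch using scipy's Shapiro-Wilk test, which the answer itself does not include):
from scipy import stats

# Shapiro-Wilk test of normality on 'outcome'; a small p-value rejects normality
w_stat, p_value = stats.shapiro(df['outcome'])
print(w_stat, p_value)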
From the plot we can conclude that the data does not fit a normal distribution (as it is randomly generated).
Note: as the data is all randomly generated, your results may vary depending on the size of the data set.