To prepare a little toy example:
import pandas as pd
import numpy as np
high, size = 100, 20
# two numeric columns, two categorical columns, and a numeric outcome
df = pd.DataFrame({'perception': np.random.randint(0, high, size),
                   'age': np.random.randint(0, high, size),
                   'outlook': pd.Categorical(np.tile(['positive', 'neutral', 'negative'], size//3+1)[:size]),
                   'smokes': pd.Categorical(np.tile(['lots', 'little', 'not'], size//3+1)[:size]),
                   'outcome': np.random.randint(0, high, size)
                  })
# bin age into decades, e.g. "20 - 29"
df['age_range'] = pd.Categorical(pd.cut(df.age, range(0, high+5, size//2), right=False,
                             labels=["{0} - {1}".format(i, i + 9) for i in range(0, high, size//2)]))
# shuffle the smokes column so it is independent of the other columns
# (np.random.shuffle on a pandas Series is unreliable; sample + reset keeps the Categorical dtype)
df['smokes'] = df['smokes'].sample(frac=1).reset_index(drop=True)
Which will give you something like:
In [2]: df.head(10)
Out[2]:
   perception  age   outlook  smokes  outcome age_range
0          13   65  positive  little       22   60 - 69
1          95   21   neutral    lots       95   20 - 29
2          61   53  negative     not        4   50 - 59
3          27   98  positive     not       42   90 - 99
4          55   99   neutral  little       93   90 - 99
5          28    5  negative     not        4     0 - 9
6          84   83  positive    lots       18   80 - 89
7          66   22   neutral    lots       35   20 - 29
8          13   22  negative    lots       71   20 - 29
9          58   95  positive     not       77   90 - 99
Goal: figure out likelihood of outcome, given {perception, age, outlook, smokes}.
Secondary goal: figure out how important each column is in determining outcome.
Third goal: prove attributes about the distribution (here the data is randomly generated, so a uniform random distribution should imply the null hypothesis is true?)
Clearly these are all questions answerable with statistical hypothesis testing. What's the right way of answering them in pandas?
A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable, and you wish to test for differences in the means of the dependent variable across the levels of the independent variable.
ANOVA is used when the categorical variable has at least 3 groups (i.e. three distinct values); if you want to compare just two groups, use a t-test instead. ANOVA tells you whether your numerical variable changes with the level of the categorical variable.
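For example, a one-way ANOVA on the toy df above, testing whether the mean outcome differs across the three smokes groups (a minimal sketch using scipy):
from scipy import stats
# one group of outcome values per smokes category
groups = [grp['outcome'].values for _, grp in df.groupby('smokes', observed=True)]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)  # a large p-value -> no evidence the group means differ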
Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model.
The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column. This has the benefit of not weighting a value improperly. There are many libraries that support one-hot encoding, but the simplest is pandas' get_dummies() method.
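For example, on the toy df above (the rest of this answer uses LabelEncoder instead, but the one-hot route would look like this):
# each category of outlook and smokes becomes its own 0/1 indicator column
dummies = pd.get_dummies(df, columns=['outlook', 'smokes'])
dummies.head()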
Finding the likelihood of the outcome given the columns, and feature importance (goals 1 and 2)
Categorical data
As the dataset contains categorical values, we can use LabelEncoder() to convert the categorical columns into numeric ones.
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
# fit_transform maps each category to an integer (the encoder is refitted per column)
df['outlook'] = enc.fit_transform(df['outlook'])
df['smokes'] = enc.fit_transform(df['smokes'])
Result
df.head()
   perception  age  outlook  smokes  outcome age_range
0          67   43        2       1       78   40 - 49
1          77   66        1       1       13   60 - 69
2          33   10        0       1        1   10 - 19
3          74   46        2       1       22   40 - 49
4          14   26        1       2       16   20 - 29
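As an aside, because outlook and smokes were built as pandas Categorical columns, the same integer codes can be obtained without scikit-learn (an alternative to the LabelEncoder step above, not an additional step):
# integer codes straight from the Categorical dtype
df['outlook'] = df['outlook'].cat.codes
df['smokes'] = df['smokes'].cat.codes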
Without creating any model, we can use the chi-squared test, its p-values and a correlation matrix to gauge how the columns relate to the outcome.
Correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns
# exclude the non-numeric age_range column (the last one) before computing correlations
corr = df.iloc[:, :-1].corr()
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
plt.show()

Chi-squared test and p-value
from sklearn.feature_selection import chi2
# chi2 scores each (non-negative) feature column against the outcome; returns (scores, p-values)
res = chi2(df.iloc[:, :4], df['outcome'])
features = pd.DataFrame({
    'features': df.columns[:4],
    'chi2': res[0],
    'p-value': res[1]
})
Result
features.head()
     features         chi2        p-value
0  perception  1436.012987  1.022335e-243
1         age  1416.063117  1.221377e-239
2     outlook    61.139303   9.805304e-01
3      smokes    57.147404   9.929925e-01
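If you want a ranked shortlist of columns rather than raw scores (the feature-importance part of goal 2), the same chi2 scoring plugs into SelectKBest; a minimal sketch, with k=2 chosen arbitrarily:
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=2).fit(df.iloc[:, :4], df['outcome'])
print(df.columns[:4][selector.get_support()])  # the two highest-scoring features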
The data is randomly generated, so the null hypothesis (no real relationship between the columns and the outcome) is true. We can illustrate this by trying to fit a normal curve to the outcome.
Distribution
from scipy import stats
# distplot overlays a fitted normal curve on the outcome histogram
sns.distplot(df['outcome'], fit=stats.norm, kde=False)
plt.show()
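Note that distplot is deprecated in seaborn 0.11+; on newer versions a rough equivalent (my own sketch, not what the original used) is:
import numpy as np
from scipy import stats
sns.histplot(df['outcome'], stat='density')
mu, sigma = stats.norm.fit(df['outcome'])          # fit a normal curve to the outcome
x = np.linspace(df['outcome'].min(), df['outcome'].max(), 200)
plt.plot(x, stats.norm.pdf(x, mu, sigma))           # overlay the fitted curve
plt.show()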

From the plot we can conclude that the data does not fit a normal distribution (as it is uniformly randomly generated).
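To back the visual impression with an actual hypothesis test (goal 3), scipy's D'Agostino-Pearson normality test can be used; a minimal sketch (with only 20 rows the test has limited power):
from scipy import stats
# null hypothesis: the outcome comes from a normal distribution
stat, p = stats.normaltest(df['outcome'])
print(stat, p)  # a small p-value would reject normality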
Note: As the data is all randomly generated, your results can vary, depending on the size of the data set.