To prepare a little toy example:
import pandas as pd
import numpy as np

high, size = 100, 20
df = pd.DataFrame({
    'perception': np.random.randint(0, high, size),
    'age': np.random.randint(0, high, size),
    'outlook': pd.Categorical(np.tile(['positive', 'neutral', 'negative'], size // 3 + 1)[:size]),
    'smokes': pd.Categorical(np.tile(['lots', 'little', 'not'], size // 3 + 1)[:size]),
    'outcome': np.random.randint(0, high, size)
})
df['age_range'] = pd.Categorical(pd.cut(df.age, range(0, high + 5, size // 2), right=False,
                                        labels=["{0} - {1}".format(i, i + 9) for i in range(0, high, size // 2)]))
# Shuffle 'smokes' so it is not aligned with 'outlook'; .sample avoids mutating the Categorical Series in place
df['smokes'] = df['smokes'].sample(frac=1).reset_index(drop=True)
Which will give you something like:
In [2]: df.head(10)
Out[2]:
perception age outlook smokes outcome age_range
0 13 65 positive little 22 60 - 69
1 95 21 neutral lots 95 20 - 29
2 61 53 negative not 4 50 - 59
3 27 98 positive not 42 90 - 99
4 55 99 neutral little 93 90 - 99
5 28 5 negative not 4 0 - 9
6 84 83 positive lots 18 80 - 89
7 66 22 neutral lots 35 20 - 29
8 13 22 negative lots 71 20 - 29
9 58 95 positive not 77 90 - 99
Goal: figure out the likelihood of outcome, given {perception, age, outlook, smokes}.
Secondary goal: figure out how important each column is in determining outcome.
Third goal: prove attributes about the distribution (here the data is randomly generated, so a random distribution should imply the null hypothesis is true?).
Clearly these are all questions answerable with statistical hypothesis testing. What's the right way of answering them in pandas?
A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable, and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable.
ANOVA is used when the categorical variable has at least 3 groups (i.e. three or more unique values). If you want to compare just two groups, use the t-test; I will cover the t-test in another article. ANOVA lets you know whether your numerical variable changes according to the level of the categorical variable.
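As a sketch on the toy data above (assuming the df built in the setup; scipy's f_oneway is my choice here, not something this answer prescribes), a one-way ANOVA of outcome across the levels of smokes looks like:
from scipy import stats

# One array of 'outcome' values per level of the categorical 'smokes' column
groups = [group['outcome'].values for _, group in df.groupby('smokes')]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)  # a large p-value means no evidence that the group means differ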
Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model.
The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column. This has the benefit of not weighting a value improperly. There are many libraries out there that support one-hot encoding, but the simplest is pandas' get_dummies() method, sketched below.
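A minimal sketch on the toy df (column names such as outlook_positive are whatever get_dummies derives from the category labels):
# One-hot encode the two categorical columns; each category becomes its own 0/1 column
dummies = pd.get_dummies(df[['outlook', 'smokes']])
df_encoded = pd.concat([df.drop(columns=['outlook', 'smokes']), dummies], axis=1)
print(df_encoded.columns.tolist())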
Finding out likelihood of outcome given columns, and feature importance (goals 1 and 2)
Categorical data
As the dataset contains categorical values, we can use LabelEncoder() to convert the categorical data into numeric data.
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
# Replace the string categories with integer codes (e.g. negative/neutral/positive -> 0/1/2)
df['outlook'] = enc.fit_transform(df['outlook'])
df['smokes'] = enc.fit_transform(df['smokes'])
Result
df.head()
perception age outlook smokes outcome age_range
0 67 43 2 1 78 0 - 9
1 77 66 1 1 13 0 - 9
2 33 10 0 1 1 0 - 9
3 74 46 2 1 22 0 - 9
4 14 26 1 2 16 0 - 9
Without creating any model, we can make use of the chi-squared test, its p-value, and the correlation matrix to determine the relationships.
Correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Pairwise correlations over the numeric columns (the last column, age_range, is excluded)
corr = df.iloc[:, :-1].corr()
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
plt.show()
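If you only care about how each column relates to outcome, the relevant row of the matrix can be read off directly (a small addition on top of the answer's code):
# Correlation of every numeric column with 'outcome', strongest first
print(corr['outcome'].sort_values(ascending=False))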
Chi-squared test and p-value
from sklearn.feature_selection import chi2
# Chi-squared statistic of each of the first four columns against 'outcome'
res = chi2(df.iloc[:, :4], df['outcome'])
features = pd.DataFrame({
    'features': df.columns[:4],
    'chi2': res[0],
    'p-value': res[1]
})
Result
features.head()
features chi2 p-value
0 perception 1436.012987 1.022335e-243
1 age 1416.063117 1.221377e-239
2 outlook 61.139303 9.805304e-01
3 smokes 57.147404 9.929925e-01
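To read this as a feature-importance style ranking (a sketch; the 0.05 significance threshold is a conventional choice, not something the answer fixes):
# Rank features by their chi-squared statistic and flag p-values below 0.05
features = features.sort_values('chi2', ascending=False)
features['significant'] = features['p-value'] < 0.05
print(features)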
The data is randomly generated, so the null hypothesis should be true. We can verify this by trying to fit a normal curve to outcome.
Distribution
from scipy import stats
# Histogram of 'outcome' with a fitted normal curve overlaid
# (sns.distplot is deprecated in newer seaborn; histplot plus a manual overlay is the modern equivalent)
sns.distplot(df['outcome'], fit=stats.norm, kde=False)
plt.show()
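The visual check can also be backed by a formal test; for example (a sketch using scipy's Shapiro-Wilk test, which the answer itself does not include):
from scipy import stats

# Shapiro-Wilk test of normality on 'outcome'; a small p-value rejects normality
w_stat, p_value = stats.shapiro(df['outcome'])
print(w_stat, p_value)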
From the plot we can conclude that the data does not fit a normal distribution (as it is randomly generated).
Note: as the data is all randomly generated, your results may vary depending on the size of the data set.