I have two columns with similar data. I plot them to compare their distributions, and I want to quantify their difference.
import pandas as pd
df = pd.DataFrame({'a': ['cat', 'dog', 'bird', 'cat', 'dog', 'dog', 'dog'],
                   'b': ['cat', 'cat', 'cat', 'bird', 'dog', 'dog', 'dog']})
I then plot the 2 columns of my data frame to compare their distributions:
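# Sort by category so both sets of bars share the same x positions and tick labels
ax = df['a'].value_counts().sort_index().plot(kind='bar', color='blue', width=.75, legend=True, alpha=0.8, label='a')
df['b'].value_counts().sort_index().plot(kind='bar', color='maroon', width=.5, alpha=1, legend=True, label='b', ax=ax)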

How can I quantify the difference between the distributions statistically, to say how similar they are?
Would it be a simple t-test, or something else?
It is very common to use the two-sided Kolmogorov-Smirnov test for this.
In Python, you can do so with scipy.stats.ks_2samp:
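from scipy import stats
# Count each category per column; naming the columns explicitly keeps
# merged['a'] / merged['b'] valid across pandas versions
merged = pd.merge(
    df['a'].value_counts().to_frame('a'),
    df['b'].value_counts().to_frame('b'),
    left_index=True,
    right_index=True)
stats.ks_2samp(merged['a'], merged['b'])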
Broadly speaking, if the second value of the returned tuple (the p-value) is small, say less than 0.05, you can reject the null hypothesis that the two samples come from the same distribution.
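As a rough sketch of how that result can be read, the snippet below runs ks_2samp on the per-category counts and pulls out the p-value by name. It also runs a chi-square test of homogeneity on the same counts as a cross-check. The chi-square test is not what the answer above proposes; it is a swapped-in alternative, included on the assumption that the columns are categorical, since the Kolmogorov-Smirnov test is designed for continuous samples. The variable names (counts, ks, chi2 and so on) are illustrative only.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({'a': ['cat', 'dog', 'bird', 'cat', 'dog', 'dog', 'dog'],
                   'b': ['cat', 'cat', 'cat', 'bird', 'dog', 'dog', 'dog']})

# One row per category, one column per original column.
# Building from a dict of Series aligns on the category labels;
# a category missing from one column would become NaN, hence fillna(0).
counts = pd.DataFrame({'a': df['a'].value_counts(),
                       'b': df['b'].value_counts()}).fillna(0).astype(int)

# Kolmogorov-Smirnov test on the counts, as in the answer above.
ks = stats.ks_2samp(counts['a'], counts['b'])
print(f"KS statistic={ks.statistic:.3f}, p-value={ks.pvalue:.3f}")

# Cross-check (assumption: categorical data): chi-square test of homogeneity
# on the category-by-column contingency table.
chi2, p_value, dof, expected = stats.chi2_contingency(counts)
print(f"chi2={chi2:.3f}, dof={dof}, p-value={p_value:.3f}")

# Same decision rule for both tests: a small p-value is evidence of a difference.
alpha = 0.05
print("KS rejects equality:", ks.pvalue < alpha)
print("chi-square rejects equality:", p_value < alpha)
```

With only seven rows per column, the expected counts are very small, so both p-values should be treated as indicative at best; with real data the same code applies unchanged.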