How to check if sample has same probability distribution as population in Python?

Question

I have a DataFrame with millions of rows. To build a model, I took a random sample from it using dataset.sample(int(len(dataset)/5)), which returns a random sample of items from an axis of the object. Now I want to verify that the sample remains statistically representative of the population, that is, that each feature (column) in the sample has the same probability distribution as in the whole dataset (population). I have numerical as well as categorical features. How can I check in Python that the features have the same probability distribution?

Bobby Klann · Accepted Answer

For the continuous variables you can use the two-sample Kolmogorov-Smirnov test, which tests whether two samples are drawn from the same distribution.

Usage in scipy:

scipy.stats.ks_2samp(data1, data2, alternative='two-sided', mode='auto')

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
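As a rough sketch of how this could be applied column by column: the helper name ks_report and the synthetic columns x and y below are made up for illustration and stand in for the real data. One caveat: the test assumes two independent samples, and a subsample of the full dataset is not independent of it, so the p-values are best read as a rough diagnostic.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_report(population: pd.DataFrame, sample: pd.DataFrame) -> pd.DataFrame:
    """Run a two-sample KS test on every numeric column of `population`.

    Hypothetical helper, not part of pandas or SciPy.
    """
    rows = []
    for col in population.select_dtypes(include="number").columns:
        # Drop missing values so both inputs are clean 1-D arrays.
        result = ks_2samp(population[col].dropna(), sample[col].dropna())
        rows.append(
            {"column": col, "statistic": result.statistic, "pvalue": result.pvalue}
        )
    return pd.DataFrame(rows)


# Synthetic stand-in for the real DataFrame (assumption for this example).
rng = np.random.default_rng(0)
dataset = pd.DataFrame(
    {"x": rng.normal(size=10_000), "y": rng.exponential(size=10_000)}
)

# Same sampling call as in the question, with a seed for reproducibility.
sample = dataset.sample(int(len(dataset) / 5), random_state=0)

report = ks_report(dataset, sample)
print(report)
```

A small statistic and a large p-value for a column mean the test found no evidence that the sample and the full dataset differ on that column.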

Alternatively, if you already know the distribution, you can use the one-sample KS test (scipy.stats.kstest), which tests your data against a given distribution:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
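A minimal sketch of the one-sample case, assuming the reference distribution is a standard normal. The data here is synthetic and the choice of "norm" with mean 0 and standard deviation 1 is only an illustration; substitute the distribution you actually know.

```python
import numpy as np
from scipy.stats import kstest

# Synthetic data drawn from a standard normal (assumption for this example).
rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=5_000)

# Compare the data against a normal CDF with loc=0 and scale=1.
# `args` passes the distribution parameters to the named scipy.stats distribution.
result = kstest(values, "norm", args=(0.0, 1.0))
print(result.statistic, result.pvalue)
```

The parameters should be known in advance. If they are estimated from the same data that is being tested, the reported p-value is no longer exact.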

richarddmorey · Answer

This does not require a test. If you have taken a simple random sample from the entire data frame, then the distribution each feature of the sample is drawn from is, in fact, the whole data set. That's a property of a simple random sample.

Unfortunately, unless the data set was ALSO sampled properly (something I assume you cannot control at this point) you cannot guarantee that the data set and sample have the same distribution. The probability distribution was determined at the point of sampling the data.

But if you're happy to assume that, then you need no additional checking step to ensure that your random sample does its job - this is provably guaranteed.
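To see this property informally, one could compare a few quantiles of a column in the full data set and in the random sample. This is only an illustrative sketch: the column name value and the exponential data are made up, and the comparison is a sanity check rather than a required step.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the full data set (assumption for this example).
rng = np.random.default_rng(0)
dataset = pd.DataFrame({"value": rng.exponential(size=100_000)})

# Simple random sample of one fifth of the rows, as in the question.
sample = dataset.sample(int(len(dataset) / 5), random_state=0)

# Compare selected quantiles of the full data set and the sample.
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
comparison = pd.DataFrame(
    {
        "population": dataset["value"].quantile(probs),
        "sample": sample["value"].quantile(probs),
    }
)
comparison["abs_diff"] = (comparison["population"] - comparison["sample"]).abs()
print(comparison)
```

With a sample this large, the quantiles of the sample should sit very close to those of the full data set, differing only by sampling noise.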
