I have a DataFrame with millions of rows. To build a model, I took a random sample from it using dataset.sample(int(len(dataset)/5)), which returns a random sample of items from an axis of the object. Now I want to verify that the sample is still representative of the population, that is, that each feature (column) of the sample has the same probability distribution as in the whole dataset (population). I have numerical as well as categorical features. How can I check that the features have the same probability distribution in Python?
For the continuous variables you can use a Kolmogorov-Smirnov statistic. This tests if two samples are drawn from the same distribution.
Usage in scipy:
scipy.stats.ks_2samp(data1, data2, alternative='two-sided', mode='auto')
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html
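As a rough sketch of how this could be applied column by column, the snippet below runs `ks_2samp` on each numeric column of a sample against the full data. The synthetic `dataset`, its column names (`age`, `income`) and the helper `ks_by_column` are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the real DataFrame (assumption: two numeric columns).
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "age": rng.normal(40, 12, size=50_000),
    "income": rng.lognormal(10, 0.5, size=50_000),
})

# Same sampling call as in the question, with a fixed seed for reproducibility.
sample = dataset.sample(int(len(dataset) / 5), random_state=0)


def ks_by_column(population, subset, columns):
    """Run a two-sample KS test per column and collect the results in a table."""
    rows = []
    for col in columns:
        result = stats.ks_2samp(subset[col].dropna(), population[col].dropna())
        rows.append({
            "column": col,
            "statistic": result.statistic,
            "pvalue": result.pvalue,
        })
    return pd.DataFrame(rows).set_index("column")


report = ks_by_column(dataset, sample, ["age", "income"])
print(report)
```

A small statistic and a large p-value mean the test found no evidence that the distributions differ. Note that the two-sample test assumes independent samples, while here the sample is a subset of the population; comparing against the rows left out of the sample (`dataset.drop(sample.index)`) may be a cleaner setup.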
Alternatively, if you already know the distribution, you can use the one-sample KS test, which tests your data against a given distribution:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
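A minimal sketch of the one-sample variant, assuming the reference distribution is a normal with known parameters. The data and the parameters (mean 5, standard deviation 2) are invented for the example; the test is only valid when those parameters are fixed in advance rather than estimated from the same data.

```python
import numpy as np
from scipy import stats

# Hypothetical data, drawn here from the same normal we test against.
rng = np.random.default_rng(1)
values = rng.normal(loc=5.0, scale=2.0, size=2_000)

# Compare the data with a normal distribution; args supplies (loc, scale).
result = stats.kstest(values, "norm", args=(5.0, 2.0))
print(result.statistic, result.pvalue)
```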
This does not require a test. If you have taken a simple random sample from the entire data frame, the probability distribution that each feature of the sample is drawn from is, in fact, that of the whole data set. That's a property of a simple random sample.
Unfortunately, unless the data set was ALSO sampled properly (something I assume you cannot control at this point) you cannot guarantee that the data set and sample have the same distribution. The probability distribution was determined at the point of sampling the data.
But if you're happy to assume that, then you need no additional checking step to ensure that your random sample does its job - this is provably guaranteed.
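The property can be illustrated empirically, for instance on a categorical column: the category proportions in a simple random sample should land close to those of the full data. The snippet below is only a descriptive illustration on made-up data (the `color` column and its proportions are assumptions), not a formal test.

```python
import numpy as np
import pandas as pd

# Hypothetical population with one categorical feature.
rng = np.random.default_rng(2)
dataset = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=100_000, p=[0.6, 0.3, 0.1]),
})

# Simple random sample of one fifth of the rows, as in the question.
sample = dataset.sample(int(len(dataset) / 5), random_state=2)

# Put the category proportions of population and sample side by side.
comparison = pd.DataFrame({
    "population": dataset["color"].value_counts(normalize=True),
    "sample": sample["color"].value_counts(normalize=True),
})
comparison["abs_diff"] = (comparison["population"] - comparison["sample"]).abs()
print(comparison)
```

The differences shrink as the sample grows; any remaining gap is ordinary sampling noise rather than a flaw in the sampling step.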