I have a data set consisting of the number of page views over 6 months for 30k customers. It also consists of the following:
Now I did try to run a normality test using:
from scipy.stats import normaltest
k2, p = normaltest(df)
print(p)
This returns a p-value of 0.0, so the test rejects the hypothesis that the data follow a normal distribution.
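For context, here is a minimal, self-contained sketch of that test on simulated data (the exponential sample is a hypothetical stand-in for skewed page-view counts, not your actual data frame):

```python
from scipy.stats import normaltest
import numpy as np

# Hypothetical right-skewed sample standing in for page-view counts.
rng = np.random.default_rng(0)
views = rng.exponential(scale=50, size=30_000)

k2, p = normaltest(views)
# A p-value below the usual 0.05 threshold rejects normality.
print(p < 0.05)
```

Note that `normaltest` tests the null hypothesis that the sample was drawn from a normal distribution; a tiny p-value means "reject normality", not a measure of how non-normal the data are.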
Now I want to know why that is. I thought that, generally, as the sample size increases, data become normally distributed; since my data set has 30k rows, I don't understand why it isn't normally distributed.
I did try converting the values into Z-scores, but still no luck. Can I transform my data so that it follows a normal distribution? Is there a method for doing that?
In the area I work in (mass spectrometry), we typically log-transform data that is heteroscedastic, as yours probably is. In my field, small values are far more likely than large ones, so we end up with an exponential distribution.
I'm guessing your data will look like mine, in which case you will need to log-transform it to make it approximately normal. I would do this so that I can apply t-tests and other statistical models.
Something like:

import numpy as np

df_visits = np.log(df_visits)

Of course, you will also need to get rid of any zeros before you can log-transform, since log(0) is undefined (alternatively, np.log1p maps 0 to 0).
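To illustrate the whole workflow, here is a hedged sketch on simulated data. It assumes the counts are roughly log-normal (an assumption for illustration; the asker's real data may differ), drops the zeros, applies the log transform, and re-runs the normality test:

```python
import numpy as np
import pandas as pd
from scipy.stats import normaltest

# Hypothetical right-skewed page-view counts (log-normal, for illustration only).
rng = np.random.default_rng(1)
df_visits = pd.DataFrame({"views": rng.lognormal(mean=3.0, sigma=1.0, size=30_000)})

# log(0) is undefined, so keep only strictly positive counts first.
df_pos = df_visits[df_visits["views"] > 0]
df_log = np.log(df_pos)

k2_raw, p_raw = normaltest(df_visits["views"])
k2_log, p_log = normaltest(df_log["views"])
print(p_raw)  # effectively zero: raw counts are strongly non-normal
print(p_log)  # typically much larger: the transformed data look normal
```

If the transformed data still fail the test, the distribution was not log-normal to begin with; in that case a Box-Cox transform (`scipy.stats.boxcox`) lets you fit the power parameter to the data instead of assuming the log.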
[Image: distribution before vs. after the log transform]