
How to convert data into normal distribution

I have a data set consisting of the number of page views over 6 months for 30k customers. It also contains the following:

  • Number of unique OS used
  • Number of unique browsers used
  • Number of unique cookies used

All these numbers are taken over a period of six months.

I tried running a normality test using:

from scipy.stats import normaltest
k2, p = normaltest(df)
print(p)

This returns 0.0, meaning the data does not follow a normal distribution.

Why is that? I thought that, in general, data looks more normal as the sample size increases. With 30k rows, I don't understand why it isn't normally distributed.

I also tried converting the values to z-scores, but that didn't help. Can I transform my data so that it follows a normal distribution? Is there a method for doing that?

Kshitij Yadav asked Mar 05 '26 22:03



1 Answer

In my field (mass spectrometry) we typically log-transform data that is heteroscedastic, as yours probably is. Small values are far more likely than large ones, so we end up with a right-skewed, roughly exponential distribution.

I'm guessing your data looks like mine, in which case a log transform should bring it much closer to a normal distribution. I do this so that I can apply t-tests and other statistical models.

Something like

import numpy as np
df_visits = np.log(df_visits)

Of course, you will also need to get rid of any zeros before you can log-transform, since the log of zero is undefined.
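As a rough sketch of the zero handling, here are two common options: drop the zero rows and take the plain log, or keep every row and use log(1 + x) via np.log1p. The data below is synthetic and the column name page_views is made up for illustration; it only stands in for right-skewed counts like the ones in the question. Skewness is used as a quick check of how symmetric the result is. With 30k rows, a formal test such as normaltest may still reject normality even when the transformed data looks reasonably bell-shaped.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic stand-in for the real data (assumption: right-skewed counts, a few zeros).
rng = np.random.default_rng(0)
views = rng.lognormal(mean=3.0, sigma=1.0, size=30_000).astype(int)
df_visits = pd.DataFrame({"page_views": views})

# Option 1: drop rows with zero views, then take the plain log.
nonzero = df_visits[df_visits["page_views"] > 0]
logged = np.log(nonzero["page_views"])

# Option 2: keep every row and use log(1 + x), which maps 0 to 0.
logged1p = np.log1p(df_visits["page_views"])

# Skewness near 0 means roughly symmetric; large positive means a long right tail.
skew_raw = skew(df_visits["page_views"])
skew_log = skew(logged)
skew_log1p = skew(logged1p)
print(f"raw: {skew_raw:.2f}  log: {skew_log:.2f}  log1p: {skew_log1p:.2f}")
```

Which option fits depends on whether a zero is a meaningful observation (keep it, use log1p) or effectively missing data (drop it).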

[Image: data before vs. after log transform]

Sean O'Callaghan answered Mar 07 '26 12:03
