I have a data set consisting of the number of page views over 6 months for 30k customers. It also consists of the following:
Now I did try to run a normality test using:
from scipy.stats import normaltest
k2, p = normaltest(df)
print(p)
This returns a p-value of 0.0, so the test rejects the hypothesis that the data follow a normal distribution.
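For context, here is a minimal, self-contained sketch of that test on simulated data (the exponential sample is a hypothetical stand-in for skewed page-view counts, not your actual data frame):

```python
from scipy.stats import normaltest
import numpy as np

# Hypothetical right-skewed sample standing in for page-view counts.
rng = np.random.default_rng(0)
views = rng.exponential(scale=50, size=30_000)

k2, p = normaltest(views)
# A p-value below the usual 0.05 threshold rejects normality.
print(p < 0.05)
```

Note that `normaltest` tests the null hypothesis that the sample was drawn from a normal distribution; a tiny p-value means "reject normality", not a measure of how non-normal the data are.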
Now I want to know why that is. I thought that, generally, as the sample size increases, data become normally distributed; since my data set has 30k rows, I don't understand why it isn't normally distributed.
I did try converting the values into Z-scores, but still no luck. Can I transform my data so that it follows a normal distribution? Is there a method for doing that?
In the area I work in (mass spectrometry), we typically log-transform data that is heteroscedastic, as yours probably is. In my field, small values are far more likely than large ones, so we end up with an exponential distribution.
I'm guessing your data will look like mine, in which case you will need to log-transform it to make it approximately normal. I would do this so that I can apply t-tests and other statistical models.
Something like:

import numpy as np

df_visits = np.log(df_visits)

Of course, you will also need to get rid of any zeros before you can log-transform, since log(0) is undefined (alternatively, np.log1p maps 0 to 0).
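To illustrate the whole workflow, here is a hedged sketch on simulated data. It assumes the counts are roughly log-normal (an assumption for illustration; the asker's real data may differ), drops the zeros, applies the log transform, and re-runs the normality test:

```python
import numpy as np
import pandas as pd
from scipy.stats import normaltest

# Hypothetical right-skewed page-view counts (log-normal, for illustration only).
rng = np.random.default_rng(1)
df_visits = pd.DataFrame({"views": rng.lognormal(mean=3.0, sigma=1.0, size=30_000)})

# log(0) is undefined, so keep only strictly positive counts first.
df_pos = df_visits[df_visits["views"] > 0]
df_log = np.log(df_pos)

k2_raw, p_raw = normaltest(df_visits["views"])
k2_log, p_log = normaltest(df_log["views"])
print(p_raw)  # effectively zero: raw counts are strongly non-normal
print(p_log)  # typically much larger: the transformed data look normal
```

If the transformed data still fail the test, the distribution was not log-normal to begin with; in that case a Box-Cox transform (`scipy.stats.boxcox`) lets you fit the power parameter to the data instead of assuming the log.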
[Image: distribution before vs. after the log transform]