
How to normalize a non-normal distribution?

[image: histogram of the distribution]

I have the above distribution with a mean of -0.02, standard deviation of 0.09 and with a sample size of 13905.

I am not sure why the distribution is left-skewed given the large sample size. The bin from -2.0 to -0.5 contains only 10 samples (outliers), which explains the shape.

Is it possible to normalize the data so that the distribution is smoother and closer to 'normal'? The purpose is to feed it into a model while reducing the standard error of the predictor.

Chipmunkafy asked Dec 05 '18 03:12


People also ask

How do you fix non normally distributed data?

You can choose to transform the data with a function, forcing it to fit a normal model. However, if you have a very small sample, a sample that is skewed, or one that naturally fits another distribution type, you may want to run a non-parametric test instead.

Can you standardize a non normal distribution?

The short answer: you can standardize any data, but you still need to worry about the distribution not being normal, because standardization only shifts and rescales the data and does not change the underlying shape of the distribution. Only if X ∼ N(μ, σ²) does standardizing give a standard normal: Y := (X − μ)/σ ∼ N(0, 1).
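A minimal sketch of that point, assuming a synthetic right-skewed sample (the exponential data and the seed are illustrative choices, not from the question): after standardizing, the mean and standard deviation become 0 and 1, but the skewness is unchanged.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical sample: exponential data is right-skewed, clearly non-normal.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)

# Standardize: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

skew_before = skew(x)
skew_after = skew(z)

print(f"mean={z.mean():.3f}, std={z.std():.3f}")
print(f"skewness before={skew_before:.3f}, after={skew_after:.3f}")
```

The two skewness values agree up to floating-point error, which is why a z-score alone cannot make skewed data normal.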

Can I use Z score for non normal distribution?

It is worth noting that the z-score can be used for non-normal distributions by way of Chebyshev's inequality. It states that for any distribution with a finite mean and variance (including non-normal ones), at least 75% of the values lie within two standard deviations of the mean and at least 88.9% lie within three standard deviations.
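As a rough check of those bounds, the sketch below compares the Chebyshev lower bound 1 − 1/k² with the observed fraction of a skewed sample inside k standard deviations. The helper name `chebyshev_lower_bound` and the exponential sample are assumptions made for illustration.

```python
import numpy as np

def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    return 1.0 - 1.0 / k**2

# Hypothetical skewed sample, standardized with its own mean and std.
rng = np.random.default_rng(0)
x = rng.exponential(size=100_000)
z = (x - x.mean()) / x.std()

observed = {}
for k in (2, 3):
    observed[k] = float(np.mean(np.abs(z) < k))
    print(f"k={k}: bound={chebyshev_lower_bound(k):.3f}, observed={observed[k]:.3f}")
```

The observed fractions are typically well above the bound; Chebyshev gives a guarantee, not an estimate.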

Can you use standard deviation for a non normal distribution?

Yes, whether the distribution is normal or not. The standard deviation is the square root of the mean squared deviation from the mean, so it tells you how spread out the data are around the mean regardless of the distribution.


1 Answer

You have two options here: a Box-Cox transform or a Yeo-Johnson transform. The issue with Box-Cox is that it applies only to positive numbers. To use it on data with negative values, you'll have to take the exponential first, perform the Box-Cox transform, and then take the log to get the data back towards the original scale. Box-Cox is available as scipy.stats.boxcox.
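A hedged sketch of the first two Box-Cox steps described above, using the same sample values as the Yeo-Johnson example in this answer. The variable names are illustrative; `scipy.stats.boxcox` returns the transformed data and the fitted lambda when no lambda is passed.

```python
import numpy as np
from scipy.stats import boxcox

data = np.array([-0.35714286, -0.28571429, -0.00257143, -0.00271429,
                 -0.00142857, 0., 0., 0., 0.00142857, 0.00285714,
                 0.00714286, 0.00714286, 0.01, 0.01428571, 0.01428571,
                 0.01428571, 0.01428571, 0.01428571, 0.01428571,
                 0.02142857, 0.07142857])

# Step 1: make every value strictly positive so Box-Cox is defined.
positive = np.exp(data)

# Step 2: fit lambda by maximum likelihood and transform (1-D input required).
bc_data, fitted_lambda = boxcox(positive)

print("fitted lambda:", fitted_lambda)
print("any negative outputs:", bool((bc_data < 0).any()))
```

Note that Box-Cox maps inputs below 1 to negative values, so the final log step cannot be applied blindly to the output; this sketch stops before it.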

You can avoid those steps and simply use the Yeo-Johnson transform, which handles zero and negative values directly. sklearn provides PowerTransformer for that:

import numpy as np
from scipy.stats import normaltest
from sklearn.preprocessing import PowerTransformer

data = np.array([-0.35714286, -0.28571429, -0.00257143, -0.00271429, -0.00142857, 0., 0., 0., 0.00142857, 0.00285714, 0.00714286, 0.00714286, 0.01, 0.01428571, 0.01428571, 0.01428571, 0.01428571, 0.01428571, 0.01428571, 0.02142857, 0.07142857])

# PowerTransformer expects a 2-D array of shape (n_samples, n_features)
data = data.reshape(-1, 1)

# Fit the Yeo-Johnson lambda by maximum likelihood, then apply the transform
pt = PowerTransformer(method='yeo-johnson')
pt.fit(data)
transformed_data = pt.transform(data)
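To see what was fitted and to map values back to the original scale, a sketch along these lines may help. It relies on the `lambdas_` attribute and `inverse_transform` method of `PowerTransformer`; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([-0.35714286, -0.28571429, -0.00257143, -0.00271429,
                 -0.00142857, 0., 0., 0., 0.00142857, 0.00285714,
                 0.00714286, 0.00714286, 0.01, 0.01428571, 0.01428571,
                 0.01428571, 0.01428571, 0.01428571, 0.01428571,
                 0.02142857, 0.07142857]).reshape(-1, 1)

pt = PowerTransformer(method='yeo-johnson')  # standardize=True by default
transformed = pt.fit_transform(data)

# One fitted lambda per feature (column).
print("fitted lambda:", pt.lambdas_)

# Undo the transform, e.g. to report model outputs in original units.
recovered = pt.inverse_transform(transformed)
print("round trip ok:", np.allclose(recovered, data, atol=1e-6))
```

Because `standardize=True` is the default, the transformed column also has zero mean and unit variance.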

We have transformed our data but we need a way to measure and see if we have moved in the right direction. Since our goal was to move towards being a normal distribution, we will use a normality test.

# ravel() flattens the (n, 1) arrays so normaltest returns scalars
k2, p = normaltest(data.ravel())
transformed_k2, transformed_p = normaltest(transformed_data.ravel())

The test returns two values, k2 and p. The value of interest here is p. If p is less than some threshold (e.g. 0.001 or so), we can reject the hypothesis that the data comes from a normal distribution.

In the example above, you should see that p is less than 0.001 while transformed_p is greater than this threshold, indicating that we are moving in the right direction.
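The decision rule can be wrapped in a small helper. This is only a sketch: `rejects_normality` is a hypothetical name, and the 0.001 threshold is the one used in this answer, not a universal choice. The null hypothesis of `normaltest` is that the sample comes from a normal distribution, so a small p-value is evidence against normality.

```python
import numpy as np
from scipy.stats import normaltest

def rejects_normality(sample, alpha=0.001):
    """Return True when the normaltest p-value falls below alpha."""
    _, p = normaltest(np.ravel(sample))
    return bool(p < alpha)

data = np.array([-0.35714286, -0.28571429, -0.00257143, -0.00271429,
                 -0.00142857, 0., 0., 0., 0.00142857, 0.00285714,
                 0.00714286, 0.00714286, 0.01, 0.01428571, 0.01428571,
                 0.01428571, 0.01428571, 0.01428571, 0.01428571,
                 0.02142857, 0.07142857])

# The raw sample has two large negative outliers, so normality is rejected.
print("raw data rejected:", rejects_normality(data))
```

Failing to reject does not prove the data is normal; with small samples the test simply has little power.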

Clock Slave answered Sep 23 '22 17:09