I have the above distribution with a mean of -0.02, a standard deviation of 0.09, and a sample size of 13905.
I am just not sure why the distribution is left-skewed given the large sample size. In the bin [-2.0, -0.5] there are only 10 samples (outliers), which explains the shape.
I am wondering whether it is possible to normalize the data to make it smoother and closer to a normal distribution. The purpose is to feed it into a model while reducing the standard error of the predictor.
Dealing with Non-Normal Distributions: you can transform the data with a function, forcing it to fit a normal model. However, if you have a very small sample, a skewed sample, or one that naturally fits another distribution type, you may want to run a non-parametric test instead.
The short answer: yes, you do need to worry about your data's distribution not being normal, because standardization does not change the underlying shape of the distribution. If X ∼ N(μ, σ²), then you can transform it to a standard normal by standardizing: Y := (X − μ)/σ ∼ N(0, 1). But if X is skewed, the standardized variable is exactly as skewed, just shifted and rescaled.
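To illustrate, here is a minimal sketch (using made-up exponential data, which is right-skewed) showing that standardizing shifts and rescales a sample but leaves its skewness untouched:

```python
import numpy as np
from scipy.stats import skew

# Hypothetical skewed sample: exponential data is right-skewed
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10_000)

# Standardize: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

# The mean and sd change, but the shape (skewness) does not:
# the two printed values are identical up to floating point
print(skew(x), skew(z))
```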
It is worth noting that z-scores can still be interpreted for non-normal distributions via Chebyshev's inequality. This theorem states that for any distribution (normal or not), at least 75% of the values lie within two standard deviations of the mean and at least 88.9% lie within three standard deviations.
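A quick sanity check of Chebyshev's bounds on a deliberately non-normal (here, made-up exponential) sample:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(size=100_000)  # clearly non-normal

mu, sigma = x.mean(), x.std()
within2 = np.mean(np.abs(x - mu) <= 2 * sigma)
within3 = np.mean(np.abs(x - mu) <= 3 * sigma)

# Chebyshev guarantees at least 1 - 1/k^2 of the mass within k sd:
# at least 0.75 within 2 sd, at least ~0.889 within 3 sd
print(within2 >= 0.75, within3 >= 1 - 1 / 9)
```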
The standard deviation is defined the same way whether the distribution is normal or not. Specifically, it is the square root of the mean squared deviation from the mean. So the standard deviation tells you how spread out the data are around the mean, regardless of distribution.
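The definition above can be checked directly against NumPy (the numbers are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])  # made-up numbers

# Standard deviation = square root of the mean squared deviation from the mean
manual = np.sqrt(np.mean((x - x.mean()) ** 2))

print(np.isclose(manual, x.std()))  # → True
```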
You have two options here: the Box-Cox transform or the Yeo-Johnson transform. The issue with the Box-Cox transform is that it applies only to positive numbers. To use it on data that includes non-positive values, you would have to exponentiate the data, perform the Box-Cox transform, and then take the log to get back to the original scale. The Box-Cox transform is available in scipy.stats.
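For reference, a minimal sketch of scipy.stats.boxcox on made-up positive, right-skewed data; boxcox returns both the transformed values and the fitted λ:

```python
import numpy as np
from scipy.stats import boxcox

# Box-Cox requires strictly positive data; lognormal data qualifies
rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed, positive

# boxcox returns the transformed data and the lambda chosen by MLE.
# For lognormal data, lambda should come out near 0 (a log transform).
y, lam = boxcox(x)
```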
You can avoid those steps and simply use the Yeo-Johnson transform; sklearn provides an API for that:
from scipy.stats import normaltest
import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([-0.35714286, -0.28571429, -0.00257143, -0.00271429, -0.00142857, 0., 0., 0., 0.00142857, 0.00285714, 0.00714286, 0.00714286, 0.01, 0.01428571, 0.01428571, 0.01428571, 0.01428571, 0.01428571, 0.01428571, 0.02142857, 0.07142857])

# Yeo-Johnson handles zero and negative values, unlike Box-Cox
pt = PowerTransformer(method='yeo-johnson')

# sklearn expects a 2-D array of shape (n_samples, n_features)
data = data.reshape(-1, 1)
pt.fit(data)
transformed_data = pt.transform(data)
We have transformed our data, but we need a way to measure whether we have moved in the right direction. Since our goal was to move toward a normal distribution, we will use a normality test.
k2, p = normaltest(data)
transformed_k2, transformed_p = normaltest(transformed_data)
The test returns two values, k2 and p. The value of p is of interest here.
If p is less than some threshold (e.g. 0.001 or so), we reject the hypothesis that data comes from a normal distribution. In the example above, you'll see that p is less than 0.001 while transformed_p is greater than this threshold, indicating that we are moving in the right direction.
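If the transformed values feed a model whose predictions you later need on the original scale, PowerTransformer also provides inverse_transform, which undoes both the power transform and the standardization. A small round-trip sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([-0.357, -0.286, 0.0, 0.014, 0.071]).reshape(-1, 1)

pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(data)

# Round-trip back to the original scale
restored = pt.inverse_transform(transformed)
print(np.allclose(restored, data))  # → True
```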