
Trying to compare two distributions

I found this code on the internet that compares a normal distribution to several Student's t distributions:

x <- seq(-4, 4, length=100)
hx <- dnorm(x)

degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")

plot(x, hx, type="l", lty=2, xlab="x value",
  ylab="Density", main="Comparison of t Distributions")

for (i in 1:4){
  lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}

I would like to adapt this to my situation, where I want to compare my data to a normal distribution. This is my data:

library(quantmod)
getSymbols("^NDX",src="yahoo", from='1997-6-01', to='2012-6-01')
daily<- allReturns(NDX) [,c('daily')]
dailySerieTemporel<-ts(data=daily)
ss<-na.omit(dailySerieTemporel)

The objective is to see whether my data is normally distributed or not. Can someone help me out a bit with this? Thank you very much, I really appreciate it!

jeremy.staub asked Aug 05 '12 22:08




2 Answers

If you are only concerned with knowing whether your data is normally distributed, you can apply the Jarque-Bera test. Under the null hypothesis of this test, your data is normally distributed; see details here. You can perform this test using the jarque.bera.test function.

 library(tseries)
 jarque.bera.test(ss)

    Jarque Bera Test

data:  ss 
X-squared = 4100.781, df = 2, p-value < 2.2e-16

From the result, you can see that your data is not normally distributed: the null hypothesis is rejected even at the 1% significance level.

To see why your data is not normally distributed, you can take a look at the descriptive statistics:

 library(fBasics)
 basicStats(ss)
                     ss
nobs        3776.000000
NAs            0.000000
Minimum       -0.105195
Maximum        0.187713
1. Quartile   -0.009417
3. Quartile    0.010220
Mean           0.000462
Median         0.001224
Sum            1.745798
SE Mean        0.000336
LCL Mean      -0.000197
UCL Mean       0.001122
Variance       0.000427
Stdev          0.020671
Skewness       0.322820
Kurtosis       5.060026

From the last two rows, you can see that ss has excess kurtosis (heavier tails than a normal distribution) and that its skewness is not zero. These two quantities are the basis of the Jarque-Bera test.
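To make that connection concrete, here is a minimal sketch of how the Jarque-Bera statistic can be computed by hand from the sample skewness S and excess kurtosis K, using JB = n/6 * (S^2 + K^2/4), which is compared against a chi-squared distribution with 2 degrees of freedom. The vector x below is a simulated, heavy-tailed stand-in (an assumption for illustration, not your NDX returns); swap in as.numeric(ss) to use your own data.

```r
# Stand-in data (an assumption for illustration): heavy-tailed "returns".
# Replace with your own series, e.g. x <- as.numeric(ss).
set.seed(42)
x <- 0.02 * rt(3776, df = 5)

n  <- length(x)
m  <- mean(x)
m2 <- mean((x - m)^2)  # central moments, dividing by n
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)

skew        <- m3 / m2^1.5      # 0 for a normal distribution
excess_kurt <- m4 / m2^2 - 3    # 0 for a normal distribution

# Jarque-Bera statistic and its p-value under chi-squared with 2 df
jb      <- n / 6 * (skew^2 + excess_kurt^2 / 4)
p_value <- pchisq(jb, df = 2, lower.tail = FALSE)

cat("skewness:", skew, " excess kurtosis:", excess_kurt, "\n")
cat("JB:", jb, " p-value:", p_value, "\n")
```

Plugging the rounded values from the table above (n = 3776, S = 0.322820, K = 5.060026) into the same formula gives about 4094, close to the 4100.781 reported by jarque.bera.test; the small gap is likely due to the two packages normalizing the moments slightly differently.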

But if you are interested in comparing the actual distribution of your data against a normally distributed random variable with the same mean and variance as your data, you can proceed in two steps. First, estimate the empirical density function of your data using a kernel and plot it. Then generate a normal random sample with the same mean and variance as your data and overlay its estimated density. Do something like this:

 plot(density(ss, kernel='epanechnikov'))
 set.seed(125)
 lines(density(rnorm(length(ss), mean(ss), sd(ss)), kernel='epanechnikov'), col=2)


In the same fashion, you can overlay curves generated from other probability distributions.
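As a variation closer to the code in the question, the theoretical normal density can also be drawn directly with dnorm instead of simulating a sample, which avoids the sampling noise of rnorm. This is only a sketch: x is a simulated stand-in for your data (an assumption), and the title, grid size, and colors are arbitrary choices.

```r
# Stand-in data (an assumption for illustration).
# Replace with your own series, e.g. x <- as.numeric(ss).
set.seed(42)
x <- 0.02 * rt(3776, df = 5)

# Kernel density estimate of the data
d <- density(x, kernel = "epanechnikov")
plot(d, lwd = 2, xlab = "x value", ylab = "Density",
     main = "Data vs. normal with same mean and sd")

# Exact normal density with the same mean and standard deviation
grid_x <- seq(min(x), max(x), length.out = 200)
lines(grid_x, dnorm(grid_x, mean = mean(x), sd = sd(x)),
      lty = 2, lwd = 2, col = "red")

legend("topright", inset = 0.05,
       legend = c("data (kernel estimate)", "normal"),
       lty = c(1, 2), lwd = 2, col = c("black", "red"))
```

With heavy-tailed data, the kernel estimate typically shows a taller, narrower peak and fatter tails than the dashed normal curve.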

The tests suggested by @Alex Reynolds will help you if you want to find out which distribution your data may have been drawn from. If that is your goal, take a look at the goodness-of-fit tests covered in any statistics textbook. Nevertheless, if you just want to know whether your variable is normally distributed, the Jarque-Bera test is good enough.

Jilber Urbina answered Sep 22 '22 20:09



Take a look at a Q-Q plot, or at the Shapiro-Wilk or Kolmogorov-Smirnov (K-S) tests, to see if your data are normally distributed.
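A minimal sketch of all three using base R's stats package follows. Here x is a simulated stand-in for your data (an assumption for illustration). Note that shapiro.test only accepts between 3 and 5000 observations, and that the K-S p-value is only approximate when the mean and standard deviation are estimated from the same data.

```r
# Stand-in data (an assumption for illustration).
# Replace with your own series, e.g. x <- as.numeric(ss).
set.seed(42)
x <- 0.02 * rt(3776, df = 5)

# Q-Q plot: points should follow the reference line if x is normal
qqnorm(x)
qqline(x, col = "red")

# Shapiro-Wilk test (null hypothesis: x is normally distributed)
sw <- shapiro.test(x)

# Kolmogorov-Smirnov test against a normal with the sample mean and sd
ks <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

print(sw)
print(ks)
```

A small p-value in either test is evidence against normality; on the Q-Q plot, heavy tails show up as points bending away from the line at both ends.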

Alex Reynolds answered Sep 25 '22 20:09
