Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Plotting Probability Density / Mass Function of Dataset in R

Tags:

I have a dataset and I want to analyse these data with a probability density function or a probability mass function in R. I used a density function but it didn't gave me the probability.

My data are like this:

"step","Time","energy" 1, 22469 , 392.96E-03 2, 22547 , 394.82E-03 3, 22828,400.72E-03 4, 21765, 383.51E-03 5, 21516, 379.85E-03 6, 21453, 379.89E-03 7, 22156, 387.47E-03 8, 21844, 384.09E-03 9 , 21250, 376.14E-03 10,  21703, 380.83E-03 

I want to the get PDF/PMF for the energy vector ; the data we take into account are discrete in nature so I don't have any special type for the distribution of the data.

like image 562
Alaa Brihi Avatar asked Aug 07 '11 15:08

Alaa Brihi


People also ask

How do you plot probability mass function in R?

To plot the probability mass function for a binomial distribution in R, we can use the following functions: dbinom(x, size, prob) to create the probability mass function. plot(x, y, type = 'h') to plot the probability mass function, specifying the plot to be a histogram (type='h')

How do I plot a CDF in R?

To plot a CDF function in base R, we first calculate the CDF by using the ecdf() function. Then we use the plot() function to plot the CDF plot in the R Language. The plot function takes the result of the ecdf() function as an argument to plot the CDF plot.

What is probability mass function and density function?

Probability mass and density functions are used to describe discrete and continuous probability distributions, respectively. This allows us to determine the probability of an observation being exactly equal to a target value (discrete) or within a set range around our target value (continuous).

What is R in probability density function?

R Functions for Probability Distributions d for "density", the density function (p. f. or p. d. f.) r for "random", a random variable having the specified distribution.


1 Answers

Your data looks far from discrete to me. Expecting a probability when working with continuous data is plain wrong. density() gives you an empirical density function, which approximates the true density function. To prove it is a correct density, we calculate the area under the curve :

energy <- rnorm(100) dens <- density(energy) sum(dens$y)*diff(dens$x[1:2]) [1] 1.000952 

Given some rounding error. the area under the curve sums up to one, and hence the outcome of density() fulfills the requirements of a PDF.

Use the probability=TRUE option of hist or the function density() (or both)

eg :

hist(energy,probability=TRUE) lines(density(energy),col="red") 

gives

enter image description here

If you really need a probability for a discrete variable, you use:

 x <- sample(letters[1:4],1000,replace=TRUE)  prop.table(table(x)) x     a     b     c     d  0.244 0.262 0.275 0.219  

Edit : illustration why the naive count(x)/sum(count(x)) is not a solution. Indeed, it's not because the values of the bins sum to one, that the area under the curve does. For that, you have to multiply with the width of the 'bins'. Take the normal distribution, for which we can calculate the PDF using dnorm(). Following code constructs a normal distribution, calculates the density, and compares with the naive solution :

x <- sort(rnorm(100,0,0.5)) h <- hist(x,plot=FALSE) dens1 <-  h$counts/sum(h$counts) dens2 <- dnorm(x,0,0.5)  hist(x,probability=TRUE,breaks="fd",ylim=c(0,1)) lines(h$mids,dens1,col="red") lines(x,dens2,col="darkgreen") 

Gives :

enter image description here


The cumulative distribution function

In case @Iterator was right, it's rather easy to construct the cumulative distribution function from the density. The CDF is the integral of the PDF. In the case of the discrete values, that simply the sum of the probabilities. For the continuous values, we can use the fact that the intervals for the estimation of the empirical density are equal, and calculate :

cdf <- cumsum(dens$y * diff(dens$x[1:2])) cdf <- cdf / max(cdf) # to correct for the rounding errors plot(dens$x,cdf,type="l") 

Gives :

enter image description here

like image 118
Joris Meys Avatar answered Dec 29 '22 06:12

Joris Meys