Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting Probability Density of Data

I need to analyze some data about internet sessions for a DSL Line. I wanted to have a look at how the session durations are distributed. I figured a simple way to do this would be to begin by making a probability density plot of the duration of all the sessions.

I have loaded the data in R and used the density() function. So, it was something like this

plot(density(data$duration), type = "l", col = "blue", main = "Density Plot of Duration",
     xlab = "duration(h)", ylab = "probability density")

I am new to R and this kind of analysis. This was what I found from going through google. I got a plot but I was left with some questions. Is this the right function to do what I am trying to do or is there something else?

In the plot I found that the Y-axis scale was from 0...1.5. I don't get how it can be 1.5, shouldn't it be from 0...1?

Also, I would like to get a smoother curve. Since, the data set is really large the lines are really jagged. It would be nicer to have them smoothed out when I am presenting this. How would I go about doing that?

like image 388
sfactor Avatar asked Nov 18 '10 12:11

sfactor


People also ask

How do you find the probability density of data?

To get the probability from a probability density function, we need to integrate the area under the curve for a certain interval. The probability= Area under the curve = density X interval length. In our example, the interval length = 131-41 = 90 so the area under the curve = 0.011 X 90 = 0.99 or ~1.

How do you find probability with probability density?

Therefore, probability is simply the multiplication between probability density values (Y-axis) and tips amount (X-axis). The multiplication is done on each evaluation point and these multiplied values will then be summed up to calculate the final probability.

How do you find the probability density function from a graph?

To find the probability of a variable falling between points a and b, you need to find the area of the curve between a and b. As the probability cannot be more than P(b) and less than P(a), you can represent it as: P(a) <= X <= P(b). Consider the graph below, which shows the rainfall distribution in a year in a city.


1 Answers

As nico said, you should check out hist, but you can also combine the two of them. Then you could call the density with lines instead. Example:

duration <- rpois(500, 10) # For duration data I assume Poisson distributed
hist(duration,
   probability = TRUE, # In stead of frequency
   breaks = "FD",      # For more breaks than the default
   col = "darkslategray4", border = "seashell3")
lines(density(duration - 0.5),   # Add the kernel density estimate (-.5 fix for the bins)
   col = "firebrick2", lwd = 3)

Should give you something like: Histogram of duration

Note that the kernel density estimate assumes a Gaussian kernel as default. But the bandwidth is often the most important factor. If you call density directly it reports the default estimated bandwidth:

> density(duration)

Call:
        density.default(x = duration)

Data: duration (500 obs.);      Bandwidth 'bw' = 0.7752

       x                 y            
 Min.   : 0.6745   Min.   :1.160e-05  
 1st Qu.: 7.0872   1st Qu.:1.038e-03  
 Median :13.5000   Median :1.932e-02  
 Mean   :13.5000   Mean   :3.895e-02  
 3rd Qu.:19.9128   3rd Qu.:7.521e-02  
 Max.   :26.3255   Max.   :1.164e-01  

Here it is 0.7752. Check it for your data and play around with it as nico suggested. You might want to look at ?bw.nrd.

like image 129
eyjo Avatar answered Oct 25 '22 07:10

eyjo