Getting Probability Density of Data

Tags: plot, r, probability, data-analysis, distribution

I need to analyze some data about internet sessions for a DSL line, and I want to see how the session durations are distributed. I figured a simple way to start would be a probability density plot of the duration of all the sessions.

I have loaded the data in R and used the density() function, something like this:

plot(density(data$duration), type = "l", col = "blue", main = "Density Plot of Duration",
     xlab = "duration(h)", ylab = "probability density")

I am new to R and to this kind of analysis, and this is what I found by searching Google. I got a plot, but I was left with some questions. Is this the right function for what I am trying to do, or is there something better?

In the plot, the y-axis runs from 0 to 1.5. I don't understand how it can reach 1.5; shouldn't it run from 0 to 1?

Also, I would like a smoother curve. Since the data set is really large, the lines are really jagged, and it would be nicer to have them smoothed out when I present this. How would I go about doing that?

asked Nov 18 '10 12:11

sfactor

1 Answer

As nico said, you should check out hist(), but you can also combine the two: draw the histogram first, then add the density estimate on top with lines() instead of plot(). Example:

duration <- rpois(500, 10) # Simulated durations; Poisson-distributed is an assumption
hist(duration,
   probability = TRUE, # Density scale instead of frequency counts
   breaks = "FD",      # Freedman-Diaconis rule, for more breaks than the default
   col = "darkslategray4", border = "seashell3")
lines(density(duration - 0.5),   # Add the kernel density estimate (-0.5 shift to align with the bins)
   col = "firebrick2", lwd = 3)

This should give you something like:

[Image: Histogram of duration, https://i.stack.imgur.com/qwtpF.png]
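On the y-axis going above 1: both hist(probability = TRUE) and density() plot a density, not a probability, so it is the area under the curve that equals 1, not the height. If the values are concentrated in an interval narrower than 1 unit, the curve has to rise above 1. Here is a minimal sketch with made-up data (uniform "durations" between 0 and 0.5 hours are purely an assumption for illustration, as are the variable names):

```r
set.seed(1)                                   # arbitrary seed, only for reproducibility
duration_h <- runif(1000, min = 0, max = 0.5) # hypothetical durations in hours

d <- density(duration_h)                      # Gaussian kernel, default bandwidth

# Highest point of the estimated curve; roughly 2 here, because all the
# probability mass is squeezed into an interval of width 0.5
max_height <- max(d$y)

# Trapezoidal approximation of the area under the estimated curve;
# this should be very close to 1 regardless of the curve's height
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

print(max_height)
print(area)
```

So a y-axis that runs up to 1.5 is not a sign that anything went wrong; it depends on the units of the x-axis (durations in hours give taller curves than the same durations in minutes).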

Note that density() uses a Gaussian kernel by default, but the bandwidth usually matters more than the choice of kernel. If you call density() directly, it prints the bandwidth it estimated by default:

> density(duration)

Call:
        density.default(x = duration)

Data: duration (500 obs.);      Bandwidth 'bw' = 0.7752

       x                 y            
 Min.   : 0.6745   Min.   :1.160e-05  
 1st Qu.: 7.0872   1st Qu.:1.038e-03  
 Median :13.5000   Median :1.932e-02  
 Mean   :13.5000   Mean   :3.895e-02  
 3rd Qu.:19.9128   3rd Qu.:7.521e-02  
 Max.   :26.3255   Max.   :1.164e-01

Here it is 0.7752. Check what it is for your data and play around with it, as nico suggested; a larger bandwidth gives a smoother curve. You might also want to look at ?bw.nrd.
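As a rough sketch of what "playing around" with the bandwidth could look like, density() accepts an adjust multiplier and a bw argument that takes either a number or the name of a selector such as "nrd". The data below is the same simulated Poisson sample as above, and the choice of adjust = 2 is arbitrary; you would tune it by eye for your own data.

```r
set.seed(42)                          # arbitrary seed, only for reproducibility
duration <- rpois(500, 10)            # simulated data, as in the example above

d_default <- density(duration)               # default selector (bw.nrd0)
d_smooth  <- density(duration, adjust = 2)   # twice the default bandwidth
d_nrd     <- density(duration, bw = "nrd")   # the bw.nrd rule instead

# Compare the bandwidths that were actually used
print(c(default = d_default$bw, smooth = d_smooth$bw, nrd = d_nrd$bw))

# Overlay the three estimates; the wider the bandwidth, the smoother the curve
plot(d_default, main = "Effect of bandwidth", xlab = "duration")
lines(d_smooth, col = "firebrick2", lwd = 2)
lines(d_nrd, col = "darkslategray4", lty = 2)
legend("topright",
       legend = c("default", "adjust = 2", "bw = \"nrd\""),
       col = c("black", "firebrick2", "darkslategray4"),
       lty = c(1, 1, 2), lwd = c(1, 2, 1))
```

Using adjust is usually the more convenient knob, since it scales whatever bandwidth the selector picked rather than requiring an absolute value in the units of the data.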

answered Oct 25 '22 07:10

eyjo
