Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Force R to plot histogram as probability (relative frequency)

Tags:

r

histogram

I am having trouble plotting a histogram as a pdf (probability)

I want the sum of all the pieces to equal an area of one so it's easier to compare across datasets. For some reason, whenever I specify the breaks (the default of 4 or whatever is terrible), it no longer wants to plot bins as a probability and instead plots bins as a frequency count.

hist(data[,1], freq = FALSE, xlim = c(-1,1), breaks = 800)

What should I change this line to? I need a probability distribution and a large number of bins. (I have 6 million data points)

This is in the R help, but I don't know how to override it:

freq logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).

Thanks

edit: details

hmm so my plot goes above 1 which is quite confusing if it's a probability. I see how it has to do with the bin width now. I more or less want to make every bin worth 1 point while still having a lot of bins. In other words, no bin height should be above 1.0 unless it is directly at 1.0 and all the other bins are 0.0. As it stands now, I have a bins that make a hump around 15.0

edit: height by %points in bin @Dwin : So how do I plot the probability? I realize taking the integral will still give me 1.0 due to the units on the x axis, but this isn't what I want. Say I have 100 points and 5 of them fall into the first bin, then that bin should be at .05 height. This is what I want. Am I doing it wrong and there is another way this is done?

I know how many points I have. Is there a way to divide each bin count in the frequency histogram by this number?

like image 601
SwimBikeRun Avatar asked Jul 02 '13 02:07

SwimBikeRun


People also ask

How do you make a relative frequency histogram in R?

The relative frequency histogram can be created for the column of an R data frame or a vector that contains discrete data. For this purpose, we can use PlotRelativeFrequency function of HistogramTools package along with hist function to generate histogram.

How do you plot probability distribution in R?

To plot the probability mass function for a binomial distribution in R, we can use the following functions: dbinom(x, size, prob) to create the probability mass function. plot(x, y, type = 'h') to plot the probability mass function, specifying the plot to be a histogram (type='h')


Video Answer


4 Answers

To answer the request to plot probabilities rather than densities:

h <- hist(vec, breaks = 100, plot=FALSE)
h$counts=h$counts/sum(h$counts)
plot(h)
like image 118
IRTFM Avatar answered Oct 23 '22 20:10

IRTFM


The default number of breaks is around log2(N) where N is 6 million in your case, so should be 22. If you're only seeing 4 breaks, that could be because you have xlim in your call. This doesn't change the underlying histogram, it only affects which part of it is plotted. If you do

h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))

you should get a result of 1.


The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.

Does the value 1.0 have some kind of significance?

like image 40
Hong Ooi Avatar answered Oct 23 '22 19:10

Hong Ooi


Are you sure? This is working for me:

> vec <- rnorm(6000000)
> 
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01

Multiply the last two results and you get a probability density sum of 1. Remember that the bin width is important here.

This is with

> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.0.1
like image 22
Gavin Simpson Avatar answered Oct 23 '22 20:10

Gavin Simpson


set.seed(0)

# Define a fair coin:
coin = c(1,0)

# We tossed the coin 10 times and counted the number of heads. Repeat the experiment 20000 times.
n = 20000   # Number of experiments
flips = 10  # Number of coin flips in each experiment.

heads = colSums(replicate(n, sample(coin, flips, replace = T))) # Counts of heads in each experiment.

# The breaks are the number of possible outcomes: flips + 1

h = hist(heads, breaks = sort(unique(heads)), freq=F, 
          border=F, main = 'Histogram counts of heads',
          col=rgb(0.3,0.8,0.8,0.6), ylab='Probability', 
          xlab =  'No. of heads in 10 flips fair coin')

enter image description here


If it helps anyone landing here, check this solution:

set.seed(0)

d = rnorm(1000)
n = 1000
d = rnorm(n)

histogram = hist(d, breaks=10, prob=T, border=F)
unique(diff(histogram$breaks)) # Because the size of the base of the rectangles is 0.5, the height will be double the tru relative freq.

# The fix. Notice that I redefine the histogram simply to show how simple the call is with with this fix.
h = hist(d, plot=F)
bp = barplot(h$counts/sum(h$counts), border=F)
axis(1, at=c(bp), labels=h$mids)
title(ylab="Relative Frequency")

Thanks to this answer.

like image 43
Antoni Parellada Avatar answered Oct 23 '22 21:10

Antoni Parellada