I am having trouble plotting a histogram as a pdf (probability)
I want the sum of all the pieces to equal an area of one so it's easier to compare across datasets. For some reason, whenever I specify the breaks (the default of 4 or whatever is terrible), it no longer wants to plot bins as a probability and instead plots bins as a frequency count.
hist(data[,1], freq = FALSE, xlim = c(-1,1), breaks = 800)
What should I change this line to? I need a probability distribution and a large number of bins. (I have 6 million data points)
This is in the R help, but I don't know how to override it:
freq logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).
Thanks
edit: details
hmm so my plot goes above 1 which is quite confusing if it's a probability. I see how it has to do with the bin width now. I more or less want to make every bin worth 1 point while still having a lot of bins. In other words, no bin height should be above 1.0 unless it is directly at 1.0 and all the other bins are 0.0. As it stands now, I have a bins that make a hump around 15.0
edit: height by %points in bin @Dwin : So how do I plot the probability? I realize taking the integral will still give me 1.0 due to the units on the x axis, but this isn't what I want. Say I have 100 points and 5 of them fall into the first bin, then that bin should be at .05 height. This is what I want. Am I doing it wrong and there is another way this is done?
I know how many points I have. Is there a way to divide each bin count in the frequency histogram by this number?
The relative frequency histogram can be created for the column of an R data frame or a vector that contains discrete data. For this purpose, we can use PlotRelativeFrequency function of HistogramTools package along with hist function to generate histogram.
To plot the probability mass function for a binomial distribution in R, we can use the following functions: dbinom(x, size, prob) to create the probability mass function. plot(x, y, type = 'h') to plot the probability mass function, specifying the plot to be a histogram (type='h')
To answer the request to plot probabilities rather than densities:
h <- hist(vec, breaks = 100, plot=FALSE)
h$counts=h$counts/sum(h$counts)
plot(h)
The default number of breaks is around log2(N)
where N is 6 million in your case, so should be 22. If you're only seeing 4 breaks, that could be because you have xlim
in your call. This doesn't change the underlying histogram, it only affects which part of it is plotted. If you do
h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))
you should get a result of 1.
The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.
Does the value 1.0 have some kind of significance?
Are you sure? This is working for me:
> vec <- rnorm(6000000)
>
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01
Multiply the last two results and you get a probability density sum of 1. Remember that the bin width is important here.
This is with
> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.1
set.seed(0)
# Define a fair coin:
coin = c(1,0)
# We tossed the coin 10 times and counted the number of heads. Repeat the experiment 20000 times.
n = 20000 # Number of experiments
flips = 10 # Number of coin flips in each experiment.
heads = colSums(replicate(n, sample(coin, flips, replace = T))) # Counts of heads in each experiment.
# The breaks are the number of possible outcomes: flips + 1
h = hist(heads, breaks = sort(unique(heads)), freq=F,
border=F, main = 'Histogram counts of heads',
col=rgb(0.3,0.8,0.8,0.6), ylab='Probability',
xlab = 'No. of heads in 10 flips fair coin')
If it helps anyone landing here, check this solution:
set.seed(0)
d = rnorm(1000)
n = 1000
d = rnorm(n)
histogram = hist(d, breaks=10, prob=T, border=F)
unique(diff(histogram$breaks)) # Because the size of the base of the rectangles is 0.5, the height will be double the tru relative freq.
# The fix. Notice that I redefine the histogram simply to show how simple the call is with with this fix.
h = hist(d, plot=F)
bp = barplot(h$counts/sum(h$counts), border=F)
axis(1, at=c(bp), labels=h$mids)
title(ylab="Relative Frequency")
Thanks to this answer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With