
How can I create a histogram from aggregated data in R?

Tags: r, histogram

I have a data frame that has a format like the following:

Month       Frequency
2007-08     2
2010-11     5
2011-01     43
2011-02     52
2011-03     31
2011-04     64
2011-05     73

I would like to create a histogram from this data, using X bins (X will probably be around 15, but the actual data has over 200 months), and using the data from the frequency column as the frequency for each bin of the histogram. How can I accomplish this?

I've tried two approaches so far, with the hist() and barplot() commands. The problem with hist() is that it does not seem to give me any way to specify that I want to use the frequency column in the frequency calculations for the histogram. The problem with barplot() is that I don't have any flexibility in choosing X bins, and if there are omitted months, then the resulting graph is not actually a true histogram because the x-axis isn't continuous.

The only idea I have right now is to go with the barplot() approach, fill in the missing months with a value of 0 for Frequency, and use space=0 to remove the spacing between the bars. The problem with that is that it's not particularly easy to choose an arbitrary number of bins.
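For reference, the zero-filling idea described above could look roughly like the following in base R. This is only a sketch: the data frame name df and the helper names first, all_months and full are assumptions made for illustration, and the rows are the sample rows shown earlier rather than the full data set.

```r
# Sample rows from the question; the real data has over 200 months.
df <- data.frame(
  Month = c("2007-08", "2010-11", "2011-01", "2011-02",
            "2011-03", "2011-04", "2011-05"),
  Frequency = c(2, 5, 43, 52, 31, 64, 73)
)

# Build every month between the first and last observed month.
first <- as.Date(paste0(df$Month, "-01"))
all_months <- data.frame(
  Month = format(seq(min(first), max(first), by = "month"), "%Y-%m")
)

# Left join onto the complete month list; omitted months come back as NA.
full <- merge(all_months, df, by = "Month", all.x = TRUE)
full$Frequency[is.na(full$Frequency)] <- 0

# space = 0 removes the gaps between bars, as described in the question.
barplot(full$Frequency, names.arg = full$Month, space = 0)
```

This gives one bar per month on a gap-free axis, but it still does not let you pick an arbitrary number of bins.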

Stephen Booher asked Feb 03 '12 17:02



3 Answers

To get this kind of flexibility, you may have to replicate your data. Here is one way of doing it with rep:

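set.seed(1)  # make the random example reproducible
n <- 10
# x: bin positions, f: pre-aggregated count for each position
dat <- data.frame(
    x = sort(sample(1:50, n)),
    f = sample(1:100, n))
dat

# Repeat each row f times so that hist() sees one row per observation
expdat <- dat[rep(1:n, times=dat$f), "x", drop=FALSE]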

Now you have your data replicated in the data.frame expdat, allowing you to call hist with different numbers of bins:

par(mfcol=c(1, 2))
hist(expdat$x, breaks=50, col="blue", main="50 bins")
hist(expdat$x, breaks=5, col="blue", main="5 bins")
par(mfcol=c(1, 1))
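Applied to the monthly data from the question, the same rep approach might look like the sketch below. The names df, idx and expanded are assumptions for illustration, and the numeric month index (year * 12 + month - 1) is just one way to keep the gaps between months on a continuous axis.

```r
# Sample rows from the question; the real data has over 200 months.
df <- data.frame(
  Month = c("2007-08", "2010-11", "2011-01", "2011-02",
            "2011-03", "2011-04", "2011-05"),
  Frequency = c(2, 5, 43, 52, 31, 64, 73)
)

# Map "YYYY-MM" to a running month count so omitted months leave a gap.
year  <- as.integer(substr(df$Month, 1, 4))
month <- as.integer(substr(df$Month, 6, 7))
df$idx <- year * 12 + (month - 1)

# One value per observation: each month index repeated Frequency times.
expanded <- rep(df$idx, times = df$Frequency)

# A single number for breaks is only a suggestion; hist() picks "pretty" break points.
h <- hist(expanded, breaks = 15, plot = FALSE)
plot(h, col = "blue", main = "About 15 bins", xlab = "Month index")
```

Because breaks = 15 is treated as a suggestion, pass an explicit vector of break points if you need exactly 15 bins.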

Andrie answered Oct 23 '22 15:10


Take a look at ggplot2.

If your data is in a data.frame called df:

library(ggplot2)
ggplot(df,aes(x=Month,y=Frequency))+geom_bar(stat='identity')

or if you want continuous time:

df$Month<-as.POSIXct(paste(df$Month, '01', sep='-'),format='%Y-%m-%d')
ggplot(df,aes(x=Month,y=Frequency))+geom_bar(stat='identity')

Justin answered Oct 23 '22 14:10


Yes, rep-based solutions waste a lot of memory in most interesting or large cases, because they store one element per observation. The HistogramTools CRAN package includes an efficient PreBinnedHistogram function, which creates a base R histogram object directly from breaks and bin counts such as those in the original question.
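To illustrate the idea without replicating any data, here is a rough base R sketch that sums the frequencies into a chosen number of bins and assembles a "histogram" object by hand. It is not the HistogramTools implementation; the names df, idx, n_bins and the equal-width breaks are assumptions made for this example.

```r
# Sample rows from the question; the real data has over 200 months.
df <- data.frame(
  Month = c("2007-08", "2010-11", "2011-01", "2011-02",
            "2011-03", "2011-04", "2011-05"),
  Frequency = c(2, 5, 43, 52, 31, 64, 73)
)

# Numeric month index so that omitted months leave a gap on the axis.
df$idx <- as.integer(substr(df$Month, 1, 4)) * 12 +
  (as.integer(substr(df$Month, 6, 7)) - 1)

n_bins <- 15  # hypothetical choice of X
breaks <- seq(min(df$idx), max(df$idx) + 1, length.out = n_bins + 1)

# Assign each month to a bin [a, b) and sum the frequencies per bin.
bin <- cut(df$idx, breaks = breaks, right = FALSE, include.lowest = TRUE)
counts <- as.vector(tapply(df$Frequency, bin, sum))
counts[is.na(counts)] <- 0  # bins containing no months

# Assemble the components that plot() expects on a "histogram" object.
widths <- diff(breaks)
h <- structure(
  list(
    breaks   = breaks,
    counts   = counts,
    density  = counts / (sum(counts) * widths),
    mids     = breaks[-1] - widths / 2,
    xname    = "month index",
    equidist = TRUE
  ),
  class = "histogram"
)
plot(h, col = "blue", main = "15 bins from pre-aggregated counts")
```

Memory use here scales with the number of months and bins rather than with the total count, which is the advantage over rep.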

MurrayStokely answered Oct 23 '22 15:10