I'm having issues with understanding why the handling of dates, labels and breaks is not working as I would have expected in R when trying to make a histogram with ggplot2.
I'm looking for:
%Y-b
formatI've uploaded my data to pastebin to make this reproducible. I've created several columns as I wasn't sure the best way to do this:
> dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
> head(dates)
YM Date Year Month
1 2008-Apr 2008-04-01 2008 4
2 2009-Apr 2009-04-01 2009 4
3 2009-Apr 2009-04-01 2009 4
4 2009-Apr 2009-04-01 2009 4
5 2009-Apr 2009-04-01 2009 4
6 2009-Apr 2009-04-01 2009 4
Here's what I tried:
library(ggplot2)
library(scales)
dates$converted <- as.Date(dates$Date, format="%Y-%m-%d")
ggplot(dates, aes(x=converted)) + geom_histogram()
+ opts(axis.text.x = theme_text(angle=90))
Which yields this graph. I wanted %Y-%b
formatting, though, so I hunted around and tried the following, based on this SO:
ggplot(dates, aes(x=converted)) + geom_histogram()
+ scale_x_date(labels=date_format("%Y-%b"),
+ breaks = "1 month")
+ opts(axis.text.x = theme_text(angle=90))
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
That gives me this graph
I worked through the example in the ggplot2 documentation at the scale_x_date
section and geom_line()
appears to break, label, and center ticks correctly when I use it with my same x-axis data. I don't understand why the histogram is different.
I initially thought gauden's answer helped me solve my problem, but am now puzzled after looking more closely. Note the differences between the two answers' resulting graphs after the code.
Assume for both:
library(ggplot2)
library(scales)
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
Based on @edgester's answer below, I was able to do the following:
freqs <- aggregate(dates$Date, by=list(dates$Date), FUN=length)
freqs$names <- as.Date(freqs$Group.1, format="%Y-%m-%d")
ggplot(freqs, aes(x=names, y=x)) + geom_bar(stat="identity") +
scale_x_date(breaks="1 month", labels=date_format("%Y-%b"),
limits=c(as.Date("2008-04-30"),as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + opts(axis.text.x = theme_text(angle=90))
Here is my attempt based on gauden's answer:
dates$Date <- as.Date(dates$Date)
ggplot(dates, aes(x=Date)) + geom_histogram(binwidth=30, colour="white") +
scale_x_date(labels = date_format("%Y-%b"),
breaks = seq(min(dates$Date)-5, max(dates$Date)+5, 30),
limits = c(as.Date("2008-05-01"), as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + opts(axis.text.x = theme_text(angle=90))
Plot based on edgester's approach:
Plot based on gauden's approach:
Note the following:
table(dates$Date)
reveals that there are 19 instances of 2009-12-01
and 26 instances of 2010-03-01
in the dataAny thoughts on the differences here? edgester's method of creating a separate count
As an aside, here are other locations that have information about dates and ggplot2 for passers-by looking for help:
format=
option did not work for me.Date
vector as continuous and don't think it worked so well. It looked like it was overlaying the same label text over and over so the letters looked kind of odd. The distribution is sort of correct but there are odd breaks. My attempt based on the accepted answer was like so (result here).You can also make histograms by using ggplot2 , “a plotting system for R, based on the grammar of graphics” that was created by Hadley Wickham. This post will focus on making a Histogram With ggplot2.
Basic histogram with geom_histogram It is relatively straightforward to build a histogram with ggplot2 thanks to the geom_histogram() function. Only one numeric variable is needed in the input.
To create a Relative Frequency Histogram in the R Language, we use the histogram() function of the lattice package library. The histogram() function takes the data vector as an argument and returns a relative frequency histogram.
UPDATE
I update the example to demonstrate aligning the labels and setting limits on the plot. I also demonstrate that as.Date
does indeed work when used consistently (actually it is probably a better fit for your data than my earlier example).
And here is (somewhat excessively) commented code:
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.Date(dates$Date)
# convert the Date to its numeric equivalent
# Note that Dates are stored as number of days internally,
# hence it is easy to convert back and forth mentally
dates$num <- as.numeric(dates$Date)
bin <- 60 # used for aggregating the data and aligning the labels
p <- ggplot(dates, aes(num, ..count..))
p <- p + geom_histogram(binwidth = bin, colour="white")
# The numeric data is treated as a date,
# breaks are set to an interval equal to the binwidth,
# and a set of labels is generated and adjusted in order to align with bars
p <- p + scale_x_date(breaks = seq(min(dates$num)-20, # change -20 term to taste
max(dates$num),
bin),
labels = date_format("%Y-%b"),
limits = c(as.Date("2009-01-01"),
as.Date("2011-12-01")))
# from here, format at ease
p <- p + theme_bw() + xlab(NULL) + opts(axis.text.x = theme_text(angle=45,
hjust = 1,
vjust = 1))
p
I try a solution that does everything in ggplot2
, drawing without the aggregation, and setting the limits on the x-axis between the beginning of 2009 and the end of 2011.
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.POSIXct(dates$Date)
p <- ggplot(dates, aes(Date, ..count..)) +
geom_histogram() +
theme_bw() + xlab(NULL) +
scale_x_datetime(breaks = date_breaks("3 months"),
labels = date_format("%Y-%b"),
limits = c(as.POSIXct("2009-01-01"),
as.POSIXct("2011-12-01")) )
p
Of course, it could do with playing with the label options on the axis, but this is to round off the plotting with a clean short routine in the plotting package.
I know this is an old question, but for anybody coming to this in 2021 (or later), this can be done much easier using the breaks=
argument for geom_histogram()
and creating a little shortcut function to make the required sequence.
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- lubridate::ymd(dates$Date)
by_month <- function(x,n=1){
seq(min(x,na.rm=T),max(x,na.rm=T),by=paste0(n," months"))
}
ggplot(dates,aes(Date)) +
geom_histogram(breaks = by_month(dates$Date)) +
scale_x_date(labels = scales::date_format("%Y-%b"),
breaks = by_month(dates$Date,2)) +
theme(axis.text.x = element_text(angle=90))
I think the key thing is that you need to do the frequency calculation outside of ggplot. Use aggregate() with geom_bar(stat="identity") to get a histogram without the reordered factors. Here is some example code:
require(ggplot2)
# scales goes with ggplot and adds the needed scale* functions
require(scales)
# need the month() function for the extra plot
require(lubridate)
# original data
#df<-read.csv("http://pastebin.com/download.php?i=sDzXKFxJ", header=TRUE)
# simulated data
years=sample(seq(2008,2012),681,replace=TRUE,prob=c(0.0176211453744493,0.302496328928047,0.323054331864905,0.237885462555066,0.118942731277533))
months=sample(seq(1,12),681,replace=TRUE)
my.dates=as.Date(paste(years,months,01,sep="-"))
df=data.frame(YM=strftime(my.dates, format="%Y-%b"),Date=my.dates,Year=years,Month=months)
# end simulated data creation
# sort the list just to make it pretty. It makes no difference in the final results
df=df[do.call(order, df[c("Date")]), ]
# add a dummy column for clarity in processing
df$Count=1
# compute the frequencies ourselves
freqs=aggregate(Count ~ Year + Month, data=df, FUN=length)
# rebuild the Date column so that ggplot works
freqs$Date=as.Date(paste(freqs$Year,freqs$Month,"01",sep="-"))
# I set the breaks for 2 months to reduce clutter
g<-ggplot(data=freqs,aes(x=Date,y=Count))+ geom_bar(stat="identity") + scale_x_date(labels=date_format("%Y-%b"),breaks="2 months") + theme_bw() + opts(axis.text.x = theme_text(angle=90))
print(g)
# don't overwrite the previous graph
dev.new()
# just for grins, here is a faceted view by year
# Add the Month.name factor to have things work. month() keeps the factor levels in order
freqs$Month.name=month(freqs$Date,label=TRUE, abbr=TRUE)
g2<-ggplot(data=freqs,aes(x=Month.name,y=Count))+ geom_bar(stat="identity") + facet_grid(Year~.) + theme_bw()
print(g2)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With