I am trying to replicate in R a time-series scatterplot I have created in Stata on a subset of data. The scatterplot has the time variable 'date' on the x-axis (mm/dd/yyyy) and the integer variable 'cost' on the y-axis (monetary amount, in USD). The marker labels are of a categorical variable, 'company name'.
The actual dataset is very large, but a sample would look like the following (see R code below), with observations (i.e. rows) indicating transactions (column 1), followed by variables indicating the date of the transaction (column 2), the cost of the transaction (column 3), and the name of the company that initiated the transaction (column 4).
#Sample Data Frame (R Code)
transactionID <- c(1, 2, 3, 4)
date <- as.Date(c("2006-08-06", "2008-07-30", "2009-04-16", "2013-02-05"))
cost <- as.integer(c(1208, 23820, 402, 89943))
company <- c("ACo", "BInc", "CInd", "DOp")
thedata <- data.frame(transactionID, date, cost, company)
The scatterplot I want will have 'date' on the x-axis and 'cost' on the y-axis, with 'company' as the marker labels, and will also have 3 vertical lines with different formatting to mark important events. The steps to produce this in Stata are:
display mdy(9,10,2007)
display mdy(1, 28, 2008)
display mdy(2, 5, 2013)
The three display commands above return the values 17419, 17559, 19394, which are how Stata reads those days internally, and which are embedded in the code below for graphing the scatterplot.
graph twoway scatter cost date if cost <= 3000 , mlabel(company) xline(17419, lpatt(dot) lwidth(thick) lcol(red)) xline(17559, lpatt(dash) lwidth(medthick) lcol(blue)) xline(19394, lpatt(solid) lwidth(thin) lcol(green))
When I tried to replicate this in R, I ran into several problems.
So far I have pieced together the following code. I originally tried the base R functions plot() and text(), but it seemed like it could not be done in base R. So then I tried the ggplot2 package, but I still can't quite figure it out the way I could in Stata:
library(ggplot2)
ggplot(thedata, aes(date, cost)) +
geom_text( label = thedata$company, color="blue", vjust = 0) +
geom_vline( xintercept = as.numeric( thedata$date[
c(I don't know what goes here, or here)]),
linetype="dotted", color="red")
As you can see, I cannot figure out how the xintercept coordinates of geom_vline work (and can't find this in the official help file), specifically when I want them to be dates (particularly dates that may or may not be in the data frame). Nor can I figure out how to change the thickness of the line.
Very nicely done question. If you are still interested in a base solution:
transactionID <- c(1, 2, 3, 4)
date <- as.Date(c("2006-08-06", "2008-07-30", "2009-04-16", "2013-02-05"))
cost <- as.integer(c(1208, 23820, 402, 89943))
company <- c("ACo", "BInc", "CInd", "DOp")
thedata <- data.frame(transactionID, date, cost, company)
par(mar = c(5,7,3,2), tcl = .2, las = 1)
with(thedata,
plot(date, cost, xlab = 'Date', ylab = '', axes = FALSE, main = 'a plot'))
dseq <- seq.Date(as.Date('2006-01-01'), as.Date('2013-01-01'), by = 'year')
axis.Date(1, at = dseq, labels = format(dseq, format = '%Y'))
# axis.Date(1, at = seq.Date(min(date), max(date), by = 'year'))
axis(2, at = pretty(cost),
labels = format(pretty(cost), scientific = FALSE, big.mark = ','))
## add lines at specified dates
abline(v = as.Date(c('2007-09-10','2008-01-28','2012-01-18')), lwd = 1:3,
lty = c('dotted','dashed','solid'), col = c('red','blue','green'))
## add company labels
text(x = date, y = cost, pos = 3, xpd = NA,
labels = ifelse(cost <= 3000, company, ''))
title(ylab = 'Cost', line = 5)
box('plot', bty = 'l')
To address some specific questions:
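I use as.Date. R stores dates similarly to Stata, as a count of days from a fixed origin (1970-01-01 in R, 1960-01-01 in Stata), so Date values can be passed directly to abline: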
abline(v = as.Date(c('2007-09-10','2008-01-28','2012-01-18')), lwd = 1:3,
lty = c('dotted','dashed','solid'), col = c('red','blue','green'))
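As a rough illustration of the date storage point (the variable names here are mine, not from the question), the Stata numbers quoted in the question can be turned back into R dates by supplying Stata's origin, 1960-01-01:

```r
# R counts days from 1970-01-01; Stata's mdy() counts days from 1960-01-01.
event_dates <- as.Date(c("2007-09-10", "2008-01-28"))

# Day counts as R stores them internally
r_days <- as.numeric(event_dates)

# The same dates expressed as Stata day counts
stata_days <- as.numeric(event_dates - as.Date("1960-01-01"))

# Going the other way: the Stata values from the question back to R dates
from_stata <- as.Date(c(17419, 17559, 19394), origin = "1960-01-01")
format(from_stata)
```

The two systems differ by a constant 3653 days, so in R there is no need for a lookup step like display mdy().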
Use format() to control how the axis labels are printed:
format(pretty(cost), scientific = FALSE, big.mark = ',')
# [1] " 0" " 20,000" " 40,000" " 60,000" " 80,000" "100,000"
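Since the costs are in USD, you might also want a currency prefix on the tick labels. One possible sketch is below. The helper name usd is my own invention, not a base R function, and trim = TRUE drops the padding spaces that format() otherwise adds.

```r
# Hypothetical helper (not part of base R): whole-dollar labels with commas
usd <- function(x) {
  paste0("$", format(x, scientific = FALSE, big.mark = ",", trim = TRUE))
}

ticks <- c(0, 20000, 40000)
tick_labels <- usd(ticks)

# These could then be passed to axis(2, at = ticks, labels = tick_labels)
tick_labels
```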
You can of course create some subsets if you are more comfortable with that, but there is usually a way to do it in one line in R:
ifelse(cost <= 3000, company, '')
# [1] "ACo" "" "CInd" ""
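For comparison, here is a minimal sketch of the subset route, assuming the same sample data as in the question. The name small is just an illustrative choice.

```r
# Sample data from the question, rebuilt so this snippet runs on its own
thedata <- data.frame(
  date = as.Date(c("2006-08-06", "2008-07-30", "2009-04-16", "2013-02-05")),
  cost = c(1208L, 23820L, 402L, 89943L),
  company = c("ACo", "BInc", "CInd", "DOp"),
  stringsAsFactors = FALSE
)

# Keep only the rows that should receive a label
small <- thedata[thedata$cost <= 3000, ]

# After a plot() call, these rows could be labelled with:
#   text(small$date, small$cost, labels = small$company, pos = 3, xpd = NA)
small$company
```

Both routes label the same two points. The ifelse() version just avoids creating a second data frame.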
Most of the base plot functions are vectorized, which is why this is so easy. I am not a ggplot wizard, and trying to make very specifically formatted plots like these with it usually gives me a headache. In my experience, ggplot is good for nice, quick-and-dirty graphs; if you want something very specific or publication-ready, base R graphics is the way to go.
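To illustrate the vectorization, here is a small sketch that keeps each event's date and line style together in one data frame and passes the columns straight to abline(). The data frame name events, the empty plotting range, and the particular line widths are my own choices for illustration.

```r
# One row per event: date plus the style of its vertical line
events <- data.frame(
  date = as.Date(c("2007-09-10", "2008-01-28", "2012-01-18")),
  lty  = c("dotted", "dashed", "solid"),
  lwd  = c(3, 2, 1),
  col  = c("red", "blue", "green"),
  stringsAsFactors = FALSE
)

# Empty plot spanning the period of interest (type = "n" draws no points)
plot(as.Date(c("2006-01-01", "2013-12-31")), c(0, 3000),
     type = "n", xlab = "Date", ylab = "Cost")

# abline() recycles each argument, so the i-th line gets the i-th style
abline(v = events$date, lty = events$lty, lwd = events$lwd, col = events$col)
```

Adding or restyling an event then means editing one row rather than three separate vectors.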
So here is a ggplot method which, I think, produces what you are asking.
library(ggplot2)
key.events <- data.frame(date=as.Date(c("2007-09-10","2008-01-28","2012-01-18")))
ggplot(thedata[thedata$cost<=3000,],aes(x=date,y=cost))+
geom_point(shape=1,size=3)+
geom_text(aes(label=company),vjust=-1)+
scale_y_continuous(expand=c(0.2,0.2))+
geom_vline(data=key.events, size=1,
aes(xintercept=as.integer(date),color=factor(date),linetype=factor(date)))+
scale_color_manual(values=c("red","blue","green"))+
scale_linetype_manual(values=c("dotted","dashed","solid"))+
theme_bw()