
From Stata to R: creating a scatterplot with vertical date lines on a subset

Introduction

I am trying to replicate in R a time-series scatterplot I have created in Stata on a subset of data. The scatterplot has the time variable 'date' on the x-axis (mm/dd/yyyy) and the integer variable 'cost' on the y-axis (monetary amount, in USD). The marker labels are of a categorical variable, 'company name'.

The actual dataset is very large, but a sample would look like the following (see R code below), with observations (i.e. rows) indicating transactions (column 1), followed by variables indicating the date of the transaction (column 2), the cost of the transaction (column 3), and the name of the company that initiated the transaction (column 4).

#Sample Data Frame (R Code)

transactionID <- c(1, 2, 3, 4) 
date <- as.Date(c("2006-08-06", "2008-07-30", "2009-04-16", "2013-02-05"))
cost <- as.integer(c(1208, 23820, 402, 89943))
company <- c("ACo", "BInc", "CInd", "DOp")
thedata <- data.frame(transactionID, date, cost, company)

Doing it in Stata

The scatterplot I want will have 'date' on the x-axis and 'cost' on the y-axis, 'company' listed as the marker labels, and will also have 3 vertical lines of various formatting to signify important events. The steps to producing this in Stata are

  1. Identify x-axis points for vertical lines at the dates September 10 2007, January 28 2008, and February 5 2013.

display mdy(9,10,2007)

display mdy(1, 28, 2008)

display mdy(2, 5, 2013)

The three display commands above return the values 17419, 17559, 19394, which are how Stata reads those days internally, and which are embedded in the code below for graphing the scatterplot.

  2. Create the scatterplot, adding in the three vertical lines from Step 1, formatting them as dotted, dashed, and solid lines of red, blue, and green colors and of different thicknesses, with 'cost' on the y-axis, 'date' on the x-axis, and 'company' name as the marker labels, for only those transactions that were less than or equal to $3,000:

graph twoway scatter cost date if cost <= 3000 , mlabel(company) xline(17419, lpatt(dot) lwidth(thick) lcol(red)) xline(17559, lpatt(dash) lwidth(medthick) lcol(blue)) xline(19394, lpatt(solid) lwidth(thin) lcol(green))

Problems Doing it in R

When I've tried to replicate it in R I have encountered the following problems

  1. I cannot figure out how to add vertical lines at those specific dates, nor how to change their thickness.
  2. The y-axis ('cost') labels are in scientific notation (e.g. 2e+05) instead of regular numbers (e.g. 200,000).
  3. I don't quite understand subsetting in R; in Stata I can easily add "if" qualifiers to examine specific data subsets (e.g. "if cost > 3000 & transactionID < 5") and then easily modify them to re-run the analyses or plot the graphs on other various subsets. But in R it seems that there are extra steps involved where you have to subset the data first and store it as a new object and then run the analysis on that object. Is that right? I see some benefits to this, but also some demerits (such as having hundreds of different objects cluttering up your work environment as you explore the data, for instance).

So far I have pieced together the following code. I originally tried to do it with the base R installation commands plot() and text(), but it seems like it cannot be done in base R. So then I tried using the ggplot2 package but still can't quite figure it out like I could in Stata:

library(ggplot2)
ggplot(thedata, aes(date, cost)) + 
       geom_text( label = thedata$company, color="blue", vjust = 0) +  
       geom_vline( xintercept = as.numeric( thedata$date[
                      c(I don't know what goes here, or here)]), 
                  linetype="dotted", color="red")

As you can see, I cannot figure out how the coordinates for the xintercept of the geom_vline command work (and can't find it in the official help file), specifically when I want them to be dates (particularly dates that may or may not be in the data frame), nor can I figure out how to change the thickness of the line.

asked Dec 17 '14 18:12 by coip


2 Answers

Very nicely done question. If you are still interested in a base solution:

transactionID <- c(1, 2, 3, 4)
date <- as.Date(c("2006-08-06", "2008-07-30", "2009-04-16", "2013-02-05"))
cost <- as.integer(c(1208, 23820, 402, 89943))
company <- c("ACo", "BInc", "CInd", "DOp")
thedata <- data.frame(transactionID, date, cost, company)


par(mar = c(5,7,3,2), tcl = .2, las = 1)
with(thedata, 
     plot(date, cost, xlab = 'Date', ylab = '', axes = FALSE, main = 'a plot'))
dseq <- seq.Date(as.Date('2006-01-01'), as.Date('2013-01-01'), by = 'year')
axis.Date(1, at = dseq, labels = format(dseq, format = '%Y'))
# axis.Date(1, at = seq.Date(min(date), max(date), by = 'year'))
axis(2, at = pretty(cost), 
     labels = format(pretty(cost), scientific = FALSE, big.mark = ','))
## add lines at specified dates
abline(v = as.Date(c('2007-09-10','2008-01-28','2012-01-18')), lwd = 1:3,
       lty = c('dotted','dashed','solid'), col = c('red','blue','green'))
## add company labels
text(x = date, y = cost, pos = 3, xpd = NA,
     labels = ifelse(cost <= 3000, company, ''))
title(ylab = 'Cost', line = 5)
box('plot', bty = 'l')


To address some specific questions:

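  1. I use as.Date. R stores dates much as Stata does, as a count of days from a fixed origin (1970-01-01 in R, 1960-01-01 in Stata), so the dates can be passed straight to abline: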

    abline(v = as.Date(c('2007-09-10','2008-01-28','2012-01-18')), lwd = 1:3,
           lty = c('dotted','dashed','solid'), col = c('red','blue','green'))
    
  2. Use format() to turn off scientific notation and add thousands separators:

    format(pretty(cost), scientific = FALSE, big.mark = ',')
    # [1] "      0" " 20,000" " 40,000" " 60,000" " 80,000" "100,000"
    
  3. You can of course create some subsets if you are more comfortable with that, but there is usually a way to do it in a one-liner in R:

    ifelse(cost <= 3000, company, '')
    # [1] "ACo"  ""     "CInd" ""  
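
As a quick check on point 1, here is a minimal sketch of how the two systems count days. It assumes Stata's origin of 1 January 1960 and R's origin of 1 January 1970, and uses the three dates passed to mdy() in the question; the variable names are illustrative only.

```r
# Dates passed to Stata's mdy() in the question
events <- as.Date(c("2007-09-10", "2008-01-28", "2013-02-05"))

# R stores a Date as the number of days since 1970-01-01
r_days <- as.numeric(events)

# Stata counts days since 1960-01-01, so measure from that origin to compare
stata_days <- as.numeric(events - as.Date("1960-01-01"))

print(r_days)      # 13766 13906 15741
print(stata_days)  # 17419 17559 19394, the values reported in the question
```

Because a Date is just a number underneath, there is no need to look these values up by hand: a Date vector can be given to abline(v = ...) as it is.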
    

Most of the base plot functions are vectorized, which is why this is so easy. I am not a ggplot wizard, and I usually end up with a headache when I try to make very specifically formatted plots like this one with it. In my experience, ggplot is good for nice, quick-and-dirty graphs; if you want something very specific or publication-ready, base R graphics is the way to go.
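On the subsetting question, here is a minimal sketch of Stata-style "if" qualifiers applied inline, so that no intermediate objects pile up in the workspace. It rebuilds the sample data frame from the question; the particular conditions are only illustrative.

```r
# Sample data frame from the question
thedata <- data.frame(
  transactionID = c(1, 2, 3, 4),
  date = as.Date(c("2006-08-06", "2008-07-30", "2009-04-16", "2013-02-05")),
  cost = as.integer(c(1208, 23820, 402, 89943)),
  company = c("ACo", "BInc", "CInd", "DOp")
)

# Condition applied inline, like Stata's "if cost <= 3000"
print(subset(thedata, cost <= 3000))

# Conditions combine with & and |, as in "if cost > 3000 & transactionID < 5"
big <- as.character(thedata[thedata$cost > 3000 & thedata$transactionID < 5, "company"])
print(big)  # "BInc" "DOp"

# The subset can feed straight into another call without being stored first
cheap_mean <- with(subset(thedata, cost <= 3000), mean(cost))
print(cheap_mean)  # 805
```

To re-run on a different subset, only the condition needs editing. The formula interface to plot() also accepts a subset argument, for example plot(cost ~ date, data = thedata, subset = cost <= 3000), which is probably the closest base R equivalent to adding "if" to a Stata graph command.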

answered Sep 23 '22 16:09 by rawr


So here is a ggplot method which, I think, produces what you are asking.

library(ggplot2)
key.events <- data.frame(date=as.Date(c("2007-09-10","2008-01-28","2012-01-18")))
# keep only transactions of $3,000 or less, as the question asks
ggplot(thedata[thedata$cost<=3000,],aes(x=date,y=cost))+ 
  geom_point(shape=1,size=3)+
  geom_text(aes(label=company),vjust=-1)+
  scale_y_continuous(expand=c(0.2,0.2))+
  geom_vline(data=key.events, size=1,
             aes(xintercept=as.integer(date),color=factor(date),linetype=factor(date)))+
  scale_color_manual(values=c("red","blue","green"))+
  scale_linetype_manual(values=c("dotted","dashed","solid"))+
  theme_bw()

answered Sep 25 '22 16:09 by jlhoward