
Scatterplot of Year-On-Year Correlation of Data in R using ggplot2

Tags: plot, r, ggplot2

I have some yearly football data that I would like to test to see if certain team metrics are repeatable in the next year. My data is in a data.frame and looks something like this:

                    y2003    y2004    y2005
Team 1           51.95455 51.00000 53.59091   
Team 2           54.18182 56.31818 49.09091   
Team 3           48.68182 46.86364 49.22727   
Team 4           50.86364 47.68182 48.72727   

What I want to be able to do is scatterplot this with "Year n" on the x-axis and "Year n+1" on the y-axis. So for example 2003 vs. 2004, 2004 vs. 2005, 2005 vs. 2006 etc. all on the same plot.

I would then like to draw a line of best fit to see how strong the correlation is, i.e. whether the metric is repeatable from one year to the next.

What is the best way to do this in R with ggplot2? I can get the initial plot with:

 library(ggplot2)
 p <- ggplot(df, aes(x = y2003, y = y2004))
 p + geom_point()

Do I then have to add the other year pairs manually, or is there a built-in function for this sort of thing? And if I add them one by one, how do I get a single line of best fit?

Boris Cocker asked Jun 04 '15 13:06



1 Answer

You want a data frame with a row for each team-year combination, containing the data for that year and the next year as well as the team name. You can actually get this without any split-apply-combine manipulation using base R functions:

(to.plot <- data.frame(yearN=unlist(df[-ncol(df)]),
                       yearNp1=unlist(df[-1]),
                       team=rep(row.names(df), ncol(df)-1)))
#           yearN  yearNp1  team
# y20031 51.95455 51.00000 Team1
# y20032 54.18182 56.31818 Team2
# y20033 48.68182 46.86364 Team3
# y20034 50.86364 47.68182 Team4
# y20041 51.00000 53.59091 Team1
# y20042 56.31818 49.09091 Team2
# y20043 46.86364 49.22727 Team3
# y20044 47.68182 48.72727 Team4

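This code flattens all but the last column of df into a single vector (using unlist) and stores it in the column yearN. The matching next-year values are obtained the same way from all but the first column of df and stored in yearNp1. Because unlist works column by column, the two vectors line up: row i of yearN and row i of yearNp1 belong to the same team in consecutive years. Finally, the team name is the row names of df repeated once per year pair. Nothing here depends on there being exactly three years, so further columns (y2006 and so on) are picked up automatically.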
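To try this without the original data, here is a minimal sketch that rebuilds the four sample rows from the question and applies the same reshaping. The row names are written without spaces to match the output above, and the extra column fromYear is a hypothetical addition (not part of the answer's code) that records which year each yearN value came from, which makes the alignment easy to check.

```r
# Sample data copied from the question
df <- data.frame(
  y2003 = c(51.95455, 54.18182, 48.68182, 50.86364),
  y2004 = c(51.00000, 56.31818, 46.86364, 47.68182),
  y2005 = c(53.59091, 49.09091, 49.22727, 48.72727),
  row.names = paste0("Team", 1:4)
)

n_years <- ncol(df)

to.plot <- data.frame(
  yearN    = unlist(df[-n_years]),                     # every column except the last
  yearNp1  = unlist(df[-1]),                           # every column except the first
  team     = rep(row.names(df), n_years - 1),          # team names, once per year pair
  fromYear = rep(names(df)[-n_years], each = nrow(df)) # hypothetical helper column
)

# Each team contributes one row per consecutive pair of years
stopifnot(nrow(to.plot) == nrow(df) * (n_years - 1))

# Spot check: Team1's 2004 value is "year n+1" in the first row
# and "year n" in the fifth row
stopifnot(to.plot$yearNp1[1] == df["Team1", "y2004"],
          to.plot$yearN[5]   == df["Team1", "y2004"])
```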

The line of best fit comes from a simple linear regression of yearNp1 on yearN:

(coefs <- coef(lm(yearNp1~yearN, data=to.plot)))
# (Intercept)       yearN 
#  28.3611927   0.4308978 
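Since the question also asks how strong the relationship is, the slope alone does not answer that. A rough sketch of one way to quantify it, using the Pearson correlation and the R-squared of the same model, follows; the to.plot values are the eight sample rows shown above, typed in directly so the snippet runs on its own. With only eight points, any such number should be treated with caution.

```r
# The eight (year n, year n+1) pairs from the sample data above
to.plot <- data.frame(
  yearN   = c(51.95455, 54.18182, 48.68182, 50.86364,
              51.00000, 56.31818, 46.86364, 47.68182),
  yearNp1 = c(51.00000, 56.31818, 46.86364, 47.68182,
              53.59091, 49.09091, 49.22727, 48.72727)
)

fit <- lm(yearNp1 ~ yearN, data = to.plot)

# Pearson correlation between a year and the following year
r <- cor(to.plot$yearN, to.plot$yearNp1)

# Share of variance in year n+1 explained by year n
r_squared <- summary(fit)$r.squared

# With a single predictor, R-squared is the squared correlation
stopifnot(isTRUE(all.equal(r^2, r_squared)))

# cor.test adds a confidence interval and p-value for the correlation
print(cor.test(to.plot$yearN, to.plot$yearNp1))
```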

Now ggplot can be used as usual for plotting:

library(ggplot2)
ggplot(to.plot, aes(x=yearN, y=yearNp1, col=team)) + geom_point() +
  geom_abline(intercept=coefs[1], slope=coefs[2])
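As an alternative sketch, ggplot2 can fit and draw the regression line itself with geom_smooth(method = "lm"), so the coefficients do not have to be computed separately. One assumption to watch: the colour mapping is placed on geom_point only. If col=team stays in the top-level aes(), geom_smooth fits a separate line per team rather than one overall line. The data are again the eight sample rows, typed in so the snippet is self-contained.

```r
library(ggplot2)

# The eight sample rows from above
to.plot <- data.frame(
  yearN   = c(51.95455, 54.18182, 48.68182, 50.86364,
              51.00000, 56.31818, 46.86364, 47.68182),
  yearNp1 = c(51.00000, 56.31818, 46.86364, 47.68182,
              53.59091, 49.09091, 49.22727, 48.72727),
  team    = rep(paste0("Team", 1:4), 2)
)

p <- ggplot(to.plot, aes(x = yearN, y = yearNp1)) +
  geom_point(aes(colour = team)) +          # colour the points by team only
  geom_smooth(method = "lm", formula = y ~ x,
              se = FALSE, colour = "black") + # one least-squares line for all points
  labs(x = "Year n", y = "Year n+1")

print(p)
```

Setting se = TRUE instead would add a confidence band around the line, which is a quick visual cue for how uncertain the fit is.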


josliber answered Oct 03 '22 21:10
