
connect points across selected NAs with geom_line()

Tags:

r

ggplot2

My question is closely related to Connecting across missing values with geom_line, but it's a follow-up rather than a duplicate.

I have data with missing values (NA). The data has been 'melted' into long form with the reshape2 package, and I am using ggplot2 to plot both geom_point() and geom_line(). The example data has one group only; my real data has several. I would like geom_line() to connect data points that are separated by fewer than 4 years of missing data. In other words, if there are at most 3 adjacent rows with NA, drop those rows (as na.rm would) so that the line bridges the gap, while if there are at least 4 adjacent rows with NA, keep them so that the line breaks.

Edit: Note: I am replicating figures from a book, where the points are connected even when the data is missing. It would be better to use a different linetype or colour for those segments connecting missing data, together with a note in the legend explaining it.

In the following, I have a very tedious and ugly hack that will not scale up to manipulating large amounts of data. I'd be grateful for a simpler approach and particularly keen to find a simple way to count instances of consecutive NAs in the data.

### ggplot draws geom_line with NAs

# Data (real-world example, so not exactly MWE)
df <- 
structure(list(Year = c(1910, 1911, 1912, 1913, 1914, 1915, 1916, 
1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 
1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 
1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 
1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 
1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 
1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 
1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 
2005, 2006, 2007, 2008, 2009, 2010), variable = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L), .Label = c("France", "Germany", "Sweden", "Japan"
), class = c("ordered", "factor")), value = c(0.1724, 0.1748, 
0.1752, 0.1777, 0.1778, 0.1953, 0.2132, 0.2242, 0.222, 0.1947, 
NA, NA, NA, NA, NA, 0.113, 0.113, 0.115, 0.112, 0.111, NA, NA, 
0.114, 0.109, 0.113, 0.12, 0.137, 0.15, 0.163, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, 0.116, NA, NA, NA, NA, NA, NA, 0.11, 
NA, NA, NA, 0.122, NA, NA, NA, 0.122, NA, NA, 0.112, NA, NA, 
0.113, NA, NA, 0.101, NA, NA, 0.102, NA, NA, 0.1043, NA, NA, 
0.0906, NA, NA, 0.0964, NA, NA, 0.1052, NA, NA, 0.1043, NA, NA, 
0.1005, NA, NA, 0.1088, NA, NA, 0.101139312657167, 0.0950290025146689, 
0.0901042749371333, 0.09, 0.107249622799665, 0.108891198658843, 
0.115913495389774, 0.110684772282761, 0.113299133836267, 0.111991953059514
)), .Names = c("Year", "variable", "value"), row.names = 102:202, class = "data.frame")

The default plot:

library("ggplot2")
ggplot(data = df, aes(x = Year, y = value, group = variable, colour = variable, shape = variable)) + 
    geom_point(size = 3) + geom_line()


The plot with all NAs removed (see Connecting across missing values with geom_line):

ggplot(data = df, aes(x = Year, y = value, group = variable, colour = variable, shape = variable)) + 
    geom_point(size = 3) + geom_line(data = df[!is.na(df$value), ])


The desired plot:

df2 <- df
# Hack: put an out-of-range sentinel inside each long gap so the row survives NA removal
df2$value[df2$Year == 1922] <- -999999
df2$value[df2$Year == 1948] <- -999999
df2 <- df2[!is.na(df2$value), ]
# The y limits censor the sentinels back to NA, which breaks the line at those years
ggplot(data = df2, aes(x = Year, y = value, group = variable, colour = variable, shape = variable)) + geom_point(size = 3) + 
    geom_line() + scale_y_continuous(limits = c(.08, .23))


PatrickT asked Dec 28 '14 12:12


1 Answer

This produces your "desired plot", with the exception noted in the comment.

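# Run-length encode the "value is present" indicator (TRUE = present, FALSE = NA)
x <- rle(!is.na(df$value))
# Relabel NA runs longer than 3 as TRUE so those rows are kept and the line breaks there
x$values[which(x$lengths>3 & !x$values)] <- TRUE
# Expand back to one logical per row: TRUE = keep the row
indx <- inverse.rle(x)
library(ggplot2)
ggplot(df[indx,],aes(x=Year,y=value,color=variable))+
  geom_point(size=3)+
  geom_line()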

Basically, we encode NA as FALSE, and everything else as TRUE, then perform run length encoding to identify sequences of T/F. Any sequence of FALSE of length > 3 should be kept, so we convert those to TRUE (as if they were not NA), then we use inverse rle to recover an index vector with TRUE if the row should be kept. Finally, we apply this to the df for use in ggplot.
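
Since the question mentions several groups in the real data, here is a minimal sketch of how the same idea might be wrapped in a function and applied group by group, so that an NA run at the end of one group is not merged with an NA run at the start of the next. The helper name keep_rows, its max_gap argument, and the toy data frame are all made up for illustration, and the sketch assumes rows are already sorted by Year within each group.

```r
# Hypothetical helper (not part of the original answer): TRUE marks rows to keep.
# NA runs of length <= max_gap are dropped so geom_line() bridges them;
# longer NA runs are kept so the line breaks there.
keep_rows <- function(value, max_gap = 3) {
  runs <- rle(!is.na(value))
  runs$values[!runs$values & runs$lengths > max_gap] <- TRUE
  inverse.rle(runs)
}

# Toy vector: a gap of 2 NAs, then a gap of 4 NAs
v <- c(1, NA, NA, 2, NA, NA, NA, NA, 3)
keep_rows(v)  # FALSE only at positions 2 and 3 (the short gap)

# Toy long-form data with two groups (assumed sorted by Year within group)
toy <- data.frame(
  Year     = rep(2000:2008, times = 2),
  variable = rep(c("A", "B"), each = 9),
  value    = c(v, rev(v))
)

# ave() applies keep_rows() within each group; it returns 0/1 because the
# input column is numeric, so convert back to logical
toy$keep <- as.logical(ave(toy$value, toy$variable, FUN = keep_rows))
```

Passing toy[toy$keep, ] to ggplot() in place of df[indx, ] should then give one line per group, each broken only at its own long gaps.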

jlhoward answered Oct 03 '22 00:10