Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Diagonals to rows in data.frame

I have a matrix-like data frame with an additional column denoting time. It contains information on the number of enrolled students in a given school, from grade 5 (column A) to grade 9 (column E).

  time    A    B    C    D    E
1   13 1842 1844 1689 1776 1716
2   14 1898 1785 1807 1617 1679
3   15 2065 1865 1748 1731 1590
4   16 2215 1994 1811 1708 1703
5   17 2174 2122 1903 1765 1699

I need to trace the size of the cohort over time, meaning that I need row-wise information on how many fifth graders from each starting year remained in the school from grades 6 through 9. For example, for the cohort that has begun fifth grade in 2013, I want information on how many remained in sixth grade in 2014, and so on.

Expected output

This is what I would like to end up with:

  start.time point.A point.B point.C point.D point.E
1         13    1842    1785    1748    1708    1699
2         14    1898    1865    1811    1765      NA
3         15    2065    1811    1765      NA      NA
4         16    2215    1765      NA      NA      NA
5         17    2174      NA      NA      NA      NA

I have looked at diag() from base.R, but I could only get the the data from the main diagonal. Ideally, I'd like to accomplish this using dplyr syntax and the pipe.

Data

structure(list(time = 13:17, A = c(1842, 1898, 2065, 2215, 2174), B = c(1844, 1785, 1865, 1994, 2122), C = c(1689, 1807, 1748, 1811, 1903), D = c(1776, 1617, 1731, 1708, 1765), E = c(1716, 1679, 1590, 1703, 1699)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), vars = "time", drop = TRUE, indices = list(
0L, 1L, 2L, 3L, 4L), group_sizes = c(1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L, labels = structure(list(
time = 13:17), class = "data.frame", row.names = c(NA, -5L), vars = "time", drop = TRUE, .Names = "time"), .Names = c("time", "A", "B", "C", "D", "E"))
like image 764
tifu Avatar asked Jan 03 '23 20:01

tifu


2 Answers

Convert the input DF except for the first column to a matrix mat. Then since row(mat) - col(mat) is constant on diagonals split with respect to that creating a list of ts class series in L. We used ts class since we can later cbind them even if they are of different lengths. The diagonals for which row(mat) - col(mat) >= 0 are the only ones we want so pick off those, cbind them together and transpose the result. Then replace all columns in DF except the first with that. No packages are used.

mat <- as.matrix(DF[-1])
L <- lapply(split(mat, row(mat) - col(mat)), ts)
replace(DF, -1, t(do.call("cbind", L[as.numeric(names(L)) >= 0])))

giving:

  time    A    B    C    D    E
1   13 1842 1785 1748 1708 1699
2   14 1898 1865 1811 1765   NA
3   15 2065 1994 1903   NA   NA
4   16 2215 2122   NA   NA   NA
5   17 2174   NA   NA   NA   NA
like image 59
G. Grothendieck Avatar answered Jan 05 '23 14:01

G. Grothendieck


Since you mentioned dplyr in your question, you could use dplyr::lead to shift the values of columns B to E by 1, 2 etc. respectively, and then bind the result with columns time and A from your original data as follows

library(tidyverse)
bind_cols(df[, 1:2], map2_df(.x = df[, c(3:ncol(df))],
                             .y = seq_along(df[, 3:ncol(df)]), 
                             .f = ~dplyr::lead(x = .x, n = .y)))
#  A tibble: 5 x 6
#  Groups:   time [5]
#   time     A     B     C     D     E
#  <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1    13  1842  1785  1748  1708  1699
#2    14  1898  1865  1811  1765    NA
#3    15  2065  1994  1903    NA    NA
#4    16  2215  2122    NA    NA    NA
#5    17  2174    NA    NA    NA    NA

Note that your data is grouped by time the way you provided it.

like image 26
markus Avatar answered Jan 05 '23 16:01

markus