
R data.table with variable number of columns

Tags: r, data.table

For each student in a data set, a certain set of scores may have been collected. We want to calculate the mean for each student, but using only the scores in the columns that were germane to that student.

The columns required in a calculation are different for each row. I've figured out how to write this in R using the usual tools, but I am trying to rewrite it with data.table, partly for fun, and partly because success in this small project might lead to the need to make calculations for lots and lots of rows.

Here is a small working example of the "choose a specific column set for each row" problem.

set.seed(123234)
## Suppose these are 10 students in various grades
dat <- data.frame(id = 1:10, grade = rep(3:7, times = 2),
              A = sample(c(1:5, 9), 10,  replace = TRUE),
              B = sample(c(1:5, 9), 10, replace = TRUE),
              C = sample(c(1:5, 9), 10, replace = TRUE),
              D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises

## Students in various grades are administered different
## tests.  A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
           "4" = c("A", "C"),
           "5" = c("B", "C", "D"),
           "6" = c("A", "B", "C", "D"),
           "7" = c("C", "D"),
           "8" = c("C"))

## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
    lookup[[as.character(grade)]]
}


## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
    scores <- arow[getLookup(arow["grade"])]
    mean(scores[scores != 9], na.rm = TRUE)
}

stuscores <- apply(dat, 1, function(x) getMean(x, lookup))

result <- data.frame(dat, stuscores)
result

## If the data is hundreds of thousands of rows,
## I will wish I could use data.table to do that.

## Client will want students sorted by state, district, classroom,
## etc.

## However, am stumped on how to specify the adjustable
## column-name chooser

library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)

The desired output is the student average for the appropriate columns, like so:

> result
  id grade A B C D stuscores
1   1     3 9 9 1 4       NaN
2   2     4 5 4 1 5       3.0
3   3     5 1 3 5 9       4.0
4   4     6 5 2 4 5       4.0
5   5     7 9 1 1 3       2.0
6   6     3 3 3 4 3       3.0
7   7     4 9 2 9 2       NaN
8   8     5 3 9 2 9       2.0
9   9     6 2 3 2 5       3.0
10 10     7 3 2 4 1       2.5

Then what? I've written a lot of mistakes so far...

I did not find any examples in the data.table documentation in which the columns to be used in the calculation for each row were themselves variable. Thank you for your advice.

I am not asking anybody to write code for me; I'm asking for advice on how to get started with this problem.

asked Mar 15 '23 08:03 by pauljohn32


2 Answers

First of all, when a reproducible example relies on random functions such as sample (which return different values each time you run them), you should call set.seed first so that everyone gets the same data.
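As a minimal illustration of that point (the seed value 42 and the variable names are arbitrary choices for this sketch), resetting the seed before each call makes sample return the same draws:

```r
## Same seed, same draws: the two vectors are identical
set.seed(42)
first_draw <- sample(c(1:5, 9), 10, replace = TRUE)

set.seed(42)
second_draw <- sample(c(1:5, 9), 10, replace = TRUE)

identical(first_draw, second_draw)
# [1] TRUE
```

The actual numbers drawn for a given seed can differ between R versions, so it also helps to paste the generated data alongside the code.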

Second, instead of looping over each row, you can loop over the lookup list, which will always be smaller than the data (often significantly smaller), and combine it with rowMeans. You can also do this in base R, but you asked for a data.table solution, so here goes. (For the purposes of this solution I've converted all 9s to NAs, but you can generalize this to your specific case.)
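For reference, the base R variant of the same idea might look like the sketch below. The `toy` data and `toy_lookup` list are hypothetical stand-ins made up for this illustration (they are not the question's data), and 9 is treated as the missing-value marker:

```r
## Hypothetical toy data: 4 students, 9 marks a missing score
toy <- data.frame(id = 1:4, grade = c(3L, 4L, 3L, 4L),
                  A = c(1, 5, 9, 9),
                  B = c(3, 4, 2, 1),
                  C = c(2, 1, 4, 9))
toy_lookup <- list("3" = c("A", "B"), "4" = c("A", "C"))

## Work on a numeric matrix of the score columns only,
## so that id and grade are never touched by the 9 -> NA step
score_cols <- c("A", "B", "C")
scores <- as.matrix(toy[score_cols])
scores[scores == 9] <- NA

## One pass per grade (not per student)
toy$res <- NA_real_
for (g in names(toy_lookup)) {
  rows <- toy$grade == as.integer(g)
  toy$res[rows] <- rowMeans(scores[rows, toy_lookup[[g]], drop = FALSE],
                            na.rm = TRUE)
}
toy$res
## students 1 to 3 get 2, 3 and 2; student 4 has no valid score and gets NaN
```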

So using set.seed(123), your function gives

apply(dat, 1, function(x) getMean(x, lookup))
# [1] 2.000000 5.000000 4.666667 4.500000 2.500000 1.000000 4.000000 2.333333 2.500000 1.500000

And here's a possible data.table application that loops only over the lookup list (for loops over lists are very efficient in R, btw):

## convert the 9 markers to NA, but only in the score columns
## (doing this on the whole data set would also turn id 9 into NA)
score_cols <- c("A", "B", "C", "D")
dat[score_cols][dat[score_cols] == 9L] <- NA
## convert your original data to a `data.table` by reference;
## no additional copy is needed, which matters if the data is huge
setDT(dat)
## loop only over the list
for(i in names(lookup)) {
  dat[grade == i, res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
}
dat
#     id grade  A  B  C  D      res
#  1:  1     3  2 NA NA NA 2.000000
#  2:  2     4  5  3  5 NA 5.000000
#  3:  3     5  3  5  4  5 4.666667
#  4:  4     6 NA  4 NA  5 4.500000
#  5:  5     7 NA  1  4  1 2.500000
#  6:  6     3  1 NA  5  3 1.000000
#  7:  7     4  4  2  4  5 4.000000
#  8:  8     5 NA  1  4  2 2.333333
#  9:  9     6  4  2  2  2 2.500000
# 10: 10     7  3 NA  1  2 1.500000

Possibly, this could be improved utilizing set, but I can't think of a good way currently.
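One way a set-based version could look is sketched below. This is an untested-for-speed illustration, not a benchmarked improvement: the `toy` table and `toy_lookup` list are hypothetical, missing scores are assumed to be NA already, and the target column is created up front so that set only has to fill in rows.

```r
library(data.table)

## Hypothetical toy data with missing scores already stored as NA
toy <- data.table(id = 1:4, grade = c(3L, 4L, 3L, 4L),
                  A = c(1, 5, NA, NA),
                  B = c(3, 4, 2, 1),
                  C = c(2, 1, 4, NA))
toy_lookup <- list("3" = c("A", "B"), "4" = c("A", "C"), "8" = "C")

toy[, res := NA_real_]               # create the result column once
for (g in names(toy_lookup)) {
  rows <- which(toy$grade == as.integer(g))
  if (length(rows) == 0L) next       # no students in this grade
  vals <- rowMeans(as.matrix(toy[rows, toy_lookup[[g]], with = FALSE]),
                   na.rm = TRUE)
  set(toy, i = rows, j = "res", value = vals)  # assign by reference
}
toy$res
## students 1 to 3 get 2, 3 and 2; student 4 has no valid score and gets NaN
```

Because set skips the overhead of `[.data.table`, it tends to pay off mainly when the loop runs many times; with only a handful of grades the difference may be negligible.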


P.S.

As suggested by @Arun, please take a look at the vignettes he himself wrote here in order to get familiar with the := operator, .SD, with = FALSE, etc.

answered Mar 17 '23 21:03 by David Arenburg


Here's another data.table approach using melt.data.table (needs data.table 1.9.5+) and then joins between data.tables:

DT_m <- setkey(melt.data.table(DT, c("id", "grade"), value.name = "score"), grade, variable)
lookup_dt <- data.table(grade = rep(as.integer(names(lookup)), lengths(lookup)),
  variable = unlist(lookup), key = "grade,variable")
score_summary <- setkey(DT_m[lookup_dt, nomatch = 0L,
  .(res = mean(score[score != 9], na.rm = TRUE)), by = id], id)
setkey(DT, id)[score_summary, res := i.res]
#    id grade A B C D        res
# 1:  1     3 9 9 1 4        NaN
# 2:  2     4 5 4 1 5        3.0
# 3:  3     5 1 3 5 9        4.0
# 4:  4     6 5 2 4 5        4.0
# 5:  5     7 9 1 1 3        2.0
# 6:  6     3 3 3 4 3        3.0
# 7:  7     4 9 2 9 2        NaN
# 8:  8     5 3 9 2 9        2.0
# 9:  9     6 2 3 2 5        3.0
#10: 10     7 3 2 4 1        2.5
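To make the reshape-and-join idea easier to follow, here is a self-contained sketch on a tiny hypothetical data set. It uses the melt generic and merge instead of keyed joins, and all `toy_*` names are invented for this illustration:

```r
library(data.table)

## Hypothetical toy data: 4 students, 9 marks a missing score
toy <- data.table(id = 1:4, grade = c(3L, 4L, 3L, 4L),
                  A = c(1, 5, 9, 9),
                  B = c(3, 4, 2, 1),
                  C = c(2, 1, 4, 9))
toy_lookup <- list("3" = c("A", "B"), "4" = c("A", "C"))

## Long format: one row per (student, test) pair
toy_long <- melt(toy, id.vars = c("id", "grade"),
                 variable.name = "variable", value.name = "score",
                 variable.factor = FALSE)

## The lookup list as a table: one row per (grade, test) pair
toy_lookup_dt <- data.table(
  grade    = rep(as.integer(names(toy_lookup)), lengths(toy_lookup)),
  variable = unlist(toy_lookup, use.names = FALSE))

## Inner join keeps only the tests that apply to each student's grade
relevant <- merge(toy_long, toy_lookup_dt, by = c("grade", "variable"))

## Mean per student, ignoring the 9 markers and any NA
toy_means <- relevant[, .(res = mean(score[score != 9], na.rm = TRUE)),
                      by = id][order(id)]
toy_means$res
## students 1 to 3 get 2, 3 and 2; student 4 has no valid score and gets NaN
```

The join replaces the per-grade loop entirely: the row filter "which tests count for this student" becomes a match on (grade, variable).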

It's more verbose, but just over twice as fast:

microbenchmark(da_method(), nk_method(), times = 1000)
#Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
# da_method() 17.465893 17.845689 19.249615 18.079206 18.337346 181.76369  1000
# nk_method()  7.047405  7.282276  7.757005  7.489351  7.667614  20.30658  1000
answered Mar 17 '23 21:03 by Nick Kennedy