
Divide all rows by a reference row, by group

Tags: r

Here is a sample table I'm working with:

n = c(rep("A",3),rep("B",3),rep("C",3))
m = c("X", "Y", "Z", "X", "Y", "Z", "X", "Y", "Z")
s = 1:9 
b = 5:13
c = 20:28
d = c(rep("abc", 9))
df = data.frame(d, n, m, s, b, c) 
df

Below is what the table looks like:

d   n   m   s   b   c
abc A   X   1   5   20
abc A   Y   2   6   21
abc A   Z   3   7   22
abc B   X   4   8   23
abc B   Y   5   9   24
abc B   Z   6   10  25
abc C   X   7   11  26
abc C   Y   8   12  27
abc C   Z   9   13  28

I'll refer to each row as a concatenation of its column n and m values (e.g. AX row, CZ row, etc.). I would like to divide each of the A rows by the AY row, each of the B rows by the BY row, and each of the C rows by the CY row (the base may not always be Y; sometimes it will be X or Z). Essentially, I want to rebase the data (columns s, b, and c) by group (where column n is the group), using X, Y, or Z (column m) as the base.

I need columns d, n, and m to remain untouched. If possible, I'd like to do this by referencing X, Y, or Z in the code directly to denote which row will be the base, rather than by [1], [2], or [3] (as they may not always be in the same order, and it's more intuitive to the user). I'm new to R and using dplyr but I haven't been able to figure out a good way of doing this.

Thanks for your help.

timmyb357 asked Jul 20 '17 19:07



1 Answer

Using data.table.

library(data.table)

setDT(df)

divselect <- "Y"

set(df, j = "s", value = as.numeric(df[["s"]]))
set(df, j = "b", value = as.numeric(df[["b"]]))
set(df, j = "c", value = as.numeric(df[["c"]]))

The set commands are there to avoid a type error. The columns are currently integer, but the division will produce doubles. If the columns in your real data are already double, this step isn't necessary.
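A quick way to see the type change described above, using only base R. The vector name s_example is made up for illustration and is not part of the answer's code:

```r
# 1:9 creates an integer vector, like column s in the sample data
s_example <- 1:9
class(s_example)                 # "integer"

# Dividing by one of its own elements yields a double vector
ratio <- s_example / s_example[2]
class(ratio)                     # "numeric"
typeof(ratio)                    # "double"
```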

The value of divselect determines which row in each group is used as the base, matched against column m. You can change it to "X" or "Z" as needed.

df[, `:=`(s = s/s[m == divselect],
          b = b/b[m == divselect],
          c = c/c[m == divselect]),
   by = n]

Result:

#      d n m     s         b         c
# 1: abc A X 0.500 0.8333333 0.9523810
# 2: abc A Y 1.000 1.0000000 1.0000000
# 3: abc A Z 1.500 1.1666667 1.0476190
# 4: abc B X 0.800 0.8888889 0.9583333
# 5: abc B Y 1.000 1.0000000 1.0000000
# 6: abc B Z 1.200 1.1111111 1.0416667
# 7: abc C X 0.875 0.9166667 0.9629630
# 8: abc C Y 1.000 1.0000000 1.0000000
# 9: abc C Z 1.125 1.0833333 1.0370370
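Since the question mentions dplyr, here is a rough sketch of the same rebasing with dplyr instead of data.table. It assumes dplyr 1.0.0 or later (for across), and the names sample_df, base_m, value_cols, and rebased are made up for illustration. It is an alternative, not part of the data.table approach above:

```r
library(dplyr)

# Same sample data as in the question, built under a different name
sample_df <- data.frame(
  d = rep("abc", 9),
  n = rep(c("A", "B", "C"), each = 3),
  m = rep(c("X", "Y", "Z"), times = 3),
  s = 1:9,
  b = 5:13,
  c = 20:28
)

base_m <- "Y"                    # which value of m is the base row
value_cols <- c("s", "b", "c")   # columns to rebase

rebased <- sample_df %>%
  group_by(n) %>%
  # Within each group, divide every value column by its value in the base row
  mutate(across(all_of(value_cols), ~ .x / .x[m == base_m])) %>%
  ungroup()

rebased
```

Here no prior conversion to double is needed, because mutate replaces each column with the result of the division. Like the data.table version, this assumes each group has exactly one row where m equals the base value.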

Followup

I have one question: is there a way to generalize the columns that get rebased? I'd like this code to be able to handle additional numeric columns (more than 3 without calling each out specifically). i.e. Can I define the division to happen to all columns except d, n, and m?

Yes, you can do this by using lapply either inside or outside the data.table.

setDT(df)

divselect <- "Y"

funcnumeric <- function(x) {
  set(df, j = x, value = as.numeric(df[[x]]))
  NULL
}

modcols <- names(df)[!(names(df) %in% c("d", "n", "m"))]

a <- lapply(modcols, funcnumeric)

This replaces the three set commands in the first answer. Instead of specifying each, we use lapply to perform the function on each column that is not d, n, or m. Note that I return NULL to avoid messy function return text; since this is data.table it is all done in place.
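If you would rather not list the excluded columns by name, one possible variation is to pick the columns by type. This is only a sketch: df_example and modcols_auto are made-up names, and it assumes every numeric column should be rebased, which would wrongly include things like numeric ID columns.

```r
# Small stand-in table with the same kinds of columns as the sample data
df_example <- data.frame(
  d = rep("abc", 3),
  n = rep("A", 3),
  m = c("X", "Y", "Z"),
  s = 1:3,
  b = c(5, 6, 7)
)

# Keep the names of all columns that are numeric (integer or double)
modcols_auto <- names(df_example)[vapply(df_example, is.numeric, logical(1))]
modcols_auto   # "s" "b"
```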

funcdiv <- function(x, pos) {
  x/x[pos]
}

df[ , (modcols) := lapply(.SD, 
                          funcdiv, 
                          pos = which(m == divselect)), 
    by = n, 
    .SDcols = modcols]

This is done slightly differently than before. Here we create a simple function that divides a vector by that vector's value at the position given by the pos parameter. We apply it to each column in .SD and pass pos as the position where column m equals divselect, in this case "Y". Since we specify by = n, both the vector and the pos argument passed to funcdiv are determined separately for each value of n. The .SDcols parameter specifies which columns we want to lapply the function over, which is the set of columns stored in modcols. We assign the results back to those same columns in place.
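To see what funcdiv does on its own, and what happens when a group has no base row, here is a small base R sketch. The checked variant funcdiv_checked is hypothetical and not part of the answer's code. It simply stops with a message when pos does not point at exactly one row:

```r
funcdiv <- function(x, pos) {
  x / x[pos]
}

# One group's worth of values, with the base in position 2
funcdiv(c(5, 6, 7), pos = 2)          # 0.8333333 1.0000000 1.1666667

# If no row matches, which() returns integer(0) and the result is empty
length(funcdiv(c(5, 6, 7), pos = integer(0)))   # 0

# Hypothetical stricter variant: insist on exactly one base row
funcdiv_checked <- function(x, pos) {
  if (length(pos) != 1L) {
    stop("expected exactly one base row per group, found ", length(pos))
  }
  x / x[pos]
}

funcdiv_checked(c(5, 6, 7), pos = 2)  # same result as funcdiv
```

Swapping funcdiv_checked in for funcdiv in the lapply call would turn a missing or duplicated base row into an explicit error instead of a silently wrong result.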

Result:

#      d n m     s         b         c
# 1: abc A X 0.500 0.8333333 0.9523810
# 2: abc A Y 1.000 1.0000000 1.0000000
# 3: abc A Z 1.500 1.1666667 1.0476190
# 4: abc B X 0.800 0.8888889 0.9583333
# 5: abc B Y 1.000 1.0000000 1.0000000
# 6: abc B Z 1.200 1.1111111 1.0416667
# 7: abc C X 0.875 0.9166667 0.9629630
# 8: abc C Y 1.000 1.0000000 1.0000000
# 9: abc C Z 1.125 1.0833333 1.0370370
Eric Watt answered Nov 15 '22 04:11
