find column value and name based on minimum value in other column

Tags: r, data.table

I have a data.table that looks like this

library( data.table )

dt <- data.table( p1 = c("a", "b", "c", "d", "e", "f", "g"), 
                  p2 = c("b", "c", "d", "a", "f", "g", "h"), 
                  p3 = c("z", "x", NA, NA, "y", NA, "s"), 
                  t1 = c(1, 2, 3, NA, 5, 6, 7), 
                  t2 = c(7, 6, 5, NA, 3, 2, NA), 
                  t3 = c(8, 3, NA, NA, 2, NA, 1) )

#    p1 p2   p3 t1 t2 t3
# 1:  a  b    z  1  7  8
# 2:  b  c    x  2  6  3
# 3:  c  d <NA>  3  5 NA
# 4:  d  a <NA> NA NA NA
# 5:  e  f    y  5  3  2
# 6:  f  g <NA>  6  2 NA
# 7:  g  h    s  7 NA  1

It has p-columns, which hold names, and t-columns, which hold values. t1 is the value corresponding to p1, t2 to p2, and so on.
Within each row, the values in the p-columns are unique (or NA). The same goes for the values in the t-columns.

What I want to do is to create three new columns:

  • t_min, the minimum value of all t-columns for each row (excluding NAs)
  • p_min, if t_min exists (is not NA), the value of the corresponding p-column. So if the t2-column holds the t_min value, p_min is the value of column p2.
  • p_col_min, the name of the column that holds the value of p_min. So if the p_min value comes from column p2, then "p2".

I prefer a data.table solution, since my actual data contains many more rows and columns. I know melting is an option, but I would like to keep memory use down with this data, so the less memory used the better (production data contains several million rows and >200 columns).

So far I've found a way to create the t_min-column using the following:

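t_cols = dt[ , .SD, .SDcols = grep( "t[1-3]", names( dt ), value = TRUE ) ]
# Note: !all( is.na( t_cols ) ) evaluates to a single TRUE, so every row is processed;
# pmin( ..., na.rm = TRUE ) already returns NA for rows where all t-values are NA.
dt[ !all( is.na( t_cols ) ), 
    t_min := do.call( pmin, c( .SD, list( na.rm = TRUE ) ) ), 
    .SDcols = names( t_cols ) ]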

But I cannot wrap my head around creating the p_min and p_col_min columns. I suppose which.min() comes into play somewhere, but I cannot figure out how. I'm probably overlooking something simple (it always seems that way ;-) ).

Desired output:

dt.desired <- data.table( p1 = c("a", "b", "c", "d", "e", "f", "g"), 
                          p2 = c("b", "c", "d", "a", "f", "g", "h"), 
                          p3 = c("z", "x", NA, NA, "y", NA, "s"), 
                          t1 = c(1, 2, 3, NA, 5, 6, 7), 
                          t2 = c(7, 6, 5, NA, 3, 2, NA), 
                          t3 = c(8, 3, NA, NA, 2, NA, 1),
                          t_min = c(1,2,3,NA,2,2,1),
                          p_min = c("a","b","c",NA,"y","g","s"),
                          p_col_min = c("p1","p1","p1",NA,"p3","p2","p3") )

#    p1 p2   p3 t1 t2 t3 t_min p_min p_col_min
# 1:  a  b    z  1  7  8     1     a        p1
# 2:  b  c    x  2  6  3     2     b        p1
# 3:  c  d <NA>  3  5 NA     3     c        p1
# 4:  d  a <NA> NA NA NA    NA  <NA>      <NA>
# 5:  e  f    y  5  3  2     2     y        p3
# 6:  f  g <NA>  6  2 NA     2     g        p2
# 7:  g  h    s  7 NA  1     1     s        p3
Wimpel asked Jan 25 '23 10:01


2 Answers

I cannot guarantee that this is efficient enough for your real data, but this is what I would try first:

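# Pull the value (t) and name (p) columns out as matrices
m1 <- as.matrix(dt[, grep('^t', names(dt)), with = FALSE])
m2 <- as.matrix(dt[, grep('^p', names(dt)), with = FALSE])

# Row-wise minimum; all-NA rows give Inf (with a warning), which is reset to NA
t_min <- apply(m1, 1, min, na.rm = TRUE)
t_min[is.infinite(t_min)] <- NA_real_

# Position of the minimum in each row that has one
# (drop = FALSE keeps a matrix even when only one row qualifies)
p_min_index <- rep(NA_integer_, length(t_min))
p_min_index[!is.na(t_min)] <- apply(m1[!is.na(t_min), , drop = FALSE], 1, which.min)

dt[, t_min  := t_min]
dt[, p_min := m2[cbind(seq_len(nrow(m2)), p_min_index)] ]
dt[, p_min_col := grep('^p', names(dt), value = TRUE)[p_min_index] ]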


#    p1 p2   p3 t1 t2 t3 t_min p_min p_min_col
# 1:  a  b    z  1  7  8     1     a        p1
# 2:  b  c    x  2  6  3     2     b        p1
# 3:  c  d <NA>  3  5 NA     3     c        p1
# 4:  d  a <NA> NA NA NA    NA  <NA>      <NA>
# 5:  e  f    y  5  3  2     2     y        p3
# 6:  f  g <NA>  6  2 NA     2     g        p2
# 7:  g  h    s  7 NA  1     1     s        p3

In addition, it looks like the 2nd row in your desired output is incorrect?
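
As a possible variation on the matrix approach above, the two apply() calls could be swapped for base R's vectorised max.col(), which may matter with millions of rows. This is only a sketch, not a benchmarked solution: the anchored patterns "^t[0-9]+$" and "^p[0-9]+$" and the helper names (t_names, p_names, m_t, m_p, all_na, idx) are my own assumptions, and as.matrix() still copies the selected columns.

```r
library(data.table)

dt <- data.table(p1 = c("a", "b", "c", "d", "e", "f", "g"),
                 p2 = c("b", "c", "d", "a", "f", "g", "h"),
                 p3 = c("z", "x", NA, NA, "y", NA, "s"),
                 t1 = c(1, 2, 3, NA, 5, 6, 7),
                 t2 = c(7, 6, 5, NA, 3, 2, NA),
                 t3 = c(8, 3, NA, NA, 2, NA, 1))

# Assumption: the paired columns are named p<number> / t<number>, in the same order.
# Anchored patterns keep columns added later (t_min, p_min, ...) from matching.
t_names <- grep("^t[0-9]+$", names(dt), value = TRUE)
p_names <- grep("^p[0-9]+$", names(dt), value = TRUE)

m_t <- as.matrix(dt[, t_names, with = FALSE])
m_p <- as.matrix(dt[, p_names, with = FALSE])

# Remember which rows have no t-value at all, then let NA lose every comparison
all_na <- rowSums(!is.na(m_t)) == 0L
m_t[is.na(m_t)] <- Inf

# Column position of the row-wise minimum (maximum of the negated matrix)
idx <- max.col(-m_t, ties.method = "first")
idx[all_na] <- NA_integer_

# Matrix indexing with (row, column) pairs; an NA column index yields NA
rows <- seq_len(nrow(m_t))
dt[, t_min := m_t[cbind(rows, idx)]]
dt[, p_min := m_p[cbind(rows, idx)]]
dt[, p_col_min := p_names[idx]]
```

Row 4, where every t-value is NA, ends up with NA in all three new columns because its index is set to NA before the lookups.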

mt1022 answered Jan 28 '23 01:01


A simple and efficient approach is to loop through the "t*" columns and track the running minimum, together with its "p*" value and column, in a single pass.

First initialize appropriate vectors:

p.columns = which(startsWith(names(dt), "p"))
t.columns = which(startsWith(names(dt), "t"))

p_col_min = integer(nrow(dt))
p_min = character(nrow(dt))
t_min = rep_len(Inf, nrow(dt))

and iterate while updating:

for(i in seq_along(p.columns)) {
    cur.min = which(dt[[t.columns[i]]] < t_min)

    p_col_min[cur.min] = p.columns[i]

    t_min[cur.min] = dt[[t.columns[i]]][cur.min]
    p_min[cur.min] = dt[[p.columns[i]]][cur.min]
}
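
The loop above needs no explicit NA handling because which() silently drops comparisons that evaluate to NA. A tiny illustration with made-up values (x and this t_min are hypothetical, not taken from dt):

```r
x     <- c(1, NA, 3)     # a "t" column containing an NA
t_min <- c(Inf, Inf, 2)  # running minima so far

x < t_min
# [1]  TRUE    NA FALSE

which(x < t_min)         # only positions that are TRUE; the NA is skipped
# [1] 1
```

So an NA in a t-column never overwrites a minimum found earlier, and rows that are NA in every t-column simply keep their initial Inf.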

Finally fill with NAs where needed:

whichNA = is.infinite(t_min)
is.na(t_min) = is.na(p_min) = is.na(p_col_min) = whichNA

t_min
#[1]  1  2  3 NA  2  2  1
p_min
#[1] "a" "b" "c" NA  "y" "g" "s"
p_col_min
#[1]  1  1  1 NA  3  2  3
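
Note that p_col_min above holds column positions rather than names. If the column names are wanted and the results should be attached to dt, something along these lines might work. It is a self-contained sketch of the same single-pass idea, assuming the p- and t-columns are named p<number> / t<number> and are paired by order; p_names and t_names are helper names I made up.

```r
library(data.table)

dt <- data.table(p1 = c("a", "b", "c", "d", "e", "f", "g"),
                 p2 = c("b", "c", "d", "a", "f", "g", "h"),
                 p3 = c("z", "x", NA, NA, "y", NA, "s"),
                 t1 = c(1, 2, 3, NA, 5, 6, 7),
                 t2 = c(7, 6, 5, NA, 3, 2, NA),
                 t3 = c(8, 3, NA, NA, 2, NA, 1))

# Assumption: p_names[i] is the name column that belongs to t_names[i]
p_names <- grep("^p[0-9]+$", names(dt), value = TRUE)
t_names <- grep("^t[0-9]+$", names(dt), value = TRUE)

n         <- nrow(dt)
t_min     <- rep_len(Inf, n)
p_min     <- rep_len(NA_character_, n)
p_col_min <- rep_len(NA_character_, n)

for (i in seq_along(t_names)) {
  t_i <- dt[[t_names[i]]]
  cur <- which(t_i < t_min)           # rows where this column beats the running minimum
  t_min[cur]     <- t_i[cur]
  p_min[cur]     <- dt[[p_names[i]]][cur]
  p_col_min[cur] <- p_names[i]        # store the name instead of the position
}

# Rows that never saw a non-NA t-value still hold Inf
t_min[is.infinite(t_min)] <- NA_real_

# Add the three vectors to dt by reference
set(dt, j = "t_min",     value = t_min)
set(dt, j = "p_min",     value = p_min)
set(dt, j = "p_col_min", value = p_col_min)
```

Starting p_min and p_col_min as NA_character_ means only t_min needs the final Inf-to-NA fix.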
alexis_laz answered Jan 28 '23 01:01