Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sweep-like operations with dplyr/tidyverse

Tags:

r

dplyr

tidyverse

I am interested in, for example, replacing (nearly) all of the columns of data.frame or tibble with columns where the row minimum has been subtracted from each row. For example, if X is a numerical matrix, then in base R I would write:

X = sweep(X, 1, apply(X, 1, min))

My current function for doing this using the data I have--I'll explain the format momentarily--pulls out the numerical columns into a matrix, does the sweep, then cbinds the transformed data and the non-numerical data back together again. That is:

subtractMin = function(data){
  X = data %>% 
    select(starts_with("X")) %>% 
    as.matrix()

  X = sweep(X, 1, apply(X, 1, min))

  labels = data %>% 
    select(-starts_with("X"))

  return(cbind(labels, X))
}

This strikes me as inefficient, and there must be a smarter way.

I don't think it is important to know given the context, but my data has 77 rows and 1133 columns. Four of the columns contain label information, and the remaining 1129 contain the numerical measurements for each observation (they're spectra if you care). The number of numerical variables is such that individual mutates are not a way forward. Equally - you still need to know the row minimum to make the standardisation for each row.

I have been asked to add some data. The original data has over 1,000 columns, so I will provide a smaller data set

> x.df
      nm X1799.38928 X1798.01526 X1796.64124 source color rep
1 s001c1   13901.944   13889.056   13883.334     01     c   1
2 s001c2   17293.586   17279.375   17291.365     01     c   2
3 s001c3    8011.764    8028.584    8033.548     01     c   3
4 s001c4    7499.272    7510.719    7517.064     01     c   4
5 s001c5   20300.408   20293.604   20297.185     01     c   5
like image 581
James Curran Avatar asked Aug 09 '18 03:08

James Curran


2 Answers

I know you asked for tidyverse / dplyr but in case you overlooked base R, here's a solution:

ind <- !names(df) %in% c("nm","source","color","rep")
df[ind] <- df[ind] - do.call(pmin, df[ind])
df
#       nm X1799.38928 X1798.01526 X1796.64124 source color rep
# 1 s001c1      18.610       5.722       0.000      1     c   1
# 2 s001c2      14.211       0.000      11.990      1     c   2
# 3 s001c3       0.000      16.820      21.784      1     c   3
# 4 s001c4       0.000      11.447      17.792      1     c   4
# 5 s001c5       6.804       0.000       3.581      1     c   5

And I suppose this would qualify as a tidyverse solution (though not very idiomatic):

df %>% 
  split.default(!names(df) %in% c("nm","source","color","rep")) %>%
  map_at("TRUE", ~ .x - invoke(pmin,.x)) %>%
  bind_cols
#       nm source color rep X1799.38928 X1798.01526 X1796.64124
# 1 s001c1      1     c   1      18.610       5.722       0.000
# 2 s001c2      1     c   2      14.211       0.000      11.990
# 3 s001c3      1     c   3       0.000      16.820      21.784
# 4 s001c4      1     c   4       0.000      11.447      17.792
# 5 s001c5      1     c   5       6.804       0.000       3.581
like image 27
Moody_Mudskipper Avatar answered Oct 18 '22 12:10

Moody_Mudskipper


(For what it's worth, I think the down-votes are a bit harsh and unwarranted here. The problem statement is clear, and sample data has been included in an edit.)

You can achieve what you're after by converting data in numeric columns from wide to long (using gather), grouping by rows (using group_by), subtracting the minimum (using mutate), and converting back from long to wide (using spread).

library(tidyverse)
df %>%
    gather(k, v, starts_with("X")) %>%
    group_by(nm) %>%
    mutate(v = v - min(v)) %>%
    spread(k, v) %>%
    select(names(df))
## A tibble: 5 x 7
## Groups:   nm [5]
#  nm     X1799.38928 X1798.01526 X1796.64124 source color   rep
#  <fct>        <dbl>       <dbl>       <dbl>  <int> <fct> <int>
#1 s001c1       18.6         5.72        0.        1 c         1
#2 s001c2       14.2         0.         12.0       1 c         2
#3 s001c3        0.         16.8        21.8       1 c         3
#4 s001c4        0.         11.4        17.8       1 c         4
#5 s001c5        6.80        0.          3.58      1 c         5

Sample data

df <- read.table(text =
    "nm X1799.38928 X1798.01526 X1796.64124 source color rep
1 s001c1   13901.944   13889.056   13883.334     01     c   1
2 s001c2   17293.586   17279.375   17291.365     01     c   2
3 s001c3    8011.764    8028.584    8033.548     01     c   3
4 s001c4    7499.272    7510.719    7517.064     01     c   4
5 s001c5   20300.408   20293.604   20297.185     01     c   5")
like image 181
Maurits Evers Avatar answered Oct 18 '22 12:10

Maurits Evers