I am interested in, for example, replacing (nearly) all of the columns of data.frame
or tibble
with columns where the row minimum has been subtracted from each row. For example, if X
is a numerical matrix, then in base R I would write:
X = sweep(X, 1, apply(X, 1, min))
My current function for doing this using the data I have--I'll explain the format momentarily--pulls out the numerical columns into a matrix, does the sweep, then cbind
s the transformed data and the non-numerical data back together again. That is:
subtractMin = function(data){
X = data %>%
select(starts_with("X")) %>%
as.matrix()
X = sweep(X, 1, apply(X, 1, min))
labels = data %>%
select(-starts_with("X"))
return(cbind(labels, X))
}
This strikes me as inefficient, and there must be a smarter way.
I don't think it is important to know given the context, but my data has 77 rows and 1133 columns. Four of the columns contain label information, and the remaining 1129 contain the numerical measurements for each observation (they're spectra if you care). The number of numerical variables is such that individual mutate
s are not a way forward. Equally - you still need to know the row minimum to make the standardisation for each row.
I have been asked to add some data. The original data has over 1,000 columns, so I will provide a smaller data set
> x.df
nm X1799.38928 X1798.01526 X1796.64124 source color rep
1 s001c1 13901.944 13889.056 13883.334 01 c 1
2 s001c2 17293.586 17279.375 17291.365 01 c 2
3 s001c3 8011.764 8028.584 8033.548 01 c 3
4 s001c4 7499.272 7510.719 7517.064 01 c 4
5 s001c5 20300.408 20293.604 20297.185 01 c 5
I know you asked for tidyverse
/ dplyr
but in case you overlooked base R, here's a solution:
ind <- !names(df) %in% c("nm","source","color","rep")
df[ind] <- df[ind] - do.call(pmin, df[ind])
df
# nm X1799.38928 X1798.01526 X1796.64124 source color rep
# 1 s001c1 18.610 5.722 0.000 1 c 1
# 2 s001c2 14.211 0.000 11.990 1 c 2
# 3 s001c3 0.000 16.820 21.784 1 c 3
# 4 s001c4 0.000 11.447 17.792 1 c 4
# 5 s001c5 6.804 0.000 3.581 1 c 5
And I suppose this would qualify as a tidyverse
solution (though not very idiomatic):
df %>%
split.default(!names(df) %in% c("nm","source","color","rep")) %>%
map_at("TRUE", ~ .x - invoke(pmin,.x)) %>%
bind_cols
# nm source color rep X1799.38928 X1798.01526 X1796.64124
# 1 s001c1 1 c 1 18.610 5.722 0.000
# 2 s001c2 1 c 2 14.211 0.000 11.990
# 3 s001c3 1 c 3 0.000 16.820 21.784
# 4 s001c4 1 c 4 0.000 11.447 17.792
# 5 s001c5 1 c 5 6.804 0.000 3.581
(For what it's worth, I think the down-votes are a bit harsh and unwarranted here. The problem statement is clear, and sample data has been included in an edit.)
You can achieve what you're after by converting data in numeric columns from wide to long (using gather
), grouping by rows (using group_by
), subtracting the minimum (using mutate
), and converting back from long to wide (using spread
).
library(tidyverse)
df %>%
gather(k, v, starts_with("X")) %>%
group_by(nm) %>%
mutate(v = v - min(v)) %>%
spread(k, v) %>%
select(names(df))
## A tibble: 5 x 7
## Groups: nm [5]
# nm X1799.38928 X1798.01526 X1796.64124 source color rep
# <fct> <dbl> <dbl> <dbl> <int> <fct> <int>
#1 s001c1 18.6 5.72 0. 1 c 1
#2 s001c2 14.2 0. 12.0 1 c 2
#3 s001c3 0. 16.8 21.8 1 c 3
#4 s001c4 0. 11.4 17.8 1 c 4
#5 s001c5 6.80 0. 3.58 1 c 5
df <- read.table(text =
"nm X1799.38928 X1798.01526 X1796.64124 source color rep
1 s001c1 13901.944 13889.056 13883.334 01 c 1
2 s001c2 17293.586 17279.375 17291.365 01 c 2
3 s001c3 8011.764 8028.584 8033.548 01 c 3
4 s001c4 7499.272 7510.719 7517.064 01 c 4
5 s001c5 20300.408 20293.604 20297.185 01 c 5")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With