I have a data table shown below (dt <- fread("~/data.csv"):
chr gene_id S1 S2 S3 S4
chr1 a 30 50 70 90
chr2 b 40 60 80 100
chr3 c 50 70 90 120
chr4 d 60 80 100 130
As a final output, I want to normalize each column value by the sum of its column.
chr gene_id S1 S2 S3 S4
chr1 a 30/sum(S1) 50/sum(S2) 70/sum(S3) 90/sum(S4)
chr2 b 40/sum(S1) 60/sum(S2) 80/sum(S3) 100/sum(S4)
chr3 c 50/sum(S1) 70/sum(S2) 90/sum(S3) 120/sum(S4)
chr4 d 60/sum(S1) 80/sum(S2) 100/sum(S3) 130/sum(S4)
How can I get the final output above using data.table not dplyr from tidyverse?
You will only want to apply this change to numeric columns, so those must be identified. Then use lapply. This creates a NEW data.table. Chinsoon's answer will update by reference - which may be better depending on the scenario.
library(data.table)
iris_example <- data.table(iris)
numeric_cols <- names(iris)[sapply(iris, is.numeric)]
iris_example[, lapply(.SD, function(x) x/sum(x)), .SDcols = numeric_cols]
# if you want to update by reference
iris_example[, (numeric_cols) := lapply(.SD, function(x) x/sum(x)), .SDcols = numeric_cols]
You can use get
as follows:
cols <- paste0("S", 1:4)
dt[, (cols) := lapply(cols, function(x) get(x) / sum(get(x)))]
edit: added some timings for future reference
library(data.table)
DT <- as.data.table(matrix(rnorm(4*1e7), ncol=4))
#DT <- as.data.table(matrix(rnorm(4*1e4), nrow=4))
library(microbenchmark)
microbenchmark(
DT[, lapply(.SD, function(x) x/sum(x)), .SDcols=names(DT)],
DT[, .SD / lapply(.SD, sum), .SDcols=names(DT)],
DT[, lapply(names(DT), function(x) get(x) / sum(get(x)))],
#if you change ncol to nrow in matrix function above, you will understand why pple always ask you to go long format
melt(DT)[, value / sum(value), by=variable],#[,
#lapply(names(DT), function(x) .SD[variable==x, V1])],
times=3L
)
timings:
Unit: milliseconds
expr min lq mean median uq max neval
DT[, lapply(.SD, function(x) x/sum(x)), .SDcols = names(DT)] 245.9107 250.3230 278.4220 254.7353 294.6776 334.6200 3
DT[, .SD/lapply(.SD, sum), .SDcols = names(DT)] 534.5416 540.0630 602.4990 545.5845 636.4777 727.3708 3
DT[, lapply(names(DT), function(x) get(x)/sum(get(x)))] 248.1819 381.0036 440.7233 513.8254 536.9940 560.1625 3
melt(DT)[, value/sum(value), by = variable] 1219.2584 1250.5846 1425.0604 1281.9107 1527.9614 1774.0120 3
sessionInfo:
data.table_1.10.4-3
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With