Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I normalize each column value by the sum of its column with data.table library?

Tags:

r

data.table

I have a data table shown below (dt <- fread("~/data.csv"):

chr    gene_id       S1         S2         S3           S4
chr1      a          30         50         70           90
chr2      b          40         60         80           100
chr3      c          50         70         90           120
chr4      d          60         80         100          130

As a final output, I want to normalize each column value by the sum of its column.

chr    gene_id       S1              S2           S3            S4
chr1     a        30/sum(S1)     50/sum(S2)    70/sum(S3)   90/sum(S4)
chr2     b        40/sum(S1)     60/sum(S2)    80/sum(S3)  100/sum(S4)
chr3     c        50/sum(S1)     70/sum(S2)    90/sum(S3)  120/sum(S4)
chr4     d        60/sum(S1)     80/sum(S2)   100/sum(S3)  130/sum(S4)

How can I get the final output above using data.table not dplyr from tidyverse?

like image 473
user7755336 Avatar asked Mar 06 '23 04:03

user7755336


2 Answers

You will only want to apply this change to numeric columns, so those must be identified. Then use lapply. This creates a NEW data.table. Chinsoon's answer will update by reference - which may be better depending on the scenario.

library(data.table)
iris_example <- data.table(iris)

numeric_cols <- names(iris)[sapply(iris, is.numeric)]

iris_example[, lapply(.SD, function(x) x/sum(x)), .SDcols = numeric_cols]

# if you want to update by reference
iris_example[, (numeric_cols) := lapply(.SD, function(x) x/sum(x)), .SDcols = numeric_cols]
like image 196
zacdav Avatar answered Apr 07 '23 06:04

zacdav


You can use get as follows:

cols <- paste0("S", 1:4)
dt[, (cols) := lapply(cols, function(x) get(x) / sum(get(x)))]

edit: added some timings for future reference

library(data.table) 

DT <- as.data.table(matrix(rnorm(4*1e7), ncol=4))
#DT <- as.data.table(matrix(rnorm(4*1e4), nrow=4))

library(microbenchmark)
microbenchmark(

    DT[, lapply(.SD, function(x) x/sum(x)), .SDcols=names(DT)],

    DT[, .SD / lapply(.SD, sum), .SDcols=names(DT)],

    DT[, lapply(names(DT), function(x) get(x) / sum(get(x)))],

    #if you change ncol to nrow in matrix function above, you will understand why pple always ask you to go long format
    melt(DT)[, value / sum(value), by=variable],#[,
        #lapply(names(DT), function(x) .SD[variable==x, V1])],

    times=3L
)

timings:

Unit: milliseconds
                                                         expr       min        lq      mean    median        uq       max neval
 DT[, lapply(.SD, function(x) x/sum(x)), .SDcols = names(DT)]  245.9107  250.3230  278.4220  254.7353  294.6776  334.6200     3
              DT[, .SD/lapply(.SD, sum), .SDcols = names(DT)]  534.5416  540.0630  602.4990  545.5845  636.4777  727.3708     3
      DT[, lapply(names(DT), function(x) get(x)/sum(get(x)))]  248.1819  381.0036  440.7233  513.8254  536.9940  560.1625     3
                  melt(DT)[, value/sum(value), by = variable] 1219.2584 1250.5846 1425.0604 1281.9107 1527.9614 1774.0120     3

sessionInfo:

data.table_1.10.4-3 
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
like image 25
chinsoon12 Avatar answered Apr 07 '23 08:04

chinsoon12