data.table row-wise sum, mean, min, max like dplyr?

Tags:

There are other posts about row-wise operators on datatable. They are either too simple or solves a specific scenario

My question here is more generic. There is a solution using dplyr. I have played around but failed to find a an equivalent solution using data.table syntax. Can you please suggest an elegant data.table solution that reproduce the same results than the dplyr version?

EDIT 1: Summary of benchmarks of the suggested solutions on real dataset (10MB, 73000 rows, stats made on 24 numeric columns). The benchmark results is subjective. However, the elapsed time is consistently reproducible.

| Solution By | Speed compared to dplyr     | |-------------|-----------------------------| | Metrics v1  |  4.3 times SLOWER (use .SD) | | Metrics v2  |  5.6 times FASTER           | | ExperimenteR| 15   times FASTER           | | Arun v1     |  3   times FASTER (Map func)| | Arun v2     |  3   times FASTER (foo func)| | Ista        |  4.5 times FASTER           |

EDIT 2: I have added NACount column a day after. This is why this column is not found in the solutions suggested by various contributors.

Data Setup

library(data.table) dt <- data.table(ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),     Country = c("CA", "FR", "FR", "CA", "CA"),     Q1 = c(NA, 61, 40, 54, NA), Q2 = c(22,  8, NA,  5, NA),     Q3 = c(51, NA, NA, 16, NA), Q4 = c(79, 10, 49, NA, NA))  #    ProductName Country Q1 Q2 Q3 Q4 # 1:     Lettuce      CA NA 22 51 79 # 2:    Beetroot      FR 61  8 NA 10 # 3:     Spinach      FR 40 NA NA 49 # 4:        Kale      CA 54  5 16 NA # 5:      Carrot      CA NA NA NA NA

SOLUTION using dplyr + rowwise()

library(dplyr) ; library(magrittr) dt %>% rowwise() %>%      transmute(ProductName, Country, Q1, Q2, Q3, Q4,      AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),      MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),      MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),      SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),      NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))  #   ProductName Country Q1 Q2 Q3 Q4      AVG MIN  MAX SUM NAcnt # 1     Lettuce      CA NA 22 51 79 50.66667  22   79 152     1 # 2    Beetroot      FR 61  8 NA 10 26.33333   8   61  79     1 # 3     Spinach      FR 40 NA NA 49 44.50000  40   49  89     2 # 4        Kale      CA 54  5 16 NA 25.00000   5   54  75     1 # 5      Carrot      CA NA NA NA NA      NaN Inf -Inf   0     4

ERROR with data.table (compute entire column instead of per-row)

dt[, .(ProductName, Country, Q1, Q2, Q3, Q4,     AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),     MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),     MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),     SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),     NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))]  #    ProductName Country Q1 Q2 Q3 Q4      AVG MIN MAX SUM NAcnt # 1:     Lettuce      CA NA 22 51 79 35.90909   5  79 395     9 # 2:    Beetroot      FR 61  8 NA 10 35.90909   5  79 395     9 # 3:     Spinach      FR 40 NA NA 49 35.90909   5  79 395     9 # 4:        Kale      CA 54  5 16 NA 35.90909   5  79 395     9 # 5:      Carrot      CA NA NA NA NA 35.90909   5  79 395     9

ALMOST solution but more complex and missing Q1,Q2,Q3,Q4 output columns

dtmelt <- reshape2::melt(dt, id=c("ProductName", "Country"),             variable.name="Quarter", value.name="Qty")  dtmelt[, .(AVG = mean(Qty, na.rm=TRUE),     MIN = min (Qty, na.rm=TRUE),     MAX = max (Qty, na.rm=TRUE),     SUM = sum (Qty, na.rm=TRUE),     NAcnt= sum(is.na(Qty))), by = list(ProductName, Country)]  #    ProductName Country      AVG MIN  MAX SUM NAcnt # 1:     Lettuce      CA 50.66667  22   79 152     1 # 2:    Beetroot      FR 26.33333   8   61  79     1 # 3:     Spinach      FR 44.50000  40   49  89     2 # 4:        Kale      CA 25.00000   5   54  75     1 # 5:      Carrot      CA      NaN Inf -Inf   0     4

619

asked Jul 07 '15 01:07

Polymerase

2 Answers

You can use an efficient row-wise functions from matrixStats package.

library(matrixStats) dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),           MAX = rowMaxs(as.matrix(.SD), na.rm=T),           AVG = rowMeans(.SD, na.rm=T),           SUM = rowSums(.SD, na.rm=T)), .SDcols=c(Q1, Q2,Q3,Q4)]  dt #    ProductName Country Q1 Q2 Q3 Q4 MIN  MAX      AVG SUM # 1:     Lettuce      CA NA 22 51 79  22   79 50.66667 152 # 2:    Beetroot      FR 61  8 NA 10   8   61 26.33333  79 # 3:     Spinach      FR 40 NA 79 49  40   79 56.00000 168 # 4:        Kale      CA 54  5 16 NA   5   54 25.00000  75 # 5:      Carrot      CA NA NA NA NA Inf -Inf      NaN   0

For dataset with 500000 rows(using the data.table from CRAN)

dt <- rbindlist(lapply(1:100000, function(i)dt)) system.time(dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),                       MAX = rowMaxs(as.matrix(.SD), na.rm=T),                       AVG = rowMeans(.SD, na.rm=T),                       SUM = rowSums(.SD, na.rm=T)), .SDcols=c("Q1", "Q2","Q3","Q4")]) #  user  system elapsed  # 0.089   0.004   0.093

rowwise (or by=1:nrow(dt)) is "euphemism" for for loop, as exemplified by

library(dplyr) ; library(magrittr) system.time(dt %>% rowwise() %>%    transmute(ProductName, Country, Q1, Q2, Q3, Q4,             MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),             MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),             AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),             SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE))) #   user  system elapsed  # 80.832   0.111  80.974   system.time(dt[, `:=`(AVG= mean(as.numeric(.SD),na.rm=TRUE),MIN = min(.SD, na.rm=TRUE),MAX = max(.SD, na.rm=TRUE),SUM = sum(.SD, na.rm=TRUE)),.SDcols=c("Q1", "Q2","Q3","Q4"),by=1:nrow(dt)] ) #    user  system elapsed  # 141.492   0.196 141.757

answered Sep 17 '22 14:09

ExperimenteR

With by=1:nrow(dt), performs the rowwise operation in data.table

 library(data.table) dt[, `:=`(AVG= mean(as.numeric(.SD),na.rm=TRUE),MIN = min(.SD, na.rm=TRUE),MAX = max(.SD, na.rm=TRUE),SUM = sum(.SD, na.rm=TRUE)),.SDcols=c(Q1, Q2,Q3,Q4),by=1:nrow(dt)]     ProductName Country Q1 Q2 Q3 Q4      AVG MIN  MAX SUM 1:     Lettuce      CA NA 22 51 79 50.66667  22   79 152 2:    Beetroot      FR 61  8 NA 10 26.33333   8   61  79 3:     Spinach      FR 40 NA 79 49 56.00000  40   79 168 4:        Kale      CA 54  5 16 NA 25.00000   5   54  75 5:      Carrot      CA NA NA NA NA      NaN Inf -Inf   0  Warning messages: 1: In min(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) :   no non-missing arguments to min; returning Inf 2: In max(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) :   no non-missing arguments to max; returning -Inf

You got warning messages, because in row 5, you are computing max, sum, min, and max of nothing. For example, see below:

min(c(NA,NA,NA,NA),na.rm=TRUE) [1] Inf Warning message: In min(c(NA, NA, NA, NA), na.rm = TRUE) :   no non-missing arguments to min; returning Inf

answered Sep 16 '22 14:09

Metrics

Related questions
                            
                                How to directly select the same column from all nested lists within a list?
                            
                                Using grep to help subset a data frame
                            
                                How to get parameters from config file in R script
                            
                                find all functions (including private) in a package
                            
                                R: `which` statement with multiple conditions
                            
                                Inconsistent results for f(g(x)) together or split up
                            
                                How can I add another layer / new series to a ggplot?
                            
                                Changing x axis tick labels in R using ggplot2 [duplicate]
                            
                                How to fill a matrix with zero (0)
                            
                                using R to copy files
                            
                                Elegantly assigning multiple columns in data.table with lapply()
                            
                                short formula call for many variables when building a model [duplicate]
                            
                                Remove strip background keep panel border
                            
                                Sum of rows based on column value
                            
                                Plot coordinates on map
                            
                                Add author affiliation in R markdown beamer presentation
                            
                                How to generate permutations or combinations of object in R?
                            
                                Useful little functions in R? [closed]
                            
                                How to generate a frequency table in R with with cumulative frequency and relative frequency
                            
                                Boxplot in R showing the mean

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

data.table row-wise sum, mean, min, max like dplyr?

Tags:

r

data.table

dplyr

Polymerase

People also ask

2 Answers

ExperimenteR

Metrics

Recent Activity

Donate For Us