Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference between sum(), length(which()), and nrow() in R

Tags:

r

When attempting to get a "count" for a specified number of observations, it appears that each of these functions work, as they derive the same result. But how are these functions operating differently in the background, and in what scenarios would it not be appropriate to swap one for the other?

sum(grade.data$Quiz >= (100*.45))

length(which(grade.data$Quiz >= (100*.45)))

nrow(grade.data[grade.data$Quiz >= (100*.45),])
like image 908
SocraticDatum Avatar asked Jul 31 '14 17:07

SocraticDatum


People also ask

What is NROW function in R?

nrow() function in R Language is used to return the number of rows of the specified matrix. Syntax: nrow(x) Parameters: x: matrix, vector, array or data frame.

What does the length function in R tell you?

length() function in R Programming Language is used to get or set the length of a vector (list) or other objects.

How do I find the length of a column in R?

You can use the length() function in R to calculate the length of vectors, lists, and other objects. where: x: The name of the object to calculate length for.

How do I find the length of a table in R?

In R programming language, to find the length of every elements in a list, the function lengths() can be used. This function loops over x and returns a compatible vector containing the length of each element in x.


2 Answers

The middle one will not give misleading answers when there are missing values. Both of the other ones will.

Number 1 sums a logical vector that is coerced to 1's and 0's. If you added na.rm it would be valid when NA's are present.

Number 2 determines the length of a numeric vector.

Number three constructs a subset and then counts the rows. I would expect it to be rather inefficient compared to the other two as well as having the problem with NA values. If you added & !is.na(grade.data$Quiz) to the logical expression inside [ , ], you would get valid answers.

A fourth method like the third (and also inefficient) without the NA problem would be:

nrow( subset( grade.data, Quiz >= (100*.45) ) )
like image 116
IRTFM Avatar answered Oct 03 '22 03:10

IRTFM


Let's generate 100k row data.frame to see which method is fastest.

grade.data = data.frame(Quiz = sample(100000), age = sample(18:24, 100000, replace = TRUE))
library(data.table)
dt.grade.data = as.data.table(grade.data)

The methods posted here

data.table = function(x) dt.grade.data[,sum(Quiz>=100*.45)]
logical.sum = function(x) sum(grade.data$Quiz >= (100*.45))
logical.counting.table = function(x) table(grade.data$Quiz >= (100*.45))[["TRUE"]]
logical.which = function(x) length(which(grade.data$Quiz >= (100*.45)))
subsetting = function(x) nrow(grade.data[grade.data$Quiz >= (100*.45),])
subset.cmd = function(x) nrow(subset(grade.data, Quiz >= (100*.45) ))

Benchmark

microbenchmark(data.table(), logical.sum(), logical.counting.table(), logical.pointless.which(), subsetting(), subset.cmd(), times = 100L)

Unit: microseconds
                   expr       min       lq         median    uq         max        neval
            data.table()      1766.148  2188.8000  2308.267  2469.405  29185.36   100
           logical.sum()       739.385   945.4765   993.921  1074.386  10253.67   100
logical.counting.table()     28867.605 30847.0290 31546.796 32725.255  65514.14   100
         logical.which()       701.205  1080.9555  1138.635  1228.545   3565.96   100
            subsetting()     27376.931 28406.7730 29243.866 30564.371 168034.45   100
            subset.cmd()     29004.315 31203.1730 32219.878 33362.003  89801.34   100

Seems that a vectorized logical check is the fastest method. In a smaller data frame (500 rows). data.table is actually much slower than all the other methods.

edit: Apparently, relatively efficiency of logical.sum() and logical.which() depends on the data structure. Using different Quiz score distribution can make the logical.sum() the fastest method. And as expected, data.table selection/subsetting blows data.frame subsetting out of the water.

like image 29
Vlo Avatar answered Oct 03 '22 04:10

Vlo