When attempting to get a "count" for a specified number of observations, it appears that each of these functions work, as they derive the same result. But how are these functions operating differently in the background, and in what scenarios would it not be appropriate to swap one for the other?
sum(grade.data$Quiz >= (100*.45))
length(which(grade.data$Quiz >= (100*.45)))
nrow(grade.data[grade.data$Quiz >= (100*.45),])
nrow() returns the number of rows of a matrix or data frame. length() gets or sets the length of a vector, list, or other object; note that for a data frame, length() returns the number of columns, not rows. A related function, lengths(), loops over a list x and returns a vector giving the length of each element of x.
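The column-counting behavior is an easy trap when swapping these functions. A minimal sketch on toy data (not the question's grade.data):

```r
d <- data.frame(Quiz = c(50, 10, 60), age = c(19, 21, 20))

nrow(d)         # 3 -- number of rows
length(d)       # 2 -- a data frame is a list of columns, so this counts columns
length(d$Quiz)  # 3 -- length of the extracted vector, which is what the question's code relies on
```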
The middle one will not give misleading answers when there are missing values. Both of the other ones will.
Number 1 sums a logical vector that is coerced to 1's and 0's. If you added na.rm = TRUE it would be valid when NA's are present.
Number 2 determines the length of a numeric vector; which() drops NA's, so it is already safe.
Number 3 constructs a subset and then counts the rows. I would expect it to be rather inefficient compared to the other two, as well as having the problem with NA values: logical indexing with NA produces rows of NA's, which get counted. If you added & !is.na(grade.data$Quiz) to the logical expression inside [ , ], you would get valid answers.
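The NA behavior described above can be seen on a toy vector (toy data, not the question's grade.data):

```r
x <- c(50, NA, 30)

sum(x >= 45)                 # NA -- the NA comparison poisons the sum
sum(x >= 45, na.rm = TRUE)   # 1  -- valid once NA's are removed
length(which(x >= 45))       # 1  -- which() drops NA's, so this is already valid

d <- data.frame(Quiz = x, age = c(19, 21, 20))
nrow(d[d$Quiz >= 45, ])                   # 2 -- the NA index keeps a row of NA's
nrow(d[d$Quiz >= 45 & !is.na(d$Quiz), ])  # 1 -- valid after the is.na() guard
```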
A fourth method like the third (and also inefficient) without the NA problem would be:
nrow( subset( grade.data, Quiz >= (100*.45) ) )
Let's generate a 100k-row data.frame to see which method is fastest.
grade.data = data.frame(Quiz = sample(100000), age = sample(18:24, 100000, replace = TRUE))
library(data.table)
dt.grade.data = as.data.table(grade.data)
The methods posted here
data.table = function(x) dt.grade.data[,sum(Quiz>=100*.45)]
logical.sum = function(x) sum(grade.data$Quiz >= (100*.45))
logical.counting.table = function(x) table(grade.data$Quiz >= (100*.45))[["TRUE"]]
logical.which = function(x) length(which(grade.data$Quiz >= (100*.45)))
subsetting = function(x) nrow(grade.data[grade.data$Quiz >= (100*.45),])
subset.cmd = function(x) nrow(subset(grade.data, Quiz >= (100*.45) ))
Benchmark
library(microbenchmark)
microbenchmark(data.table(), logical.sum(), logical.counting.table(), logical.which(), subsetting(), subset.cmd(), times = 100L)
Unit: microseconds
expr min lq median uq max neval
data.table() 1766.148 2188.8000 2308.267 2469.405 29185.36 100
logical.sum() 739.385 945.4765 993.921 1074.386 10253.67 100
logical.counting.table() 28867.605 30847.0290 31546.796 32725.255 65514.14 100
logical.which() 701.205 1080.9555 1138.635 1228.545 3565.96 100
subsetting() 27376.931 28406.7730 29243.866 30564.371 168034.45 100
subset.cmd() 29004.315 31203.1730 32219.878 33362.003 89801.34 100
Seems that a vectorized logical check is the fastest method. In a smaller data frame (500 rows), data.table is actually much slower than all the other methods.
edit: Apparently, the relative efficiency of logical.sum() and logical.which() depends on the data structure. Using a different Quiz score distribution can make logical.sum() the fastest method. And as expected, data.table selection/subsetting blows data.frame subsetting out of the water.
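For completeness, data.table's idiomatic way to count matching rows uses the built-in .N symbol inside [ ], which avoids materializing a subset; unlike data.frame indexing, an NA in the logical filter is treated as FALSE, so NA rows are simply dropped. A small sketch on toy data (not the benchmark's dt.grade.data):

```r
library(data.table)

dt <- data.table(Quiz = c(50, 10, 60, NA))

dt[Quiz >= 45, .N]  # 2 -- .N counts rows passing the filter; the NA row is excluded
```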