Difference between subset and filter from dplyr

Tags: r, filter, subset

It seems to me that subset and filter (from dplyr) give the same result. My question is: is there a potential difference between them at some point, e.g. in speed or in the data sizes they can handle? Are there situations where it is better to use one or the other?

Example:

```r
library(dplyr)

df1 <- subset(airquality, Temp > 80 & Month > 5)
df2 <- filter(airquality, Temp > 80 & Month > 5)

summary(df1$Ozone)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#    9.00   39.00   64.00   64.51   84.00  168.00      14

summary(df2$Ozone)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#    9.00   39.00   64.00   64.51   84.00  168.00      14
```


asked Oct 05 '16 19:10

Ruthger Righart

2 Answers

They do indeed produce the same result here, and they are very similar in concept.
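One way to check this on the question's data is to compare the two results directly. This is only a minimal sketch and assumes dplyr is installed. Attributes are ignored in the comparison because the two functions may label the rows differently.

```r
library(dplyr)

df1 <- subset(airquality, Temp > 80 & Month > 5)
df2 <- filter(airquality, Temp > 80 & Month > 5)

# Compare column contents only. Row names and other attributes are
# ignored, since subset() keeps the original row labels.
same_values <- isTRUE(all.equal(df1, df2, check.attributes = FALSE))
same_values
```

Dropping check.attributes = FALSE would make the comparison stricter and would also report differences in row labels.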

The advantage of subset is that it is part of base R and doesn't require any additional packages. On small data it also seems to be somewhat faster than filter (about five times faster by median in your example, but that's measured in microseconds).

As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 rows, filter outpaces subset by roughly 200 microseconds (median of about 1,024 vs. 1,217). And at 153,000 rows, filter is about two and a half times faster by median (about 5.4 vs. 13.5 milliseconds).

So in terms of human time, I don't think there's much difference between the two.

The other advantage of filter (and this is a bit of a niche advantage) is that it can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.
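As a rough illustration of the database case, the sketch below runs the same filter against an in-memory SQLite table. It assumes the DBI, RSQLite and dbplyr packages are installed; the SQLite database and the table name "airquality" are stand-ins chosen for this example, not part of the original answer.

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed.
# An in-memory SQLite database stands in for a real database server.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "airquality", airquality)

# Lazy reference to the remote table; no rows are fetched yet.
air_db <- tbl(con, "airquality")

# The filter is translated to SQL instead of being evaluated in R.
hot_db <- air_db %>% filter(Temp > 80 & Month > 5)
show_query(hot_db)

# collect() runs the query and brings only the matching rows into R.
hot <- collect(hot_db)

DBI::dbDisconnect(con)
```

Until collect() is called, the work stays in the database, which is what makes this useful for tables that do not fit in memory.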

Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.

```r
library(dplyr)
library(microbenchmark)

# Original example
microbenchmark(
  df1 <- subset(airquality, Temp > 80 & Month > 5),
  df2 <- filter(airquality, Temp > 80 & Month > 5)
)
# Unit: microseconds
#    expr     min       lq     mean   median      uq      max neval cld
#  subset  95.598 107.7670 118.5236 119.9370 125.949  167.443   100  a
#  filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997   100   b

# 15,300 rows
air <- lapply(1:100, function(x) airquality) %>% bind_rows()

microbenchmark(
  df1 <- subset(air, Temp > 80 & Month > 5),
  df2 <- filter(air, Temp > 80 & Month > 5)
)
# Unit: microseconds
#    expr      min        lq     mean   median       uq      max neval cld
#  subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392   100   b
#  filter  968.586  985.4475 1056.686 1023.862 1036.765 2489.644   100  a

# 153,000 rows
air <- lapply(1:1000, function(x) airquality) %>% bind_rows()

microbenchmark(
  df1 <- subset(air, Temp > 80 & Month > 5),
  df2 <- filter(air, Temp > 80 & Month > 5)
)
# Unit: milliseconds
#    expr       min        lq     mean    median        uq      max neval cld
#  subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659   100   b
#  filter  5.046148  5.169164 10.27829  5.387484  6.738167 65.38937   100  a
```
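To repeat the comparison at other data sizes, the benchmark can be wrapped in a small function. This is a sketch only: bench_at and its arguments are hypothetical names introduced here, and the timings will differ by machine and package version.

```r
library(dplyr)
library(microbenchmark)

# Hypothetical helper: stack `n_copies` copies of airquality and time both calls.
bench_at <- function(n_copies, times = 20) {
  air <- bind_rows(replicate(n_copies, airquality, simplify = FALSE))
  microbenchmark(
    subset = subset(air, Temp > 80 & Month > 5),
    filter = filter(air, Temp > 80 & Month > 5),
    times = times
  )
}

# 10 copies of airquality = 1,530 rows
res <- bench_at(10)
summary(res)[, c("expr", "median")]
```

Naming the expressions (subset = ..., filter = ...) gives short labels in the expr column, and comparing the medians is usually more robust than comparing the means, which are sensitive to a few slow runs.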


answered Sep 28 '22 19:09

Benjamin

One additional difference not yet mentioned is that filter discards row names, while subset doesn't:

```r
filter(mtcars, gear == 5)
#    mpg cyl  disp  hp drat    wt qsec vs am gear carb
# 1 26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
# 2 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
# 3 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
# 4 19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
# 5 15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8

subset(mtcars, gear == 5)
#                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
# Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
# Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
# Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
# Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
# Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
```
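If the row names matter, one option is to move them into an ordinary column before filtering, so the result does not depend on how filter treats row names (which may vary between dplyr versions). The sketch below assumes the tibble package is installed; the column name "model" is an arbitrary choice for this example.

```r
library(dplyr)
library(tibble)

# "model" is an arbitrary column name chosen for this example.
fast5 <- mtcars %>%
  rownames_to_column(var = "model") %>%
  filter(gear == 5)

# The car names now travel with the rows as regular data.
fast5$model
```

Keeping labels in a column also carries over to tibbles and database tables, which have no row names at all.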

answered Sep 28 '22 19:09

rsmith54
