<h3>Overview</h3> I'm relatively familiar with <code>data.table</code>, not so much with <code>dplyr</code>. I've read through some <code>dplyr</code> vignettes and examples that have popped up on SO, and so far my conclusions are that: <ol> <li> <code>data.table</code> and <code>dplyr</code> are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)</li> <li> <code>dplyr</code> has more accessible syntax</li> <li> <code>dplyr</code> abstracts (or will) potential DB interactions</li> <li>There are some minor functionality differences (see "Examples/Usage" below)</li> </ol> In my mind 2. doesn't bear much weight because I am fairly familiar with <code>data.table</code>, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question, asked from the perspective of someone already familiar with <code>data.table</code>. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here). <h3>Question</h3> What I want to know is: <ol> <li>Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?</li> <li>Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?</li> </ol> One recent SO question got me thinking about this a bit more, because up until that point I didn't think <code>dplyr</code> would offer much beyond what I can already do in <code>data.table</code>. 
Here is the <code>dplyr</code> solution (data at end of Q): <pre class="prettyprint"><code>dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))
</code></pre> Which was much better than my hack attempt at a <code>data.table</code> solution. That said, good <code>data.table</code> solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution): <pre class="prettyprint"><code>setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
  by=list(id, job)
]
</code></pre> The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to <code>data.table</code> (i.e. doesn't use some of the more esoteric tricks). Ideally what I'd like to see is some good examples where the <code>dplyr</code> or <code>data.table</code> way is substantially more concise or performs substantially better. <h3>Examples</h3> Usage <ul> <li> <code>dplyr</code> does not allow grouped operations that return an arbitrary number of rows (from eddi's question, note: this looks like it will be implemented in dplyr 0.5, also, @beginneR shows a potential work-around using <code>do</code> in the answer to @eddi's question).</li> <li> <code>data.table</code> supports rolling joins (thanks @dholstius) as well as overlap joins </li> <li> <code>data.table</code> internally optimises expressions of the form <code>DT[col == value]</code> or <code>DT[col %in% values]</code> for speed through automatic indexing which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.</li> <li> <code>dplyr</code> offers standard evaluation versions of functions (e.g. 
<code>regroup</code>, <code>summarize_each_</code>) that can simplify the programmatic use of <code>dplyr</code> (note programmatic use of <code>data.table</code> is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)</li> </ul> Benchmarks <ul> <li>I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K) at which point <code>data.table</code> becomes substantially faster.</li> <li>@Arun ran some benchmarks on joins, showing that <code>data.table</code> scales better than <code>dplyr</code> as the number of groups increases (updated with recent enhancements in both packages and recent version of R). Also, a benchmark when trying to get unique values has <code>data.table</code> ~6x faster.</li> <li>(Unverified) has <code>data.table</code> 75% faster on larger versions of a group/apply/sort while <code>dplyr</code> was 40% faster on the smaller ones (another SO question from comments, thanks danas).</li> <li>Matt, the main author of <code>data.table</code>, has benchmarked grouping operations on <code>data.table</code>, <code>dplyr</code> and python <code>pandas</code> on up to 2 billion rows (~100GB in RAM).</li> <li>An older benchmark on 80K groups has <code>data.table</code> ~8x faster</li> </ul> <h3>Data</h3> This is for the first example I showed in the question section. 
<pre class="prettyprint"><code>dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", "Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", "name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, -16L)) </code></pre>
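For anyone running the question's example today: <code>%.%</code> was later removed from dplyr in favour of <code>%>%</code>, so the sketch below restates both solutions under that assumption. The <code>.I</code>-based data.table variant is a substitute for the single-statement version in the question, not the original answerers' code, and the names <code>res_dplyr</code>, <code>res_dt</code> and <code>idx</code> are purely illustrative.

```r
library(data.table)
library(dplyr)

# Same 16 rows as the `dat` object in the question, built more compactly.
dat <- data.frame(
  id   = rep(1:2, each = 8),
  name = rep(c("Jane", "Bob"), each = 8),
  year = c(1980:1987, 1985:1992),
  job  = c(rep("Manager", 6), rep("Boss", 2), rep("Manager", 3), rep("Boss", 5)),
  job2 = c(rep(1L, 6), rep(0L, 2), rep(1L, 3), rep(0L, 5)),
  stringsAsFactors = FALSE
)

# dplyr: keep all non-Boss rows plus the first Boss year per (name, job),
# then accumulate job2 within each group.
res_dplyr <- dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2)) %>%
  ungroup()

# data.table: collect the row numbers to keep per group via .I,
# subset once, then add the cumulative column by reference.
DT  <- as.data.table(dat)
idx <- DT[, .I[job != "Boss" | year == min(year)], by = .(name, job)]$V1
res_dt <- DT[idx][, cumu_job2 := cumsum(job2), by = .(name, job)]
```

Both results keep 11 of the 16 rows and carry the same <code>cumu_job2</code> column, so the comparison in the question is really about how the grouping and filtering are spelled, not about what is computed.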

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): <code>Speed</code>, <code>Memory usage</code>, <code>Syntax</code> and <code>Features</code>. My intent is to cover each one of these as clearly as possible from data.table perspective. <blockquote> Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp. </blockquote> <hr> The data.table syntax is consistent in its form - <code>DT[i, j, by]</code>. To keep <code>i</code>, <code>j</code> and <code>by</code> together is by design. By keeping related operations together, it allows to easily optimise operations for speed and more importantly memory usage, and also provide some powerful features, all while maintaining the consistency in syntax. <h3>1. Speed</h3> Quite a few benchmarks (though mostly on grouping operations) have been added to the question already showing data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares <code>pandas</code>. See also updated benchmarks, which include <code>Spark</code> and <code>pydatatable</code> as well. On benchmarks, it would be great to cover these remaining aspects as well: <ul> <li> Grouping operations involving a subset of rows - i.e., <code>DT[x > val, sum(y), by = z]</code> type operations. </li> <li> Benchmark other operations such as update and joins. </li> <li> Also benchmark memory footprint for each operation in addition to runtime. </li> </ul> <h3>2. Memory usage</h3> <ol> <li> Operations involving <code>filter()</code> or <code>slice()</code> in dplyr can be memory inefficient (on both data.frames and data.tables). See this post. 
<blockquote> Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory. </blockquote> </li> <li> data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable). <pre class="prettyprint"><code># sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]
</code></pre> But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned): <pre class="prettyprint"><code># copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
</code></pre> A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this posts for interesting cases. And we want to keep it. Therefore we are working towards exporting <code>shallow()</code> function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do: <pre class="prettyprint"><code>foo <- function(DT) {
    DT = shallow(DT)          ## shallow copy DT
    DT[, newcol := 1L]        ## does not affect the original DT
    DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
    DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will
                              ## also get modified.
}
</code></pre> By not using <code>shallow()</code>, the old functionality is retained: <pre class="prettyprint"><code>bar <- function(DT) {
    DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
    DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
}
</code></pre> By creating a shallow copy using <code>shallow()</code>, we understand that you don't want to modify the original object. 
We take care of everything internally to ensure that, while copying the columns you modify only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities. <blockquote> Also, once <code>shallow()</code> is exported dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables. </blockquote> <blockquote> But it will still lack many features that data.table provides, including (sub)-assignment by reference. </blockquote> </li> <li> Aggregate while joining: Suppose you have two data.tables as follows: <pre class="prettyprint"><code>DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
#    x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
#    x y mul
# 1: 1 a   4
# 2: 2 b   3
</code></pre> And you would like to get <code>sum(z) * mul</code> for each row in <code>DT2</code> while joining by columns <code>x,y</code>. We can either: <ol> <li> aggregate <code>DT1</code> to get <code>sum(z)</code>, then perform a join, then multiply: <pre class="prettyprint"><code># data.table way
DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]

# dplyr equivalent
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
    right_join(DF2) %>% mutate(z = z * mul)
</code></pre> </li> <li> or do it all in one go (using <code>by = .EACHI</code> feature): <pre class="prettyprint"><code>DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
</code></pre> </li> </ol> What is the advantage? <ul> <li> We don't have to allocate memory for the intermediate result. </li> <li> We don't have to group/hash twice (one for aggregation and other for joining). </li> <li> And more importantly, the operation we wanted to perform is clear by looking at <code>j</code> in (2). 
</li> </ul> Check this post for a detailed explanation of <code>by = .EACHI</code>. No intermediate results are materialised, and the join+aggregate is performed all in one go. Have a look at this, this and this posts for real usage scenarios. In <code>dplyr</code> you would have to join and aggregate or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed). </li> <li> Update and joins: Consider the data.table code shown below: <pre class="prettyprint"><code>DT1[DT2, col := i.mul]
</code></pre> adds/updates <code>DT1</code>'s column <code>col</code> with <code>mul</code> from <code>DT2</code> on those rows where <code>DT2</code>'s key column matches <code>DT1</code>. I don't think there is an exact equivalent of this operation in <code>dplyr</code>, i.e., one that avoids a <code>*_join</code> operation, which would have to copy the entire <code>DT1</code> just to add a new column to it, which is unnecessary. Check this post for a real usage scenario. </li> </ol> <blockquote> To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds! </blockquote> <h3>3. Syntax</h3> Let's now look at syntax. Hadley commented here: <blockquote> Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ... </blockquote> I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side. We will work with the dummy data shown below: <pre class="prettyprint"><code>DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
</code></pre> <ol> <li> Basic aggregation/update operations. 
<pre class="prettyprint"><code># case (a)
DT[, sum(y), by = z]                       ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y[x > 2])))

# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
</code></pre> <ul> <li> data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a). </li> <li> In case (b), we had to use <code>filter()</code> in dplyr while summarising. But while updating, we had to move the logic inside <code>mutate()</code>. In data.table however, we express both operations with the same logic - operate on rows where <code>x > 2</code>, but in the first case, get <code>sum(y)</code>, whereas in the second case update those rows for <code>y</code> with its cumulative sum. This is what we mean when we say the <code>DT[i, j, by]</code> form is consistent. </li> <li> Similarly in case (c), when we have an <code>if-else</code> condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the <code>if</code> condition is satisfied and skip otherwise, we cannot use <code>summarise()</code> directly (AFAICT). We have to <code>filter()</code> first and then summarise because <code>summarise()</code> always expects a single value. While it returns the same result, using <code>filter()</code> here makes the actual operation less obvious. 
It might very well be possible to use <code>filter()</code> in the first case as well (does not seem obvious to me), but my point is that we should not have to. </li> </ul> </li> <li> Aggregation / update on multiple columns <pre class="prettyprint"><code># case (a)
DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(n(), sum))
</code></pre> <ul> <li> In case (a), the code is more or less equivalent. data.table uses the familiar base function <code>lapply()</code>, whereas <code>dplyr</code> introduces <code>*_each()</code> along with a bunch of functions to <code>funs()</code>. </li> <li> data.table's <code>:=</code> requires column names to be provided, whereas dplyr generates them automatically. </li> <li> In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list. </li> <li> In case (c) though, dplyr would return <code>n()</code> once per column, instead of just once. In data.table, all we need to do is to return a list in <code>j</code>. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function <code>c()</code> to concatenate <code>.N</code> to a <code>list</code> which returns a <code>list</code>. </li> </ul> <blockquote> Note: Once again, in data.table, all we need to do is return a list in <code>j</code>. Each element of the list will become a column in result. You can use <code>c()</code>, <code>as.list()</code>, <code>lapply()</code>, <code>list()</code> etc... base functions to accomplish this, without having to learn any new functions. 
</blockquote> <blockquote> You will need to learn just the special variables - <code>.N</code> and <code>.SD</code> at least. The equivalents in dplyr are <code>n()</code> and <code>.</code> </blockquote> </li> <li> Joins. dplyr provides separate functions for each type of join whereas data.table allows joins using the same syntax <code>DT[i, j, by]</code> (and with reason). It also provides an equivalent <code>merge.data.table()</code> function as an alternative. <pre class="prettyprint"><code>setkey(DT1, x, y)

# 1. normal join
DT1[DT2]            ## data.table syntax
left_join(DT2, DT1) ## dplyr syntax

# 2. select columns while join
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

# 3. aggregate while join
DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
    inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)

# 4. update while join
DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
## dplyr: ??

# 5. rolling join
DT1[DT2, roll = -Inf]
## dplyr: ??

# 6. other arguments to control output
DT1[DT2, mult = "first"]
## dplyr: ??
</code></pre> </li> </ol> <ul> <li> Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc), whereas others might like data.table's <code>DT[i, j, by]</code>, or <code>merge()</code> which is similar to base R. </li> <li> However dplyr joins do just that. Nothing more. Nothing less. </li> <li> data.tables can select columns while joining (2), and in dplyr you will need to <code>select()</code> first on both data.frames before joining as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later and that is inefficient. </li> <li> data.tables can aggregate while joining using <code>by = .EACHI</code> feature (3) and also update while joining (4). Why materialise the entire join result to add/update just a few columns? </li> <li> data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest. 
</li> <li> data.table also has <code>mult =</code> argument which selects first, last or all matches (6). </li> <li> data.table has <code>allow.cartesian = TRUE</code> argument to protect from accidental invalid joins. </li> </ul> <blockquote> Once again, the syntax is consistent with <code>DT[i, j, by]</code> with additional arguments allowing for controlling the output further. </blockquote> <ol start="4"> <li> <code>do()</code>... dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to <code>do()</code>. You have to know beforehand what all your functions return. <pre class="prettyprint"><code>DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
DT[, list(x[1:2], y[1]), by = z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

DT[, quantile(x, 0.25), by = z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
DT[, quantile(x, c(0.25, 0.75)), by = z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

DT[, as.list(summary(x)), by = z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
</code></pre> </li> </ol> <ul> <li> <code>.SD</code>'s equivalent is <code>.</code> </li> <li> In data.table, you can throw pretty much anything in <code>j</code> - the only thing to remember is for it to return a list so that each element of the list gets converted to a column. </li> <li> In dplyr, you cannot do that. You have to resort to <code>do()</code> depending on how sure you are as to whether your function would always return a single value. And it is quite slow. </li> </ul> <blockquote> Once again, data.table's syntax is consistent with <code>DT[i, j, by]</code>. We can just keep throwing expressions in <code>j</code> without having to worry about these things. </blockquote> Have a look at this SO question and this one. 
I wonder if it would be possible to express the answer as straightforward using dplyr's syntax... <blockquote> To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it. </blockquote> <blockquote> data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here. </blockquote> <blockquote> But one should also consider the number of features that dplyr lacks in comparison to data.table. </blockquote> <h3>4. Features</h3> I have pointed out most of the features here and also in this post. In addition: <ul> <li> fread - fast file reader has been available for a long time now. </li> <li> fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments. </li> <li> Automatic indexing - another handy feature to optimise base R syntax as is, internally. </li> <li> Ad-hoc grouping: <code>dplyr</code> automatically sorts the results by grouping variables during <code>summarise()</code>, which may not be always desirable. </li> <li> Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above. </li> <li> Non-equi joins: Allows joins using other operators <code><=, <, >, >=</code> along with all other advantages of data.table joins. </li> <li> Overlapping range joins was implemented in data.table recently. Check this post for an overview with benchmarks. 
</li> <li> <code>setorder()</code> function in data.table that allows really fast reordering of data.tables by reference. </li> <li> dplyr provides interface to databases using the same syntax, which data.table does not at the moment. </li> <li> <code>data.table</code> provides faster equivalents of set operations (written by Jan Gorecki) - <code>fsetdiff</code>, <code>fintersect</code>, <code>funion</code> and <code>fsetequal</code> with additional <code>all</code> argument (as in SQL). </li> <li> data.table loads cleanly with no masking warnings and has a mechanism described here for <code>[.data.frame</code> compatibility when passed to any R package. dplyr changes base functions <code>filter</code>, <code>lag</code> and <code>[</code> which can cause problems; e.g. here and here. </li> </ul> <hr> Finally: <ul> <li> On databases - there is no reason why data.table cannot provide similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature.. not sure. </li> <li> On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe). <ul> <li>Progress is being made currently (in v1.9.7 devel) towards parallelising known time consuming parts for incremental performance gains using <code>OpenMP</code>.</li> </ul> </li> </ul>
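The "aggregate while joining" and "update while joining" points above can be tried end to end with the sketch below. It uses the same shape of data as the <code>DT1</code>/<code>DT2</code> example, but assumes a data.table version that supports the <code>on =</code> argument (so no keys are set) and spells out the recycling of <code>y</code>; <code>agg</code>, <code>two_step</code> and <code>one_step</code> are illustrative names, not part of the original answer.

```r
library(data.table)

# Same shape as the DT1/DT2 example above, with 'y' recycled explicitly.
DT1 <- data.table(x = rep(1:2, each = 4),
                  y = rep(c("a", "a", "b", "b"), times = 2),
                  z = 1:8)
DT2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3)

# Two steps: aggregate first (materialising 'agg'), then join and multiply.
agg      <- DT1[, .(z = sum(z)), by = .(x, y)]
two_step <- agg[DT2, on = .(x, y)][, .(x, y, z = z * mul)]

# One step: for each row of DT2, aggregate the matching rows of DT1.
one_step <- DT1[DT2, .(z = sum(z) * i.mul), by = .EACHI, on = .(x, y)]

# Update while joining: add 'col' to DT1 in place; unmatched rows get NA.
DT1[DT2, col := i.mul, on = .(x, y)]
```

Both routes give the same <code>z</code> per row of <code>DT2</code>; the difference is that the one-step form never builds the intermediate <code>agg</code> table, and the update form never builds a joined copy of <code>DT1</code>.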

Here's my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities). <h3>Syntax</h3> There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem! One really important thing that dplyr does for you is that it constrains your options. I claim that most single table problems can be solved with just five key verbs filter, select, mutate, arrange and summarise, along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs is mapped to a single function. Each function does one job, and is easy to understand in isolation. You create complexity by piping these simple operations together with <code>%>%</code>. Here's an example from one of the posts Arun <a href="https://stackoverflow.com/questions/27511604/">linked to</a>: <pre class="prettyprint"><code>diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))
</code></pre> Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than <code>[</code>, but I think that can be largely mitigated by better autocomplete. Here's the equivalent data.table code: <pre class="prettyprint"><code>diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair",
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ),
  by = cut
][
  order(-Count)
]
</code></pre> It's harder to follow this code unless you're already familiar with data.table. 
(I also couldn't figure out how to indent the repeated <code>[</code> in a way that looks good to my eye). Personally, when I look at code I wrote 6 months ago, it's like looking at code written by a stranger, so I've come to prefer straightforward, if verbose, code. Two other minor factors that I think slightly decrease readability: <ul> <li>Since almost every data table operation uses <code>[</code> you need additional context to figure out what's happening. For example, is <code>x[y]</code> joining two data tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening.</li> <li>I like that <code>group_by()</code> is a separate operation in dplyr. It fundamentally changes the computation so I think it should be obvious when skimming the code, and it's easier to spot <code>group_by()</code> than the <code>by</code> argument to <code>[.data.table</code>.</li> </ul> I also like that the pipe isn't limited to just one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with <code>%>%</code>: <pre class="prettyprint"><code>diamonds %>%
  data.table() %>%
  .[cut != "Fair",
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ),
    by = cut
  ] %>%
  .[order(-Count)]
</code></pre> And the idea of piping with <code>%>%</code> is not limited to just data frames and is easily generalised to other contexts: <a href="http://rstudio.github.io/dygraphs/" rel="noreferrer">interactive web graphics</a>, <a href="https://github.com/hadley/rvest" rel="noreferrer">web scraping</a>, gists, <a href="https://github.com/smbache/ensurer" rel="noreferrer">run-time contracts</a>, ...
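Since <code>diamonds</code> ships with ggplot2, the same side-by-side comparison can be tried without that package on R's built-in <code>mtcars</code>. This is only a minimal sketch of the filter / group / summarise / order pattern discussed above; the column names <code>AvgMpg</code> and <code>Count</code> and the choice of <code>cyl</code> as grouping variable are illustrative assumptions, not part of the original example.

```r
library(data.table)
library(dplyr)

# dplyr: one verb per step, chained with the pipe.
res_dplyr <- mtcars %>%
  filter(cyl != 4) %>%
  group_by(cyl) %>%
  summarise(AvgMpg = mean(mpg), Count = n()) %>%
  arrange(desc(Count))

# data.table: filter (i), summarise (j) and group (by) in one call,
# then order the result in a chained second call.
res_dt <- as.data.table(mtcars)[
  cyl != 4,
  .(AvgMpg = mean(mpg), Count = .N),
  by = cyl
][order(-Count)]
```

The two results contain the same groups, counts and averages; which spelling reads better is exactly the subjective point being argued in this section.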
<h3>Memory and performance</h3> I've lumped these together, because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is sufficiently fast for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data. The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend. All that said, dplyr performance will get better in the long-term. We'll definitely implement some of the great ideas of data.table like radix ordering and using the same index for joins & filters. We're also working on parallelisation so we can take advantage of multiple cores. <h3>Features</h3> A few things that we're planning to work on in 2015: <ul> <li>the <code>readr</code> package, to make it easy to get files off disk and into memory, analogous to <code>fread()</code>.</li> <li>More flexible joins, including support for non-equi-joins.</li> <li>More flexible grouping like bootstrap samples, rollups and more</li> </ul> I'm also investing time into improving R's <a href="https://github.com/rstats-db/DBI" rel="noreferrer">database connectors</a>, the ability to talk to web APIs, and making it easier to scrape HTML pages.

data.table vs dplyr: can one do something well the other can't or does poorly?

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
dplyr has more accessible syntax
dplyr abstracts (or will) potential DB interactions
There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with it data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested about here).

Question

What I want to know is:

Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%   group_by(name, job) %.%   filter(job != "Boss" | year == min(year)) %.%   mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):

    setDT(dat)[,
      .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
      by = list(id, job)
    ]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).
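For readers on a current dplyr, where the old %.% operator has been replaced by %>%, the two approaches can be checked side by side. This is only a sketch under stated assumptions: it swaps in %>%, rebuilds the question's 16 rows compactly with rep(), and on the data.table side uses the .I row-index idiom in two steps rather than the single statement quoted above.

```r
library(data.table)
library(dplyr)

# Same 16 rows as the data at the end of the question, built compactly.
dat <- data.frame(
  id   = rep(1:2, each = 8),
  name = rep(c("Jane", "Bob"), each = 8),
  year = c(1980:1987, 1985:1992),
  job  = c(rep("Manager", 6), rep("Boss", 2), rep("Manager", 3), rep("Boss", 5)),
  job2 = c(rep(1L, 6), rep(0L, 2), rep(1L, 3), rep(0L, 5)),
  stringsAsFactors = FALSE
)

# dplyr: keep non-Boss rows plus the first Boss year per group, then accumulate.
res_dplyr <- dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2)) %>%
  ungroup()

# data.table: collect the row numbers to keep per group, subset, then add the column.
DT <- as.data.table(dat)
keep <- DT[, .I[job != "Boss" | year == min(year)], by = .(id, job)]$V1
res_dt <- DT[keep][, cumu_job2 := cumsum(job2), by = .(id, job)]
```

Both results keep 11 of the 16 rows and carry the same cumu_job2 column, so the comparison between the two is about how the code reads rather than what it returns.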

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage

dplyr does not allow grouped operations that return arbitrary number of rows (from eddi's question, note: this looks like it will be implemented in dplyr 0.5, also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
data.table supports rolling joins (thanks @dholstius) as well as overlap joins
data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note programmatic use of data.table is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)

Benchmarks

I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K) at which point data.table becomes substantially faster.
@Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark when trying to get unique values has data.table ~6x faster.
(Unverified) has data.table 75% faster on larger versions of a group/apply/sort while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).
Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
An older benchmark on 80K groups has data.table ~8x faster

Data

This is for the first example I showed in the question section.

    dat <- structure(list(
      id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
      name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane",
               "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"),
      year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L,
               1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L),
      job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager",
              "Boss", "Boss", "Manager", "Manager", "Manager",
              "Boss", "Boss", "Boss", "Boss", "Boss"),
      job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)),
      .Names = c("id", "name", "year", "job", "job2"),
      class = "data.frame", row.names = c(NA, -16L))

asked Jan 29 '14 15:01

BrodieG

2 Answers

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from the data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.

The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. Keeping related operations together makes it easy to optimise operations for speed and, more importantly, memory usage, and also to provide some powerful features, all while maintaining consistency in syntax.

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already showing data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares pandas. See also updated benchmarks, which include Spark and pydatatable as well.

On benchmarks, it would be great to cover these remaining aspects as well:

Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.
Benchmark other operations such as update and joins.
Also benchmark memory footprint for each operation in addition to runtime.

2. Memory usage

Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

Note that Hadley's comment talks about speed (that dplyr is plentiful fast for him), whereas the major concern here is memory.
data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).
```
# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]
```
But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):
```
# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
```
A concern for this is referential transparency. Updating a data.table object by reference, especially within a function may not be always desirable. But this is an incredibly useful feature: see this and this posts for interesting cases. And we want to keep it.

Therefore we are working towards exporting shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:
```
foo <- function(DT) {
    DT = shallow(DT)          ## shallow copy DT
    DT[, newcol := 1L]        ## does not affect the original DT
    DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
    DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will
                              ## also get modified.
}
```
By not using shallow(), the old functionality is retained:
```
bar <- function(DT) {
    DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
    DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
}
```
By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that, while also copying the columns you modify only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.
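The by-reference behaviour and the referential transparency concern can be seen in a small sketch. It uses copy(), the deep-copy function data.table already exports, as a heavier stand-in; it is not the shallow() described above. The helpers tag_in_place() and tag_on_copy() are hypothetical names used only for illustration.

```r
library(data.table)

DT <- data.table(x = 0:3, y = 1:4)
addr_before <- address(DT)

# Sub-assign by reference: no re-assignment, and DT stays at the same address.
DT[x >= 1L, y := NA]
same_object <- identical(address(DT), addr_before)

# A function that uses := on its argument also changes the caller's table ...
tag_in_place <- function(d) {
  d[, tagged := TRUE]
  invisible(d)
}
tag_in_place(DT)

# ... unless it works on an explicit deep copy first.
tag_on_copy <- function(d) {
  d <- copy(d)
  d[, tagged_copy := TRUE]
  d
}
out <- tag_on_copy(DT)
```

After running this, DT has gained the tagged column but not tagged_copy, which exists only in out. The cost of copy() is that every column is duplicated, which is the overhead a shallow copy is meant to avoid.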

Also, once shallow() is exported dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

But it will still lack many features that data.table provides, including (sub)-assignment by reference.
Aggregate while joining:

Suppose you have two data.tables as follows:
```
DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
#    x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
#    x y mul
# 1: 1 a   4
# 2: 2 b   3
```
And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:
1. (a) aggregate DT1 to get sum(z), (b) perform a join and (c) multiply (or)
    
    data.table way
    
    DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
    
    dplyr equivalent
    
    DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% right_join(DF2) %>% mutate(z = z * mul)
2. do it all in one go (using by = .EACHI feature):
    
    DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
What is the advantage?
- We don't have to allocate memory for the intermediate result.
- We don't have to group/hash twice (one for aggregation and other for joining).
- And more importantly, the operation we wanted to perform is clear by looking at j in (2).
Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

Have a look at this, this and this posts for real usage scenarios.

In dplyr you would have to join and aggregate or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).
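The two routes can be compared directly on the small tables above. This is a sketch rather than a benchmark; the one deliberate change is that x is created as integer in both tables so the key columns share a type.

```r
library(data.table)

# x is created as integer in both tables so the key columns share a type.
DT1 <- data.table(x = rep(1:2, each = 4), y = rep(c("a", "a", "b", "b"), 2),
                  z = 1:8, key = c("x", "y"))
DT2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3, key = c("x", "y"))

# Two steps: aggregate DT1 first, then join to DT2 and multiply.
two_step <- DT1[, .(z = sum(z)), keyby = .(x, y)][DT2][, z := z * mul][]

# One step: j is evaluated once per row of DT2, on the DT1 rows it matches.
one_step <- DT1[DT2, .(z = sum(z) * mul), by = .EACHI]
```

Both give z = 12 for (x = 1, y = "a") and z = 45 for (x = 2, y = "b"). The difference is that two_step first builds a four-row aggregate of every group in DT1, while one_step only ever touches the groups that DT2 asks for.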
Update and joins:

Consider the data.table code shown below:
```
 DT1[DT2, col := i.mul] 
```
adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., one that avoids a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.

Check this post for a real usage scenario.
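A minimal sketch of the update join, reusing the two small tables from the aggregate-while-joining example (with x created as integer in both, which is my simplification):

```r
library(data.table)

DT1 <- data.table(x = rep(1:2, each = 4), y = rep(c("a", "a", "b", "b"), 2),
                  z = 1:8, key = c("x", "y"))
DT2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3, key = c("x", "y"))

# Adds 'col' to DT1 in place; rows of DT1 with no match in DT2 get NA.
DT1[DT2, col := i.mul]
```

DT1 still has its original 8 rows. The new column holds 4 on the two (1, "a") rows, 3 on the two (2, "b") rows, and NA elsewhere, and no joined copy of DT1 is materialised along the way.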

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

    DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
    DF = as.data.frame(DT)

Basic aggregation/update operations.
```
# case (a)
DT[, sum(y), by = z]                       ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y[x > 2])))

# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
```
- data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).
- In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum.
 
 This is what we mean when we say the DT[i, j, by] form is consistent.
- Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise because summarise() always expects a single value.
 
 While it returns the same result, using filter() here makes the actual operation less obvious.
 
 It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.
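The conditional grouped update from case (b) can be checked with a short sketch on the dummy data. One detail I am assuming matters here: in the dplyr version the row condition has to appear on both sides of replace(), so the replacement vector is cumsum(y[x > 2]) and has the same length as the rows being replaced.

```r
library(data.table)
library(dplyr)

DF <- data.frame(x = 1:10, y = 11:20, z = rep(1:2, each = 5))
DT <- as.data.table(DF)

# data.table: i selects the rows, j updates them, by groups them.
DT[x > 2, y := cumsum(y), by = z]

# dplyr: the row condition moves inside mutate(), on both sides of replace().
ans <- DF %>%
  group_by(z) %>%
  mutate(y = replace(y, x > 2, cumsum(y[x > 2]))) %>%
  ungroup()
```

Both leave the first two rows untouched (11, 12) and restart the running sum within each group of z on the remaining rows, but the data.table form states the row filter once while the dplyr form repeats it.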
Aggregation / update on multiple columns
```
# case (a)
DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
```
- In case (a), the codes are more or less equivalent. data.table uses familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().
- data.table's := requires column names to be provided, whereas dplyr generates it automatically.
- In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.
- In case (c) though, dplyr would return n() once per column, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list, which returns a list.
Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

You will need to learn just the special variables - .N and .SD at least. The equivalents in dplyr are n() and . (the dot placeholder).
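On current dplyr versions the summarise_each() and funs() helpers shown above are deprecated in favour of across(). A hedged sketch of case (c) in both packages, assuming dplyr 1.0.0 or later; the column name n is my own choice:

```r
library(data.table)
library(dplyr)

DF <- data.frame(x = 1:10, y = 11:20, z = rep(1:2, each = 5))
DT <- as.data.table(DF)

# data.table: name the count explicitly and concatenate it with the per-column sums.
res_dt <- DT[, c(list(n = .N), lapply(.SD, sum)), by = z]

# dplyr >= 1.0.0: across() applies sum() to the listed columns; n() is added once.
res_dplyr <- DF %>%
  group_by(z) %>%
  summarise(across(c(x, y), sum), n = n())
```

Both return one row per group with a single count column next to the sums of x and y. In the dplyr call, n = n() is placed after across() so that the count is not itself swept up by a later column selection.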

Joins

dplyr provides separate functions for each type of join whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

    setkey(DT1, x, y)

    # 1. normal join
    DT1[DT2]            ## data.table syntax
    left_join(DT2, DT1) ## dplyr syntax

    # 2. select columns while join
    DT1[DT2, .(z, i.mul)]
    left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

    # 3. aggregate while join
    DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
    DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
        inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)

    # 4. update while join
    DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
    ??

    # 5. rolling join
    DT1[DT2, roll = -Inf]
    ??

    # 6. other arguments to control output
    DT1[DT2, mult = "first"]
    ??

Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.
However dplyr joins do just that. Nothing more. Nothing less.
data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.
data.tables can aggregate while joining using by = .EACHI feature (3) and also update while joining (4). Why materialize the entire join result to add/update just a few columns?
data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.
data.table also has mult = argument which selects first, last or all matches (6).
data.table has allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.
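To make the rolling join in (5) concrete, here is a small sketch with two hypothetical tables, quotes and trades (the names and values are made up for illustration). Each trade time is matched to a quote time, rolling to a neighbouring quote when there is no exact match.

```r
library(data.table)

quotes <- data.table(time = c(1L, 5L, 10L), price = c(100, 101, 102), key = "time")
trades <- data.table(time = c(3L, 5L, 12L), key = "time")

# roll = TRUE: last observation carried forward (LOCF).
locf <- quotes[trades, roll = TRUE]

# roll = -Inf: next observation carried backward (NOCB).
nocb <- quotes[trades, roll = -Inf]
```

With LOCF the trade at time 3 picks up the quote from time 1 and the trade at time 12 picks up the last quote. With NOCB the trade at time 3 picks up the quote from time 5, and the trade at time 12 gets NA because there is no later quote to roll back from.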

do()...

dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

    DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
    DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
    DT[, list(x[1:2], y[1]), by = z]
    DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

    DT[, quantile(x, 0.25), by = z]
    DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
    DT[, quantile(x, c(0.25, 0.75)), by = z]
    DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

    DT[, as.list(summary(x)), by = z]
    DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))

.SD's equivalent in dplyr is . (the dot placeholder).
In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.
In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are that your function will always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.
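Later dplyr releases have relaxed the single-value restriction somewhat. As a hedged sketch, assuming dplyr 1.1.0 or later where reframe() is available, the two-quantile example can be written without do(); the column name q is my own choice:

```r
library(data.table)
library(dplyr)

DF <- data.frame(x = 1:10, y = 11:20, z = rep(1:2, each = 5))
DT <- as.data.table(DF)

# data.table: j returns two values per group, so each group yields two rows.
res_dt <- DT[, .(q = quantile(x, c(0.25, 0.75))), by = z]

# dplyr >= 1.1.0: reframe() accepts results of any length per group.
res_dplyr <- DF %>%
  group_by(z) %>%
  reframe(q = quantile(x, c(0.25, 0.75)))
```

Both produce four rows, two per group. The data.table call is the same shape as the single-quantile version, whereas in dplyr the verb still changes (summarise() to reframe()) with the length of the result.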

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforward using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

fread - fast file reader has been available for a long time now.
fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.
Automatic indexing - another handy feature to optimise base R syntax as is, internally.
Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not be always desirable.
Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.
Non-equi joins: Allows joins using other operators <=, <, >, >= along with all other advantages of data.table joins.
Overlapping range joins was implemented in data.table recently. Check this post for an overview with benchmarks.
setorder() function in data.table that allows really fast reordering of data.tables by reference.
dplyr provides interface to databases using the same syntax, which data.table does not at the moment.
data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal with additional all argument (as in SQL).
data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.
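As an illustration of the non-equi joins listed above, here is a minimal sketch that counts how many events fall inside each window. The tables events and windows and their columns are hypothetical, chosen only for the example.

```r
library(data.table)

events  <- data.table(t = c(1L, 4L, 6L, 9L, 10L))
windows <- data.table(lo = c(0L, 5L), hi = c(4L, 10L))

# Count, for each window, the events with lo <= t <= hi.
counts <- events[windows, .N, by = .EACHI, on = .(t >= lo, t <= hi)]
```

The first window (0 to 4) matches two events and the second (5 to 10) matches three. Because the join conditions sit in on = and the aggregation in j, this combines with by = .EACHI in the same way as the equi-join examples earlier.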

Finally:

On databases - there is no reason why data.table cannot provide similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature.. not sure.
On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).
- Progress is being made currently (in v1.9.7 devel) towards parallelising known time consuming parts for incremental performance gains using OpenMP.

answered Sep 28 '22 10:09

29 revs, 6 users 93%

Here's my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).

Syntax

There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!

One really important thing that dplyr does for you is that it constrains your options. I claim that most single table problems can be solved with just five key verbs filter, select, mutate, arrange and summarise, along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs is mapped to a single function. Each function does one job, and is easy to understand in isolation.

You create complexity by piping these simple operations together with %>%. Here's an example from one of the posts Arun linked to:

    diamonds %>%
      filter(cut != "Fair") %>%
      group_by(cut) %>%
      summarize(
        AvgPrice = mean(price),
        MedianPrice = as.numeric(median(price)),
        Count = n()
      ) %>%
      arrange(desc(Count))

Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can be largely mitigated by better autocomplete.

Here's the equivalent data.table code:

    diamondsDT <- data.table(diamonds)
    diamondsDT[
      cut != "Fair",
      .(AvgPrice = mean(price),
        MedianPrice = as.numeric(median(price)),
        Count = .N
      ),
      by = cut
    ][
      order(-Count)
    ]

It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated [ in a way that looks good to my eye). Personally, when I look at code I wrote 6 months ago, it's like looking at a code written by a stranger, so I've come to prefer straightforward, if verbose, code.

Two other minor factors that I think slightly decrease readability:

Since almost every data table operation uses [ you need additional context to figure out what's happening. For example, is x[y] joining two data tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening.
I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation so I think should be obvious when skimming the code, and it's easier to spot group_by() than the by argument to [.data.table.

I also like that the pipe isn't limited to just one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with %>%:

    diamonds %>%
      data.table() %>%
      .[cut != "Fair",
        .(AvgPrice = mean(price),
          MedianPrice = as.numeric(median(price)),
          Count = .N
        ),
        by = cut
      ] %>%
      .[order(-Count)]

And the idea of piping with %>% is not limited to just data frames and is easily generalised to other contexts: interactive web graphics, web scraping, gists, run-time contracts, ...

Memory and performance

I've lumped these together, because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is fast enough for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data.

The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend.

All that said, dplyr performance will get better in the long-term. We'll definitely implement some of the great ideas of data.table like radix ordering and using the same index for joins & filters. We're also working on parallelisation so we can take advantage of multiple cores.

Features

A few things that we're planning to work on in 2015:

the readr package, to make it easy to get files off disk and in to memory, analogous to fread().
More flexible joins, including support for non-equi-joins.
More flexible grouping like bootstrap samples, rollups and more

I'm also investing time into improving R's database connectors, the ability to talk to web apis, and making it easier to scrape html pages.

answered Sep 28 '22 10:09

hadley
