Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Translating dplyr to data.table

So I am trying to translate some dplyr code. I have tried to get help from a package that translates dplyr to data.table but it still does not work. The error is with row_number from dplyr..

I need all the steps in the dplyr code (even though they don't make sense here with mtcars)

library(dplyr)
library(dtplyr) # from https://github.com/tidyverse/dtplyr
library(data.table)

mtcars %>% 
  distinct(mpg, .keep_all = TRUE) %>% 
  group_by(am) %>% 
  arrange(mpg, .by_group = TRUE) %>% 
  mutate(row_num = LETTERS[row_number()]) %>% 
  ungroup() 

# using dtplyr
dt <- lazy_dt(mtcars)

dt %>% 
  distinct(mpg, .keep_all = TRUE) %>% 
  group_by(am) %>% 
  arrange(mpg, .by_group = TRUE) %>% 
  mutate(row_num = LETTERS[row_number()]) %>% 
  ungroup() %>% 
  show_query()
#> unique(`_DT1`, by = "mpg")[order(am, mpg)][, `:=`(row_num = c("A", 
#> "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", 
#> "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z")[row_number()]), 
#>     keyby = .(am)]

# I then use the query from dtplyr 
DT <- as.data.table(mtcars)
unique(DT, by = "mpg")[order(am, mpg)][, `:=`(row_num = c("A", 
                                                              "B", "C", "D", "E", "F", "G", 
                                                              "H", "I", "J", "K", "L", "M", 
                                                              "N", "O", "P", "Q", "R", "S", 
                                                              "T", "U", "V", "W", "X", "Y", 
                                                              "Z")[row_number()]), keyby = .(am)]

#> row_number() should only be called in a data context

Created on 2019-07-14 by the reprex package (v0.3.0)

like image 881
xhr489 Avatar asked Jul 13 '19 23:07

xhr489


People also ask

Does dplyr work with data table?

dtplyr provides a data. table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data. table code.

Is dplyr faster than data table?

table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares pandas .

Is data table faster than Tidyverse?

The tidyverse, for example, emphasizes readability and flexibility, which is great when I need to write scaleable code that others can easily read. data. table, on the other hand, is lightening fast and very concise, so you can develop quickly and run super fast code, even when datasets get fairly large.

Is dplyr faster?

dplyr , with numeric ID vectors, is faster (but less efficient) with ordering + joining.


Video Answer


1 Answers

Might I recommend the rowid function? It does the grouping step "under the hood" you might find it looks cleaner:

unique(DT, by='mpg')[order(am, mpg), row_num := LETTERS[rowid(am)]]

if you love chaining, you could also get everything inside []:

DT[ , .SD[1L], by = mpg
   ][order(am, mpg), row_num := LETTERS[rowid(am)]]
like image 149
MichaelChirico Avatar answered Oct 12 '22 08:10

MichaelChirico