So I am trying to translate some dplyr code. I have tried to get help from a package that translates dplyr to data.table but it still does not work. The error is with row_number
from dplyr
..
I need all the steps in the dplyr
code (even though they don't make sense here with mtcars
)
library(dplyr)
library(dtplyr) # from https://github.com/tidyverse/dtplyr
library(data.table)
mtcars %>%
distinct(mpg, .keep_all = TRUE) %>%
group_by(am) %>%
arrange(mpg, .by_group = TRUE) %>%
mutate(row_num = LETTERS[row_number()]) %>%
ungroup()
# using dtplyr
dt <- lazy_dt(mtcars)
dt %>%
distinct(mpg, .keep_all = TRUE) %>%
group_by(am) %>%
arrange(mpg, .by_group = TRUE) %>%
mutate(row_num = LETTERS[row_number()]) %>%
ungroup() %>%
show_query()
#> unique(`_DT1`, by = "mpg")[order(am, mpg)][, `:=`(row_num = c("A",
#> "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N",
#> "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z")[row_number()]),
#> keyby = .(am)]
# I then use the query from dtplyr
DT <- as.data.table(mtcars)
unique(DT, by = "mpg")[order(am, mpg)][, `:=`(row_num = c("A",
"B", "C", "D", "E", "F", "G",
"H", "I", "J", "K", "L", "M",
"N", "O", "P", "Q", "R", "S",
"T", "U", "V", "W", "X", "Y",
"Z")[row_number()]), keyby = .(am)]
#> row_number() should only be called in a data context
Created on 2019-07-14 by the reprex package (v0.3.0)
dtplyr provides a data. table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data. table code.
table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares pandas .
The tidyverse, for example, emphasizes readability and flexibility, which is great when I need to write scaleable code that others can easily read. data. table, on the other hand, is lightening fast and very concise, so you can develop quickly and run super fast code, even when datasets get fairly large.
dplyr , with numeric ID vectors, is faster (but less efficient) with ordering + joining.
Might I recommend the rowid function? It does the grouping step "under the hood" you might find it looks cleaner:
unique(DT, by='mpg')[order(am, mpg), row_num := LETTERS[rowid(am)]]
if you love chaining, you could also get everything inside []
:
DT[ , .SD[1L], by = mpg
][order(am, mpg), row_num := LETTERS[rowid(am)]]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With