
How can I mutate in dplyr without losing row order?

Using data.table I can do the following:

library(data.table)
dt = data.table(a = 1:2, b = c(1,2,NA,NA))
#   a  b
#1: 1  1
#2: 2  2
#3: 1 NA
#4: 2 NA

dt[, b := b[1], by = a]
#   a b
#1: 1 1
#2: 2 2
#3: 1 1
#4: 2 2

Attempting the same operation in dplyr, however, scrambles the data (it gets sorted by a):

library(dplyr)
dt = data.table(a = 1:2, b = c(1,2,NA,NA))
dt %.% group_by(a) %.% mutate(b = b[1])
#  a b
#1 1 1
#2 1 1
#3 2 2
#4 2 2

(As an aside, the above also sorts the original dt, which is somewhat confusing to me given dplyr's philosophy of not modifying in place. I'm guessing that's a bug in how dplyr interfaces with data.table.)

What's the dplyr way of achieving the above?

eddi asked Feb 12 '14 00:02



1 Answer

In the current development version of dplyr (which will eventually become dplyr 0.2) the behaviour differs between data frames and data tables:

library(dplyr)
library(data.table)

df <- data.frame(a = 1:2, b = c(1,2,NA,NA))
dt <- data.table(df)

df %.% group_by(a) %.% mutate(b = b[1])

## Source: local data frame [4 x 2]
## Groups: a
## 
##   a b
## 1 1 1
## 2 2 2
## 3 1 1
## 4 2 2

dt %.% group_by(a) %.% mutate(b = b[1])

## Source: local data table [4 x 2]
## Groups: a
## 
##   a b
## 1 1 1
## 2 1 1
## 3 2 2
## 4 2 2

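This happens because group_by() applied to a data.table automatically calls setkey(), on the assumption that the index will make future operations faster. setkey() sorts the table by the key column by reference, which is why the rows come back ordered by a and why the original dt is reordered as well.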
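To see the reordering in isolation, here is a minimal sketch that uses only data.table, with no dplyr involved. It assumes the same four-row table as in the question and calls setkey() directly, which is what the answer says group_by() does internally in this development version.

```r
library(data.table)

# Same data as in the question, with the recycling of `a` written out.
dt <- data.table(a = c(1L, 2L, 1L, 2L), b = c(1, 2, NA, NA))

# setkey() sorts the table by `a` by reference: dt itself is reordered,
# no sorted copy is returned.
setkey(dt, a)

dt$a     # rows are now grouped by a: 1 1 2 2
dt$b     # b follows its rows:        1 NA 2 NA
key(dt)  # "a"
```

Because the sort happens by reference, any other name bound to the same table sees the new order too, which matches the aside in the question about the original dt being sorted.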

If there's a strong feeling that this is a bad default, I'm happy to change it.
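Until the default changes, one backend-independent workaround is to record the original row position before grouping and sort by it afterwards. The sketch below is an illustration under assumptions, not something from the dplyr documentation for this version: it uses a plain data frame, the `%>%` pipe from later dplyr releases (the `%.%` operator used above was subsequently deprecated), and a hypothetical helper column named row_id.

```r
library(dplyr)

df <- data.frame(a = rep(1:2, 2), b = c(1, 2, NA, NA))

res <- df %>%
  mutate(row_id = row_number()) %>%  # hypothetical helper: original row position
  group_by(a) %>%
  mutate(b = b[1]) %>%               # fill each group with its first b
  ungroup() %>%
  arrange(row_id) %>%                # restore the original order if it was lost
  select(-row_id)

res$a  # 1 2 1 2
res$b  # 1 2 1 2
```

With a data frame the arrange() step is a no-op, since the grouped mutate already preserves row order there. It only matters for a backend that reorders rows while grouping, as the data.table one does here.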

hadley answered Nov 15 '22 08:11