Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Implicit sorting in tidyr::spread and dplyr::summarise

My data are ordered observations and I want to keep the ordering as much as possible while doing manipulations.

Take the answer for this question, I put "B" ahead of "A" in the dataframe. The resulting wide data are sorted by the column "name", i.e., "A" first, then "B".

df = data.frame(name=c("B","B","A","A"),
                group=c("g1","g2","g1","g2"),
                V1=c(10,40,20,30),
                V2=c(6,3,1,7))

gather(df, Var, Val, V1:V2) %>% 
unite(VarG, Var, group) %>% 
spread(VarG, Val)

  name V1_g1 V1_g2 V2_g1 V2_g2
1    A    20    30     1     7
2    B    10    40     6     3

Is there a way to keep the original ordering? like this:

  name V1_g1 V1_g2 V2_g1 V2_g2
1    B    10    40     6     3
2    A    20    30     1     7

04/02 edit: I've just found the dplyr::summarise does sorting as well. arrange(name, df$name) still works to restore the order. But I wonder if the extra sorting is necessary from the design of the packages?

df %>% 
  group_by(name) %>% 
  summarise(n()) %>% 

  name n()
1    A   2
2    B   2
like image 748
Dong Avatar asked Apr 01 '15 00:04

Dong


2 Answers

You can sort by name based on the order in the original data frame:

gather(df, Var, Val, V1:V2) %>% 
  unite(VarG, Var, group) %>% 
  spread(VarG, Val) %>%
  arrange( order(match(name, df$name)))

#   name V1_g1 V1_g2 V2_g1 V2_g2
# 1    B    10    40     6     3
# 2    A    20    30     1     7
like image 131
bergant Avatar answered Oct 05 '22 11:10

bergant


The order is taken from the order of the factor levels.

str(df)
'data.frame':   4 obs. of  4 variables:
 $ name : Factor w/ 2 levels "A","B": 2 2 1 1
 $ group: Factor w/ 2 levels "g1","g2": 1 2 1 2
 $ V1   : num  10 40 20 30
 $ V2   : num  6 3 1 7

See that the levels are "A","B".

So if you set the order of the levels to the order they are shown in it will work:

df = data.frame(name=c("B","B","A","A"),
                group=c("g1","g2","g1","g2"),
                V1=c(10,40,20,30),
                V2=c(6,3,1,7))

df %>% 
    mutate(name = factor(name,levels=unique(name))) %>% 
    mutate(group = factor(group,levels=unique(group))) %>% 
    gather(Var, Val, V1:V2) %>% 
    unite(VarG, Var, group) %>% 
    spread(VarG, Val)

Results in:

  name V1_g1 V1_g2 V2_g1 V2_g2
1    B    10    40     6     3
2    A    20    30     1     7
like image 32
Jan Stanstrup Avatar answered Oct 05 '22 12:10

Jan Stanstrup