
R tidyverse: spread() with multiple key/value pairs does not collapse rows

I am trying to spread() two key/value pairs, but rows that share the same identifier value do not collapse into one. It may have to do with some of my earlier processing, or, more likely, I do not know the right way to spread two or more key/value pairs to get the result I expect.

I'm starting with this data set:

library(tidyverse)

df <- tibble(order = 1:7,
             line_1 = c(23,8,21,45,68,31,24),
             line_2 = c(63,25,25,24,48,24,63),
             line_3 = c(62,12,10,56,67,25,35))

There are two pre-spread steps that define orderings of the "count" values created by gather(). The first step records the original position of each "count" value within its "order" group, using the row number:

ntrl <- df %>%
           gather(line_1,
                  line_2,
                  line_3,
                  key = "sector",
                  value = "count") %>%
           group_by(order) %>%
           mutate(sector_ord = row_number()) %>%
           arrange(order,
                   sector)

This is the second pre-spread step to define the numerical order of the "count" variable:

ord <- ntrl %>%
            arrange(order,
                    count) %>%
            group_by(order) %>%
            mutate(num_ord = paste0("ord_",
                                    row_number()))

And then finally the spread code that I have been using:

wide <- ord %>%
            group_by(order) %>%
            spread(key = sector,
                   value = count) %>%
            spread(key = num_ord,
                   value = sector_ord)

What I'm getting is this:

    order   line_1  line_2  line_3  ord_1   ord_2   ord_3                           
1   1       23      NA      NA      1       NA      NA
2   1       NA      63      NA      NA      NA      2
3   1       NA      NA      62      NA      3       NA
4   2       8       NA      NA      1       NA      NA
5   2       NA      25      NA      NA      NA      2
6   2       NA      NA      12      NA      3       NA
7   3       21      NA      NA      NA      1       NA
8   3       NA      25      NA      NA      NA      2
9   3       NA      NA      10      3       NA      NA
... and so on through 21 rows, three for each of the 7 "order" values

What I expect is that all rows with the same "order" value collapse into a single row, giving the following:

    order   line_1  line_2  line_3  ord_1   ord_2   ord_3                           
1   1       23      63      62      1       3       2
2   2       8       25      12      1       3       2
3   3       21      25      10      2       3       1
4   4       45      24      56      2       1       3
... and so on, I think that paints the picture

I have reviewed the questions and answers about spreading with duplicate identifiers and the use of the index of row numbers but that does not help.

I figure that it has something to do with the double spreading, but I cannot figure out how to do that correctly.

Thanks for your help.

Austin Overman asked Oct 08 '17 01:10


2 Answers

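A solution using tidyverse, starting from your df. After the two spread() calls each order still occupies three rows, and every column holds exactly one non-NA value per order. The key is to use summarise_all(funs(.[which(!is.na(.))])) to keep that single non-NA value in each column, which collapses every order to one row.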

library(tidyverse)

df2 <- df %>%
  gather(Lines, Value, -order) %>%
  group_by(order) %>%
  mutate(Rank = dense_rank(Value), 
         RankOrder = paste0("ord_", row_number())) %>%
  spread(Lines, Value) %>%
  spread(RankOrder, Rank) %>%
  summarise_all(funs(.[which(!is.na(.))]))
df2
# A tibble: 7 x 7
  order line_1 line_2 line_3 ord_1 ord_2 ord_3
  <int>  <dbl>  <dbl>  <dbl> <int> <int> <int>
1     1     23     63     62     1     3     2
2     2      8     25     12     1     3     2
3     3     21     25     10     2     3     1
4     4     45     24     56     2     1     3
5     5     68     48     67     3     1     2
6     6     31     24     25     3     1     2
7     7     24     63     35     1     3     2
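
Since funs() is deprecated in newer dplyr releases, here is a hedged sketch of the same idea written with across(). It is not part of the original answer and assumes dplyr 1.0.0 or later. The pipeline is stopped before the last step to show that the double spread leaves three rows per order, and then the rows are collapsed. The names uncollapsed and collapsed are only illustrative.

```r
library(dplyr)
library(tidyr)

df <- tibble(order = 1:7,
             line_1 = c(23, 8, 21, 45, 68, 31, 24),
             line_2 = c(63, 25, 25, 24, 48, 24, 63),
             line_3 = c(62, 12, 10, 56, 67, 25, 35))

# Same steps as the answer, stopped before the collapsing step
uncollapsed <- df %>%
  gather(Lines, Value, -order) %>%
  group_by(order) %>%
  mutate(Rank = dense_rank(Value),
         RankOrder = paste0("ord_", row_number())) %>%
  ungroup() %>%
  spread(Lines, Value) %>%   # order, Rank and RankOrder still identify rows
  spread(RankOrder, Rank)    # order and the line_* columns still identify rows

nrow(uncollapsed)            # three rows per order, mostly NA

# Keep the single non-NA value of every column within each order
collapsed <- uncollapsed %>%
  group_by(order) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)]))

collapsed
```

The lambda relies on each column having exactly one non-NA value per order, which holds for this data; with missing input values it would need a different rule.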
www answered Sep 27 '22 23:09


Starting from df:

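# Needs tidyverse loaded and the data.table package installed
df %>% 
    gather(headers, line, -order) %>%                 # long format: one row per order/line
    separate(headers, into = c('dummy', 'rn')) %>%    # "line_1" -> dummy = "line", rn = "1"
    select(-dummy) %>% 
    group_by(order) %>% 
    mutate(ord = rank(line, ties.method='first')) %>% # rank of each value within its order
    # cast both value columns at once; setDT is namespaced so data.table need not be attached
    {data.table::dcast(data.table::setDT(.), order ~ rn, value.var = c("line", "ord"))}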

#   order line_1 line_2 line_3 ord_1 ord_2 ord_3
#1:     1     23     63     62     1     3     2
#2:     2      8     25     12     1     3     2
#3:     3     21     25     10     2     3     1
#4:     4     45     24     56     2     1     3
#5:     5     68     48     67     3     1     2
#6:     6     31     24     25     3     1     2
#7:     7     24     63     35     1     3     2
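
If you would rather stay inside the tidyverse, roughly the same reshape can be sketched with pivot_longer() and pivot_wider(). This is an alternative under the assumption of tidyr 1.0.0 or later, not the method used in the answer above. Passing two columns to values_from plays the role of value.var = c("line", "ord") in dcast(); the name wide is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- tibble(order = 1:7,
             line_1 = c(23, 8, 21, 45, 68, 31, 24),
             line_2 = c(63, 25, 25, 24, 48, 24, 63),
             line_3 = c(62, 12, 10, 56, 67, 25, 35))

wide <- df %>%
  # long format; strip the "line_" prefix so rn holds "1", "2", "3"
  pivot_longer(-order, names_to = "rn", names_prefix = "line_",
               values_to = "line") %>%
  group_by(order) %>%
  mutate(ord = rank(line, ties.method = "first")) %>%  # rank within each order
  ungroup() %>%
  # widen both value columns at once; new names are built as <value>_<rn>
  pivot_wider(names_from = rn, values_from = c(line, ord))

wide
```

Because order is the only column left to identify rows, each order ends up on a single row without a separate collapsing step.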
Psidom answered Sep 27 '22 23:09