Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

De-aggregate / reverse-summarise / expand a dataset in R [duplicate]

Tags:

r

dplyr

tidyr

My data looks like this:

data("Titanic")
df <- as.data.frame(Titanic)

How can I de-aggregate or reverse-summarise count/freq and expand the data set back to it's original non-count observation state?

For instance, I want 3rd, Male, Child, No repeated 35 times and 1st, Female, Adult, Yes repeated 140 times, etc, etc, in the dataframe.

Thanks in advance.

like image 598
emehex Avatar asked Jul 05 '16 16:07

emehex


4 Answers

Chiming in with another tidyr approach, using tidyr::uncount.

library(tidyverse)

original <- tibble(x = c(1,1,1,2,2,2,4,4,4))
aggregated <- original %>% count(x)
deaggregated <- aggregated %>% uncount(weights = n)
like image 158
jsavn Avatar answered Sep 28 '22 02:09

jsavn


You can do this with list columns and a few dplyr/tidyr/purrr verbs. It's not as compact as other base R solutions may be, but for me it's easier to understand how things fit together and it works in a larger tidyverse pipeflow.

As a check first-off, we're expecting 2,201 rows in the final data frame:

library(dplyr)
library(tidyr)
library(purrr)

sum(df$Freq)
#> [1] 2201

Converting this to a tibble makes it easier to see and work with the list column. I'm using purrr::map to move along the Freq column, creating a vector of a dummy marker the length of the Freq value. In this case, that marker is just "1"; it could instead be TRUE or anything else. The point it just that it will create a vector of the length Freq.

df %>%
  as_tibble() %>%
  mutate(obs = map(Freq, ~rep_len(1, .x)))
#> # A tibble: 32 x 6
#>    Class Sex    Age   Survived  Freq obs        
#>    <fct> <fct>  <fct> <fct>    <dbl> <list>     
#>  1 1st   Male   Child No           0 <dbl [0]>  
#>  2 2nd   Male   Child No           0 <dbl [0]>  
#>  3 3rd   Male   Child No          35 <dbl [35]> 
#>  4 Crew  Male   Child No           0 <dbl [0]>  
#>  5 1st   Female Child No           0 <dbl [0]>  
#>  6 2nd   Female Child No           0 <dbl [0]>  
#>  7 3rd   Female Child No          17 <dbl [17]> 
#>  8 Crew  Female Child No           0 <dbl [0]>  
#>  9 1st   Male   Adult No         118 <dbl [118]>
#> 10 2nd   Male   Adult No         154 <dbl [154]>
#> # … with 22 more rows

Then tidyr::unnest creates a row for each element in that dummy vector. After that, I drop those last 2 columns to just have the important categories of class, sex, age, and survival.

df %>%
  as_tibble() %>%
  mutate(obs = map(Freq, ~rep_len(1, .x))) %>%
  unnest() %>%
  select(-Freq, -obs)
#> # A tibble: 2,201 x 4
#>    Class Sex   Age   Survived
#>    <fct> <fct> <fct> <fct>   
#>  1 3rd   Male  Child No      
#>  2 3rd   Male  Child No      
#>  3 3rd   Male  Child No      
#>  4 3rd   Male  Child No      
#>  5 3rd   Male  Child No      
#>  6 3rd   Male  Child No      
#>  7 3rd   Male  Child No      
#>  8 3rd   Male  Child No      
#>  9 3rd   Male  Child No      
#> 10 3rd   Male  Child No      
#> # … with 2,191 more rows

In the end, it is in fact a data frame of 2,201 rows.

like image 32
camille Avatar answered Nov 20 '22 04:11

camille


Without packages we can repeat each row according to the frequencies given:

df2 <- df[rep(1:nrow(df), df[,5]),-5]
like image 5
Pierre L Avatar answered Nov 20 '22 04:11

Pierre L


You can do this using the function untable in reshape.

data("Titanic")
df <- as.data.frame(Titanic)

library(reshape)
newDf = untable(df[,1:4], num = df[,5])
like image 4
David Avatar answered Nov 20 '22 04:11

David