Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Separate collapsed column into multiple rows, preserving grouping information

Tags:

dataframe

r

tidyr

I have a messy table I read into a data frame similar to the following simplified one, which I want to parse.

mydf <- data.frame(group=LETTERS[1:5], code=101:105, ids=c('g1:id1,id2,id3\ng2:id4,id5',
                                                           'id6,id7,id8,id9',
                                                           'g1:id10,id11\ng3:id12',
                                                           'g2:id13,id14',
                                                           'id15'))

It looks like:

> mydf
  group code                        ids
1     A  101 g1:id1,id2,id3\ng2:id4,id5
2     B  102            id6,id7,id8,id9
3     C  103      g1:id10,id11\ng3:id12
4     D  104               g2:id13,id14
5     E  105                       id15

I need to separate the collapsed ids column into multiple rows, to obtain a final data frame into long format.

For this, I would usually apply separate_rows from tidyr, but this case has an additional problem: some of the collapsed ids are grouped into subgroups.

I guess I should first separate those subgroups into different ids columns, and then apply separate_rows or separate_longer_delim on those columns... I do not know how to go about this.

The final data frame I expect would be like this, in long format, with the subgroup information in an additional column:

> mydf_new
   group code   id subgroup
1      A  101  id1       g1
2      A  101  id2       g1
3      A  101  id3       g1
4      A  101  id4       g2
5      A  101  id5       g2
6      B  102  id6     <NA>
7      B  102  id7     <NA>
8      B  102  id8     <NA>
9      B  102  id9     <NA>
10     C  103 id10       g1
11     C  103 id11       g1
12     C  103 id12       g3
13     D  104 id13       g2
14     D  104 id14       g2
15     E  105 id15     <NA>
like image 767
DaniCee Avatar asked Aug 31 '25 20:08

DaniCee


2 Answers

One way is to separate wider / longer across a few steps:

library(tidyr)
library(dplyr)

mydf |> 
  separate_longer_delim(ids, "\n") |> 
  separate_wider_delim(ids, ":", names = c("subgroup", "id"), too_few = "align_end") |> 
  separate_longer_delim(id, ",") |> 
  relocate(subgroup, .after = last_col())

# A tibble: 15 × 4
   group  code id    subgroup
   <chr> <int> <chr> <chr>   
 1 A       101 id1   g1      
 2 A       101 id2   g1      
 3 A       101 id3   g1      
 4 A       101 id4   g2      
 5 A       101 id5   g2      
 6 B       102 id6   NA      
 7 B       102 id7   NA      
 8 B       102 id8   NA      
 9 B       102 id9   NA      
10 C       103 id10  g1      
11 C       103 id11  g1      
12 C       103 id12  g3      
13 D       104 id13  g2      
14 D       104 id14  g2      
15 E       105 id15  NA    
like image 73
Ritchie Sacramento Avatar answered Sep 04 '25 04:09

Ritchie Sacramento


This might include some tweaks you can use to develope a more consice base R version if you like:

id = read.table(text=mydf$ids, sep="\n", header=FALSE)
g = sub("\\:.*", "", id$V1); g[nchar(g)>2L]=NA
cbind(mydf[-3L][rep(seq(nrow(mydf)), sapply(mydf$ids, \(x) sum(utf8ToInt(x)==utf8ToInt("i")))), ], 
      id=unlist(i<-strsplit(sub(".*\\:", "", id$V1), ",")), 
      subgroup=rep(g, lengths(i)))

giving

    group code   id subgroup
1       A  101  id1       g1
1.1     A  101  id2       g1
1.2     A  101  id3       g1
1.3     A  101  id4       g2
1.4     A  101  id5       g2
2       B  102  id6     <NA>
2.1     B  102  id7     <NA>
2.2     B  102  id8     <NA>
2.3     B  102  id9     <NA>
3       C  103 id10       g1
3.1     C  103 id11       g1
3.2     C  103 id12       g3
4       D  104 id13       g2
4.1     D  104 id14       g2
5       E  105 id15     <NA>
like image 40
Friede Avatar answered Sep 04 '25 03:09

Friede