Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splitting column into multi-columns

Tags:

r

data.table

For the data given below,

data1<-structure(list(var1 = c("2 7", "2 6 7", "2 7", "2 7", "1 7", 
"1 7", "1 5", "1 2 7", "1 5", "1 7", "1 2 3 4 5 6 7", "1 2 4 6"
)), .Names = "var1", class = "data.frame", row.names = c(NA, 
-12L))

> data1

            var1
1            2 7
2          2 6 7
3            2 7
4            2 7
5            1 7
6            1 7
7            1 5
8          1 2 7
9            1 5
10           1 7
11 1 2 3 4 5 6 7
12       1 2 4 6

I would like it to split into seven columns (7) as follows:

    v1  v2  v3  v4  v5  v6  v7
1   NA  2   NA  NA  NA  NA  7
2   NA  2   NA  NA  NA  6   7
3   NA  2   NA  NA  NA  NA  7
4   NA  2   NA  NA  NA  NA  7
5   1   NA  NA  NA  NA  NA  7
6   1   NA  NA  NA  NA  NA  7
7   1   NA  NA  NA  5   NA  NA
8   1   2   NA  NA  NA  NA  7
9   1   NA  NA  NA  5   NA  NA
10  1   NA  NA  NA  NA  NA  7
11  1   2   3   4   5   6   7
12  1   2   NA  4   NA  6   NA

I use the tstrsplit from data.table package as follows:

library(data.table)
setDT(data1)[,tstrsplit(var1," ")]



 V1 V2 V3 V4 V5 V6 V7
 1:  2  7 NA NA NA NA NA
 2:  2  6  7 NA NA NA NA
 3:  2  7 NA NA NA NA NA
 4:  2  7 NA NA NA NA NA
 5:  1  7 NA NA NA NA NA
 6:  1  7 NA NA NA NA NA
 7:  1  5 NA NA NA NA NA
 8:  1  2  7 NA NA NA NA
 9:  1  5 NA NA NA NA NA
10:  1  7 NA NA NA NA NA
11:  1  2  3  4  5  6  7
12:  1  2  4  6 NA NA NA

This is different than the expected output. I was wondering how can I get the expected output as described above.

like image 336
Metrics Avatar asked Mar 08 '23 03:03

Metrics


1 Answers

With data.table you may try

library(magrittr)
setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))] %>% 
  dcast(., rn ~ V1)
       rn      1      2      3      4      5      6      7
 1:     1     NA      2     NA     NA     NA     NA      7
 2:     2     NA      2     NA     NA     NA      6      7
 3:     3     NA      2     NA     NA     NA     NA      7
 4:     4     NA      2     NA     NA     NA     NA      7
 5:     5      1     NA     NA     NA     NA     NA      7
 6:     6      1     NA     NA     NA     NA     NA      7
 7:     7      1     NA     NA     NA      5     NA     NA
 8:     8      1      2     NA     NA     NA     NA      7
 9:     9      1     NA     NA     NA      5     NA     NA
10:    10      1     NA     NA     NA     NA     NA      7
11:    11      1      2      3      4      5      6      7
12:    12      1      2     NA      4     NA      6     NA

To get rid of the rn column, we can use

setDT(data1)[, strsplit(var1," "), by = .(rn = 1:nrow(data1))][
  , dcast(.SD, rn ~ V1)][, rn := NULL][]

Explanation

setDT(data1)[, strsplit(var1," "), by = .(rn = seq_len(nrow(data1)))]

creates a data.table directly in long format

    rn V1
 1:  1  2
 2:  1  7
 3:  2  2
 4:  2  6
 5:  2  7
 6:  3  2
 7:  3  7
 8:  4  2
 9:  4  7
10:  5  1
11:  5  7
12:  6  1
13:  6  7
14:  7  1
15:  7  5
16:  8  1
17:  8  2
18:  8  7
19:  9  1
20:  9  5
21: 10  1
22: 10  7
23: 11  1
24: 11  2
25: 11  3
26: 11  4
27: 11  5
28: 11  6
29: 11  7
30: 12  1
31: 12  2
32: 12  4
33: 12  6
    rn V1

which is then reshaped to wide format using dcast().

If we would use tstrsplit() instead of strsplit() we would get a data.table in wide format which needs to be reshaped to long format using melt():

setDT(data1)[,tstrsplit(var1," ")][, rn := .I][
  , melt(.SD, id = "rn", na.rm = TRUE)][
    , dcast(.SD, rn ~ paste0("V", value))][
      , rn := NULL][]
like image 64
Uwe Avatar answered Mar 15 '23 22:03

Uwe