Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split delimited strings in a column and insert as new rows [duplicate]

People also ask

How do you split a single text cell into multiple Rows using comma delimited?

In the Split Cells dialog box, select Split to Rows or Split to Columns in the Type section as you need. And in the Specify a separator section, select the Other option, enter the comma symbol into the textbox, and then click the OK button.

How do I split a column into a row in R?

How to Separate A Column into Rows in R? tidyr's separate_rows() makes it easier to do it. Let us make a toy dataframe with multiple names in a column and see two examples of separating the column into multiple rows, first using dplyr's mutate and unnest and then using the separate_rows() function from tidyr.

How do I split a string into a list in R?

To split a string in R, use the strsplit() method. The strsplit() is a built-in R function that splits the string vector into sub-strings. The strsplit() method returns the list, where each list item resembles the item of input that has been split.


As of Dec 2014, this can be done using the unnest function from Hadley Wickham's tidyr package (see release notes http://blog.rstudio.org/2014/12/08/tidyr-0-2-0/)

> library(tidyr)
> library(dplyr)
> mydf

  V1    V2
2  1 a,b,c
3  2   a,c
4  3   b,d
5  4   e,f
6  .     .


> mydf %>% 
    mutate(V2 = strsplit(as.character(V2), ",")) %>% 
    unnest(V2)

   V1 V2
1   1  a
2   1  b
3   1  c
4   2  a
5   2  c
6   3  b
7   3  d
8   4  e
9   4  f
10  .  .

Update 2017: note the separate_rows function as described by @Tif below.

It works so much better, and it allows to "unnest" multiple columns in a single statement:

> head(mydf)
geneid              chrom    start  end strand  length  gene_count
ENSG00000223972.5   chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1    11869;12010;12179;12613;12613;12975;13221;13221;13453   12227;12057;12227;12721;12697;13052;13374;14409;13670   +;+;+;+;+;+;+;+;+   1735    11
ENSG00000227232.5   chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1  14404;15005;15796;16607;16858;17233;17606;17915;18268;24738;29534   14501;15038;15947;16765;17055;17368;17742;18061;18366;24891;29570   -;-;-;-;-;-;-;-;-;-;-   1351    380
ENSG00000278267.1   chr1    17369   17436   -   68  14
ENSG00000243485.4   chr1;chr1;chr1;chr1;chr1    29554;30267;30564;30976;30976   30039;30667;30667;31097;31109   +;+;+;+;+   1021    22
ENSG00000237613.2   chr1;chr1;chr1  34554;35277;35721   35174;35481;36081   -;-;-   1187    24
ENSG00000268020.3   chr1    52473   53312   +   840 14


> mydf %>% separate_rows(strand, chrom, gene_start, gene_end)
geneid  length  gene_count  strand  chrom   start   end
ENSG00000223972.5   1735    11  +   chr1    11869   12227
ENSG00000223972.5   1735    11  +   chr1    12010   12057
ENSG00000223972.5   1735    11  +   chr1    12179   12227
ENSG00000223972.5   1735    11  +   chr1    12613   12721
ENSG00000223972.5   1735    11  +   chr1    12613   12697
ENSG00000223972.5   1735    11  +   chr1    12975   13052
ENSG00000223972.5   1735    11  +   chr1    13221   13374
ENSG00000223972.5   1735    11  +   chr1    13221   14409
ENSG00000223972.5   1735    11  +   chr1    13453   13670
ENSG00000227232.5   1351    380 -   chr1    14404   14501
ENSG00000227232.5   1351    380 -   chr1    15005   15038
ENSG00000227232.5   1351    380 -   chr1    15796   15947
ENSG00000227232.5   1351    380 -   chr1    16607   16765
ENSG00000227232.5   1351    380 -   chr1    16858   17055
ENSG00000227232.5   1351    380 -   chr1    17233   17368
ENSG00000227232.5   1351    380 -   chr1    17606   17742
ENSG00000227232.5   1351    380 -   chr1    17915   18061
ENSG00000227232.5   1351    380 -   chr1    18268   18366
ENSG00000227232.5   1351    380 -   chr1    24738   24891
ENSG00000227232.5   1351    380 -   chr1    29534   29570
ENSG00000278267.1   68  5   -   chr1    17369   17436
ENSG00000243485.4   1021    8   +   chr1    29554   30039
ENSG00000243485.4   1021    8   +   chr1    30267   30667
ENSG00000243485.4   1021    8   +   chr1    30564   30667
ENSG00000243485.4   1021    8   +   chr1    30976   31097
ENSG00000243485.4   1021    8   +   chr1    30976   31109
ENSG00000237613.2   1187    24  -   chr1    34554   35174
ENSG00000237613.2   1187    24  -   chr1    35277   35481
ENSG00000237613.2   1187    24  -   chr1    35721   36081
ENSG00000268020.3   840 0   +   chr1    52473   53312

Here is another way of doing it..

df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"), header = F, sep = "|", stringsAsFactors = F)

df
##   V1    V2
## 1  1 a,b,c
## 2  2   a,c
## 3  3   b,d
## 4  4   e,f

s <- strsplit(df$V2, split = ",")
data.frame(V1 = rep(df$V1, sapply(s, length)), V2 = unlist(s))
##   V1 V2
## 1  1  a
## 2  1  b
## 3  1  c
## 4  2  a
## 5  2  c
## 6  3  b
## 7  3  d
## 8  4  e
## 9  4  f

Now you can use tidyr 0.5.0's separate_rows is in place of strsplit + unnest.

For example:

library(tidyr)
(df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"), header = F, sep = "|", stringsAsFactors = F))
  V1    V2
1  1 a,b,c
2  2   a,c
3  3   b,d
4  4   e,f
separate_rows(df, V2)

Gives:

  V1 V2
1  1  a
2  1  b
3  1  c
4  2  a
5  2  c
6  3  b
7  3  d
8  4  e
9  4  f

See reference: https://blog.rstudio.org/2016/06/13/tidyr-0-5-0/


Here's a data.table solution:

d.df <- read.table(header=T, text="V1 | V2
1 | a,b,c
2 | a,c
3 | b,d
4 | e,f", stringsAsFactors=F, sep="|", strip.white = TRUE)
require(data.table)
d.dt <- data.table(d.df, key="V1")
out <- d.dt[, list(V2 = unlist(strsplit(V2, ","))), by=V1]

#    V1 V2
# 1:  1  a
# 2:  1  b
# 3:  1  c
# 4:  2  a
# 5:  2  c
# 6:  3  b
# 7:  3  d
# 8:  4  e
# 9:  4  f

> sapply(out$V2, nchar) # (or simply nchar(out$V2))
# a b c a c b d e f 
# 1 1 1 1 1 1 1 1 1 

You can consider cSplit with direction = "long" from my "splitstackshape" package.

Usage would be:

cSplit(mydf, "V2", ",", "long")
##    V1 V2
## 1:  1  a
## 2:  1  b
## 3:  1  c
## 4:  2  a
## 5:  2  c
## 6:  3  b
## 7:  3  d
## 8:  4  e
## 9:  4  f

Old answer....

Here is one approach using base R. It assumes we're starting with a data.frame named "mydf". It uses read.csv to read in the second column as a separate data.frame, which we combine with the first column from your source data. Finally, you use reshape to convert the data into a long form.

temp <- data.frame(Ind = mydf$V1, 
                   read.csv(text = as.character(mydf$V2), header = FALSE))
temp1 <- reshape(temp, direction = "long", idvar = "Ind", 
                 timevar = "time", varying = 2:ncol(temp), sep = "")
temp1[!temp1$V == "", c("Ind", "V")]
#     Ind  V
# 1.1   1  a
# 2.1   2  a
# 3.1   3  b
# 4.1   4  e
# 1.2   1  b
# 2.2   2  c
# 3.2   3  d
# 4.2   4  f
# 1.3   1  c

Another fairly direct alternative is:

stack(
  setNames(
    sapply(strsplit(mydf$V2, ","), 
           function(x) gsub("^\\s|\\s$", "", x)), mydf$V1))
  values ind
1      a   1
2      b   1
3      c   1
4      a   2
5      c   2
6      b   3
7      d   3
8      e   4
9      f   4

Another data.table solution, which doesn't rely on the existence of any unique fields in the original data.

DT = data.table(read.table(header=T, text="blah | splitme
    T | a,b,c
    T | a,c
    F | b,d
    F | e,f", stringsAsFactors=F, sep="|", strip.white = TRUE))

DT[,.( blah
     , splitme
     , splitted=unlist(strsplit(splitme, ","))
     ),by=seq_len(nrow(DT))]

The important thing is by=seq_len(nrow(DT)), this is the 'fake' uniqueID that the splitting occurs on. It's tempting to use by=.I instead, as it should be defined the same, but .I appears to be a magical thing that changes its value, better to stick with by=seq_len(nrow(DT))

There are three columns in the output. We simply name the two existing columns, and then compute the third as a split

.( blah       # first column of original
 , splitme    # second column of original
 , splitted = unlist(strsplit(splitme, ","))
 )