Split delimited strings in a column and insert as new rows [duplicate]

People also ask

How do you split a single text cell into multiple Rows using comma delimited?

In the Split Cells dialog box, select Split to Rows or Split to Columns in the Type section as you need. And in the Specify a separator section, select the Other option, enter the comma symbol into the textbox, and then click the OK button.

How do I split a column into a row in R?

How to Separate A Column into Rows in R? tidyr's separate_rows() makes it easier to do it. Let us make a toy dataframe with multiple names in a column and see two examples of separating the column into multiple rows, first using dplyr's mutate and unnest and then using the separate_rows() function from tidyr.

How do I split a string into a list in R?

To split a string in R, use the strsplit() method. The strsplit() is a built-in R function that splits the string vector into sub-strings. The strsplit() method returns the list, where each list item resembles the item of input that has been split.

As of Dec 2014, this can be done using the unnest function from Hadley Wickham's tidyr package (see release notes http://blog.rstudio.org/2014/12/08/tidyr-0-2-0/)

> library(tidyr)
> library(dplyr)
> mydf

  V1    V2
2  1 a,b,c
3  2   a,c
4  3   b,d
5  4   e,f
6  .     .


> mydf %>% 
    mutate(V2 = strsplit(as.character(V2), ",")) %>% 
    unnest(V2)

   V1 V2
1   1  a
2   1  b
3   1  c
4   2  a
5   2  c
6   3  b
7   3  d
8   4  e
9   4  f
10  .  .

Update 2017: note the separate_rows function as described by @Tif below.

It works so much better, and it allows to "unnest" multiple columns in a single statement:

> head(mydf)
geneid              chrom    start  end strand  length  gene_count
ENSG00000223972.5   chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1    11869;12010;12179;12613;12613;12975;13221;13221;13453   12227;12057;12227;12721;12697;13052;13374;14409;13670   +;+;+;+;+;+;+;+;+   1735    11
ENSG00000227232.5   chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1  14404;15005;15796;16607;16858;17233;17606;17915;18268;24738;29534   14501;15038;15947;16765;17055;17368;17742;18061;18366;24891;29570   -;-;-;-;-;-;-;-;-;-;-   1351    380
ENSG00000278267.1   chr1    17369   17436   -   68  14
ENSG00000243485.4   chr1;chr1;chr1;chr1;chr1    29554;30267;30564;30976;30976   30039;30667;30667;31097;31109   +;+;+;+;+   1021    22
ENSG00000237613.2   chr1;chr1;chr1  34554;35277;35721   35174;35481;36081   -;-;-   1187    24
ENSG00000268020.3   chr1    52473   53312   +   840 14


> mydf %>% separate_rows(strand, chrom, gene_start, gene_end)
geneid  length  gene_count  strand  chrom   start   end
ENSG00000223972.5   1735    11  +   chr1    11869   12227
ENSG00000223972.5   1735    11  +   chr1    12010   12057
ENSG00000223972.5   1735    11  +   chr1    12179   12227
ENSG00000223972.5   1735    11  +   chr1    12613   12721
ENSG00000223972.5   1735    11  +   chr1    12613   12697
ENSG00000223972.5   1735    11  +   chr1    12975   13052
ENSG00000223972.5   1735    11  +   chr1    13221   13374
ENSG00000223972.5   1735    11  +   chr1    13221   14409
ENSG00000223972.5   1735    11  +   chr1    13453   13670
ENSG00000227232.5   1351    380 -   chr1    14404   14501
ENSG00000227232.5   1351    380 -   chr1    15005   15038
ENSG00000227232.5   1351    380 -   chr1    15796   15947
ENSG00000227232.5   1351    380 -   chr1    16607   16765
ENSG00000227232.5   1351    380 -   chr1    16858   17055
ENSG00000227232.5   1351    380 -   chr1    17233   17368
ENSG00000227232.5   1351    380 -   chr1    17606   17742
ENSG00000227232.5   1351    380 -   chr1    17915   18061
ENSG00000227232.5   1351    380 -   chr1    18268   18366
ENSG00000227232.5   1351    380 -   chr1    24738   24891
ENSG00000227232.5   1351    380 -   chr1    29534   29570
ENSG00000278267.1   68  5   -   chr1    17369   17436
ENSG00000243485.4   1021    8   +   chr1    29554   30039
ENSG00000243485.4   1021    8   +   chr1    30267   30667
ENSG00000243485.4   1021    8   +   chr1    30564   30667
ENSG00000243485.4   1021    8   +   chr1    30976   31097
ENSG00000243485.4   1021    8   +   chr1    30976   31109
ENSG00000237613.2   1187    24  -   chr1    34554   35174
ENSG00000237613.2   1187    24  -   chr1    35277   35481
ENSG00000237613.2   1187    24  -   chr1    35721   36081
ENSG00000268020.3   840 0   +   chr1    52473   53312

Here is another way of doing it..

df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"), header = F, sep = "|", stringsAsFactors = F)

df
##   V1    V2
## 1  1 a,b,c
## 2  2   a,c
## 3  3   b,d
## 4  4   e,f

s <- strsplit(df$V2, split = ",")
data.frame(V1 = rep(df$V1, sapply(s, length)), V2 = unlist(s))
##   V1 V2
## 1  1  a
## 2  1  b
## 3  1  c
## 4  2  a
## 5  2  c
## 6  3  b
## 7  3  d
## 8  4  e
## 9  4  f

Now you can use tidyr 0.5.0's separate_rows is in place of strsplit + unnest.

For example:

library(tidyr)
(df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"), header = F, sep = "|", stringsAsFactors = F))

  V1    V2
1  1 a,b,c
2  2   a,c
3  3   b,d
4  4   e,f

separate_rows(df, V2)

Gives:

See reference: https://blog.rstudio.org/2016/06/13/tidyr-0-5-0/

Here's a data.table solution:

d.df <- read.table(header=T, text="V1 | V2
1 | a,b,c
2 | a,c
3 | b,d
4 | e,f", stringsAsFactors=F, sep="|", strip.white = TRUE)
require(data.table)
d.dt <- data.table(d.df, key="V1")
out <- d.dt[, list(V2 = unlist(strsplit(V2, ","))), by=V1]

#    V1 V2
# 1:  1  a
# 2:  1  b
# 3:  1  c
# 4:  2  a
# 5:  2  c
# 6:  3  b
# 7:  3  d
# 8:  4  e
# 9:  4  f

> sapply(out$V2, nchar) # (or simply nchar(out$V2))
# a b c a c b d e f 
# 1 1 1 1 1 1 1 1 1

You can consider cSplit with direction = "long" from my "splitstackshape" package.

Usage would be:

cSplit(mydf, "V2", ",", "long")
##    V1 V2
## 1:  1  a
## 2:  1  b
## 3:  1  c
## 4:  2  a
## 5:  2  c
## 6:  3  b
## 7:  3  d
## 8:  4  e
## 9:  4  f

Old answer....

Here is one approach using base R. It assumes we're starting with a data.frame named "mydf". It uses read.csv to read in the second column as a separate data.frame, which we combine with the first column from your source data. Finally, you use reshape to convert the data into a long form.

temp <- data.frame(Ind = mydf$V1, 
                   read.csv(text = as.character(mydf$V2), header = FALSE))
temp1 <- reshape(temp, direction = "long", idvar = "Ind", 
                 timevar = "time", varying = 2:ncol(temp), sep = "")
temp1[!temp1$V == "", c("Ind", "V")]
#     Ind  V
# 1.1   1  a
# 2.1   2  a
# 3.1   3  b
# 4.1   4  e
# 1.2   1  b
# 2.2   2  c
# 3.2   3  d
# 4.2   4  f
# 1.3   1  c

Another fairly direct alternative is:

stack(
  setNames(
    sapply(strsplit(mydf$V2, ","), 
           function(x) gsub("^\\s|\\s$", "", x)), mydf$V1))
  values ind
1      a   1
2      b   1
3      c   1
4      a   2
5      c   2
6      b   3
7      d   3
8      e   4
9      f   4

Another data.table solution, which doesn't rely on the existence of any unique fields in the original data.

DT = data.table(read.table(header=T, text="blah | splitme
    T | a,b,c
    T | a,c
    F | b,d
    F | e,f", stringsAsFactors=F, sep="|", strip.white = TRUE))

DT[,.( blah
     , splitme
     , splitted=unlist(strsplit(splitme, ","))
     ),by=seq_len(nrow(DT))]

The important thing is by=seq_len(nrow(DT)), this is the 'fake' uniqueID that the splitting occurs on. It's tempting to use by=.I instead, as it should be defined the same, but .I appears to be a magical thing that changes its value, better to stick with by=seq_len(nrow(DT))

There are three columns in the output. We simply name the two existing columns, and then compute the third as a split

.( blah       # first column of original
 , splitme    # second column of original
 , splitted = unlist(strsplit(splitme, ","))
 )

Related questions
                            
                                How to remove last n characters from every element in the R vector
                            
                                Select first 4 rows of a data.frame in R
                            
                                passing several arguments to FUN of lapply (and others *apply)
                            
                                How to use Greek symbols in ggplot2?
                            
                                Change value of variable with dplyr
                            
                                Finding row index containing maximum value using R
                            
                                Control the size of points in an R scatterplot?
                            
                                remove legend title in ggplot
                            
                                How to test if list element exists?
                            
                                Get type of all variables
                            
                                remove all variables except functions
                            
                                Convert column classes in data.table
                            
                                How can I plot with 2 different y-axes?
                            
                                'Incomplete final line' warning when trying to read a .csv file into R
                            
                                How to add \newpage in Rmarkdown in a smart way?
                            
                                Controlling number of decimal digits in print output in R
                            
                                Printing newlines with print() in R
                            
                                adding x and y axis labels in ggplot2
                            
                                Check if the number is integer
                            
                                To find whether a column exists in data frame or not

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Split delimited strings in a column and insert as new rows [duplicate]

Tags:

dataframe

r

data-manipulation

reshape

strsplit

People also ask

Old answer....

Recent Activity

Donate For Us