using strsplit and subset in dplyr and mutate

Question

I have a data table with one string column. I'd like to create another column that is a subset of this column using strsplit.

dat <- data.table(labels=c('a_1','b_2','c_3','d_4'))

The output I want is

labels  sub_label
a_1     a
b_2     b
c_3     c
d_4     d

I've tried the following, but neither attempt works.

dat %>%
    mutate(
        sub_labels=strsplit(as.character(labels), "_")[[1]][1]
    ) 
# gives a column whose values are all "a"

This one, which seems logical to me,

dat %>%
    mutate(
        sub_labels=sapply(strsplit(as.character(labels), "_"), function(x) x[[1]][1])
    )

gives an error:

Error: Don't know how to handle type pairlist

I saw another post where paste with collapse on the output from strsplit worked, so I don't understand why subsetting in an anonymous function causes problems. Thanks for any explanation.

Romain Francois · Accepted Answer

tidyr::separate can help here:

> dat %>% separate(labels, c("first", "second") )
   first second
1:     a      1
2:     b      2
3:     c      3
4:     d      4
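To get the exact layout asked for in the question (the original column plus a sub_label column), separate can be told to keep its input and drop the second piece. The following is a minimal sketch, assuming a tidyr version that accepts NA in `into`; it uses a plain data.frame rather than the data.table from the question.

```r
library(tidyr)

dat <- data.frame(labels = c("a_1", "b_2", "c_3", "d_4"),
                  stringsAsFactors = FALSE)

# remove = FALSE keeps the original `labels` column;
# NA in `into` discards the piece after the underscore.
res <- separate(dat, labels, into = c("sub_label", NA),
                sep = "_", remove = FALSE)
print(res)
```

The result should have two columns, labels and sub_label, with one row per input string.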

GenesRus · Answer

Another method uses purrr's map_chr, which I've found useful for applications where I didn't want to bother with separating and uniting (e.g. using the results in a sprintf with other strings):

library(dplyr)
library(purrr)

tibble(labels=c('a_1','b_2','c_3','d_4')) %>% 
  mutate(sub_label = stringr::str_split(labels, "_") %>% map_chr(., 1))
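For comparison, here is a base R sketch of the same idea. It also shows why the first attempt in the question fills the column with "a": `[[1]]` selects only the first row's split, and mutate recycles that single value. The variable names `pieces`, `first_only`, and `res` are illustrative only.

```r
library(dplyr)

dat <- tibble(labels = c("a_1", "b_2", "c_3", "d_4"))

# strsplit returns a list with one character vector per row
pieces <- strsplit(dat$labels, "_", fixed = TRUE)

# Question's first attempt: [[1]] picks only the first row's pieces,
# so this is a single string that mutate() would recycle to every row.
first_only <- pieces[[1]][1]

# Taking element 1 of every list entry yields one value per row.
# vapply() also checks that each result is a single string.
res <- dat %>%
  mutate(sub_label = vapply(strsplit(labels, "_", fixed = TRUE),
                            `[`, character(1), 1))
print(res)
```

This is what `map_chr(., 1)` does in the purrr version: it extracts the first element from each list entry and returns a character vector.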

In my experience this method can be substantially faster than separate, especially on larger datasets. separate barely beats map_chr when I use 100 strings, but falls behind in most runs when I use 1000, as in the benchmark below (I'm not sure what explains the outlier in the max column).

> microbenchmark::microbenchmark(
+   d.filtered_reads %>% head(1000) %>% 
+     mutate(name = stringr::str_split(Header, " ") %>% map_chr(., 1)) %>% 
+     select(-Header),
+   d.filtered_reads %>% head(1000) %>% 
+     separate(Header, into = c("name","index"), sep = " ") %>% 
+     select(-"index")
+ )
Unit: milliseconds
                expr      min       lq     mean   median       uq       max neval
 str_split + map_chr 5.333891 5.817589 6.292954 5.935706 6.059031 41.530089   100
            separate 7.517316 8.031325 8.399471 8.500359 8.647468  9.855612   100
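The d.filtered_reads data is not available here, so the sketch below reruns the comparison on synthetic data. The `reads` table and the helpers `via_map` and `via_separate` are hypothetical stand-ins, and timings will differ by machine and package versions.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for d.filtered_reads: 1000 "name index" headers
set.seed(1)
reads <- tibble(Header = paste0("read", seq_len(1000), " ", sample(1000)))

via_map <- function(d) {
  d %>%
    mutate(name = str_split(Header, " ") %>% map_chr(1)) %>%
    select(-Header)
}

via_separate <- function(d) {
  d %>%
    separate(Header, into = c("name", "index"), sep = " ") %>%
    select(-index)
}

# Both approaches should produce the same `name` column
stopifnot(identical(via_map(reads)$name, via_separate(reads)$name))

# Timing is optional and only runs if microbenchmark is installed
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(via_map(reads), via_separate(reads),
                                       times = 20))
}
```

Wrapping each pipeline in a function keeps the expr column of the benchmark output short and readable.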

chungkim271
