
Using strsplit and subset in dplyr and mutate

I have a data table with one string column. I'd like to create another column that is a subset of this column using strsplit.

dat <- data.table(labels=c('a_1','b_2','c_3','d_4'))

The output I want is

labels sub_label
a_1    a
b_2    b
c_3    c
d_4    d 

I've tried the following, but neither attempt works.

dat %>%
    mutate(
        sub_labels=strsplit(as.character(labels), "_")[[1]][1]
    ) 
# gives a column whose values are all "a"

The second attempt, which seems logical to me,

dat %>%
    mutate(
        sub_labels=sapply(strsplit(as.character(labels), "_"), function(x) x[[1]][1])
    )

gives an error:

Error: Don't know how to handle type pairlist

I saw another post where paste with collapse on the output from strsplit worked, so I don't understand why subsetting in an anonymous function causes problems. Thanks for any elucidation on this.

chungkim271 asked Mar 02 '17 20:03



2 Answers

tidyr::separate can help here:

> dat %>% separate(labels, c("first", "second") )
   first second
1:     a      1
2:     b      2
3:     c      3
4:     d      4    
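
If the original column should stay in place, as in the desired output in the question, a minimal sketch would be to pass remove = FALSE to separate. This assumes a plain data frame rather than the question's data.table, and the new column names sub_label and index are illustrative choices, not names from the question.

```r
library(tidyr)

# Plain data frame standing in for the question's data.table (assumption)
dat <- data.frame(labels = c("a_1", "b_2", "c_3", "d_4"),
                  stringsAsFactors = FALSE)

# remove = FALSE keeps the original `labels` column next to the new ones;
# `sub_label` and `index` are illustrative column names
res <- separate(dat, labels, into = c("sub_label", "index"),
                sep = "_", remove = FALSE)
print(res)
```

The index column can be dropped afterwards if only the prefix is needed.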
Romain Francois answered Oct 12 '22 09:10



Another method uses purrr's map_chr, which I've found useful for applications where I didn't want to bother with separating and uniting (e.g. using the results in a sprintf with other strings):

library(dplyr)  # tibble(), mutate(), %>%
library(purrr)  # map_chr()
tibble(labels=c('a_1','b_2','c_3','d_4')) %>% 
  mutate(sub_label = stringr::str_split(labels, "_") %>% map_chr(., 1))

In my experience, this method can be substantially faster than separate, especially on longer datasets. separate barely beats map_chr when I use 100 strings, but falls behind in most runs when I use 1000 (I'm not sure what explains the outlying max for the map_chr version).

> microbenchmark::microbenchmark(
+   d.filtered_reads %>% head(1000) %>% 
+     mutate(name = stringr::str_split(Header, " ") %>% map_chr(., 1)) %>% 
+     select(-Header),
+   d.filtered_reads %>% head(1000) %>% 
+     separate(Header, into = c("name","index"), sep = " ") %>% 
+     select(-"index")
+ )
Unit: milliseconds
                                                                                                                          expr
 d.filtered_reads %>% head(1000) %>% mutate(name = stringr::str_split(Header,      " ") %>% map_chr(., 1)) %>% select(-Header)
          d.filtered_reads %>% head(1000) %>% separate(Header, into = c("name",      "index"), sep = " ") %>% select(-"index")
      min       lq     mean   median       uq       max neval
 5.333891 5.817589 6.292954 5.935706 6.059031 41.530089   100
 7.517316 8.031325 8.399471 8.500359 8.647468  9.855612   100
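
For comparison, the same "first piece of each string" idea can be sketched in base R. It also illustrates why the question's first attempt filled the column with "a": `[[1]][1]` selects the first piece of the first string only, and that single value is then recycled. The variable names here are illustrative.

```r
labels <- c("a_1", "b_2", "c_3", "d_4")

# strsplit() returns a list with one character vector per input string
pieces <- strsplit(labels, "_", fixed = TRUE)

# pieces[[1]][1] is only the first piece of the FIRST string,
# so using it in mutate() recycles one value down the whole column
first_only <- pieces[[1]][1]

# Take element 1 of every list entry instead; vapply() also checks
# that each result is a single character value
sub_label <- vapply(pieces, `[`, character(1), 1)
print(sub_label)
```

vapply is used instead of sapply here so that an unexpected result shape raises an error rather than silently changing the return type.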
GenesRus answered Oct 12 '22 09:10
