Adding multiple columns in a dplyr mutate call

Tags: r, dplyr

I have a data frame with a dot-separated character column:

> set.seed(310366)
> tst = data.frame(x=1:10,y=paste(sample(c("FOO","BAR","BAZ"),10,TRUE),".",sample(c("foo","bar","baz"),10,TRUE),sep=""))
> tst
    x       y
1   1 BAR.baz
2   2 FOO.foo
3   3 BAZ.baz
4   4 BAZ.foo
5   5 BAZ.bar
6   6 FOO.baz
7   7 BAR.bar
8   8 BAZ.baz

and I want to split that column into two new columns containing the parts on either side of the dot. str_split_fixed from package stringr can do the job quite nicely. All my values are definitely two parts separated by a dot so I can do:

> require(stringr)
> str_split_fixed(tst$y,"\\.",2)
      [,1]  [,2] 
 [1,] "BAR" "baz"
 [2,] "FOO" "foo"
 [3,] "BAZ" "baz"
 [4,] "BAZ" "foo"
 [5,] "BAZ" "bar"
 [6,] "FOO" "baz"
 [7,] "BAR" "bar"

Now I could just cbind that to my data frame but I thought I'd figure out how to do that in a dplyr pipeline. First I thought mutate could do it in one:

> tst %.% mutate(parts=str_split_fixed(y,"\\.",2))
Error: wrong result size (20), expected 10 or 1
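
The size reported in the error (20) is consistent with str_split_fixed returning a 10 x 2 character matrix, so the expression yields twice as many values as there are rows. A minimal sketch of that shape, using three made-up values rather than the original data:

```r
library(stringr)

# Three illustrative values (not the original data) in the same "A.b" shape
y <- c("BAR.baz", "FOO.foo", "BAZ.bar")

# str_split_fixed() returns a character matrix:
# one row per input string, one column per piece
m <- str_split_fixed(y, "\\.", 2)

dim(m)     # 3 rows, 2 columns
length(m)  # 6 elements in total, twice the number of rows
```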

I can get mutate to do it in two:

> tst %.% mutate(part1=str_split_fixed(y,"\\.",2)[,1], part2=str_split_fixed(y,"\\.",2)[,2])
    x       y part1 part2
1   1 BAR.baz   BAR   baz
2   2 FOO.foo   FOO   foo
3   3 BAZ.baz   BAZ   baz
4   4 BAZ.foo   BAZ   foo
5   5 BAZ.bar   BAZ   bar
6   6 FOO.baz   FOO   baz

but that's running the string split twice.

"Best" I can do so far in a dplyr way is this (which I only discovered while writing this question...):

> tst %.% do(cbind(.,data.frame(parts=str_split_fixed(.$y,"\\.",2))))
    x       y parts.1 parts.2
1   1 BAR.baz     BAR     baz
2   2 FOO.foo     FOO     foo
3   3 BAZ.baz     BAZ     baz
4   4 BAZ.foo     BAZ     foo
5   5 BAZ.bar     BAZ     bar

which isn't bad, but loses a lot of the readability of piped things in R. Is there a simple approach using mutate that I've missed?

354

asked Jul 24 '14 14:07

Spacedman

2 Answers

You can use separate() from tidyr in combination with dplyr:

library(dplyr)
library(tidyr)
tst %>% separate(y, c("y1", "y2"), sep = "\\.", remove = FALSE)

    x       y  y1  y2
1   1 BAR.baz BAR baz
2   2 FOO.foo FOO foo
3   3 BAZ.baz BAZ baz
4   4 BAZ.foo BAZ foo
5   5 BAZ.bar BAZ bar
6   6 FOO.baz FOO baz
7   7 BAR.bar BAR bar
8   8 BAZ.baz BAZ baz
9   9 FOO.bar FOO bar
10 10 BAR.foo BAR foo

Setting remove = TRUE (the default) removes the original column y.
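
To make the effect of remove concrete, here is a small sketch on a three-row stand-in data frame (tst_small is a made-up name, not part of the question's data). It only compares the resulting column names:

```r
library(tidyr)

# Small stand-in for the question's data frame (hypothetical values)
tst_small <- data.frame(
  x = 1:3,
  y = c("BAR.baz", "FOO.foo", "BAZ.bar"),
  stringsAsFactors = FALSE
)

# remove = FALSE keeps the original column next to the new ones
kept <- separate(tst_small, y, c("y1", "y2"), sep = "\\.", remove = FALSE)
names(kept)     # "x" "y" "y1" "y2"

# remove = TRUE (the default) drops the original column
dropped <- separate(tst_small, y, c("y1", "y2"), sep = "\\.", remove = TRUE)
names(dropped)  # "x" "y1" "y2"
```

Note that sep is treated as a regular expression here, which is why the dot is escaped.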

140

answered Oct 05 '22 11:10

erc

This answer applies here as well; the following approach is both tidyverse-idiomatic and more performant than separate() (as of 2020):

set.seed(310366)
tst = data.frame(x=1:10,y=paste(sample(c("FOO","BAR","BAZ"),10,TRUE),".",sample(c("foo","bar","baz"),10,TRUE),sep=""))

library(dplyr)
library(purrr)

tst %>% 
  mutate(tmp_chunks = stringr::str_split(y, stringr::fixed("."), n = 2)) %>%
  mutate(y1 = map_chr(tmp_chunks, 1),
         y2 = map_chr(tmp_chunks, 2)) %>%
  select(-tmp_chunks)

... Or if you don't want y anymore after splitting it, you can change the last line to

  select(-tmp_chunks, -y)
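
For comparison, a possible single-split variant: assuming dplyr 1.0.0 or later, mutate() accepts an unnamed expression that returns a data frame and unpacks its columns, so the split runs once and no temporary list column is needed. The helper split_parts and the data frame tst_small below are hypothetical names introduced for this sketch only:

```r
library(dplyr)
library(stringr)

# Small stand-in for the question's data frame (hypothetical values)
tst_small <- data.frame(
  x = 1:3,
  y = c("BAR.baz", "FOO.foo", "BAZ.bar"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each string once on a literal dot and
# return both pieces as the columns of a data frame
split_parts <- function(s) {
  m <- str_split_fixed(s, fixed("."), 2)
  data.frame(y1 = m[, 1], y2 = m[, 2], stringsAsFactors = FALSE)
}

# The unnamed data-frame result is unpacked into columns y1 and y2
# (assumes dplyr >= 1.0.0)
res <- tst_small %>% mutate(split_parts(y))
```

Whether this is faster or slower than the list-column approach above has not been measured here.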

answered Oct 05 '22 10:10

DomQ
