I have a data frame with a dot-separated character column:
> set.seed(310366)
> tst = data.frame(x=1:10,y=paste(sample(c("FOO","BAR","BAZ"),10,TRUE),".",sample(c("foo","bar","baz"),10,TRUE),sep=""))
> tst
x y
1 1 BAR.baz
2 2 FOO.foo
3 3 BAZ.baz
4 4 BAZ.foo
5 5 BAZ.bar
6 6 FOO.baz
7 7 BAR.bar
8 8 BAZ.baz
and I want to split that column into two new columns containing the parts on either side of the dot. str_split_fixed
from package stringr
can do the job quite nicely. All my values are definitely two parts separated by a dot so I can do:
> require(stringr)
> str_split_fixed(tst$y,"\\.",2)
[,1] [,2]
[1,] "BAR" "baz"
[2,] "FOO" "foo"
[3,] "BAZ" "baz"
[4,] "BAZ" "foo"
[5,] "BAZ" "bar"
[6,] "FOO" "baz"
[7,] "BAR" "bar"
Now I could just cbind
that to my data frame but I thought I'd figure out how to do that in a dplyr
pipeline. First I thought mutate
could do it in one:
> tst %.% mutate(parts=str_split_fixed(y,"\\.",2))
Error: wrong result size (20), expected 10 or 1
I can get mutate
to do it in two:
> tst %.% mutate(part1=str_split_fixed(y,"\\.",2)[,1], part2=str_split_fixed(y,"\\.",2)[,2])
x y part1 part2
1 1 BAR.baz BAR baz
2 2 FOO.foo FOO foo
3 3 BAZ.baz BAZ baz
4 4 BAZ.foo BAZ foo
5 5 BAZ.bar BAZ bar
6 6 FOO.baz FOO baz
but that's running the string split twice.
"Best" I can do so far in a dplyr
way is this (which I only discovered while writing this question...):
> tst %.% do(cbind(.,data.frame(parts=str_split_fixed(.$y,"\\.",2))))
x y parts.1 parts.2
1 1 BAR.baz BAR baz
2 2 FOO.foo FOO foo
3 3 BAZ.baz BAZ baz
4 4 BAZ.foo BAZ foo
5 5 BAZ.bar BAZ bar
which isn't bad, but loses a lot of the readability of piped things in R. Is there a simple approach using mutate
that I've missed?
To pick out single or multiple columns use the select() function. The select() function expects a dataframe as it's first input ('argument', in R language), followed by the names of the columns you want to extract with a comma between each name.
More specifically, you will learn 1) to add a column using base R (i.e., by using the $-operator and brackets, 2) add a column using the add_column() function (i.e., from tibble), 3) add multiple columns, and 4) to add columns from one dataframe to another.
%>% is called the forward pipe operator in R. It provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. It is defined by the package magrittr (CRAN) and is heavily used by dplyr (CRAN).
You can use separate()
from tidyr
in combination with dplyr
:
tst %>% separate(y, c("y1", "y2"), sep = "\\.", remove=FALSE)
x y y1 y2
1 1 BAR.baz BAR baz
2 2 FOO.foo FOO foo
3 3 BAZ.baz BAZ baz
4 4 BAZ.foo BAZ foo
5 5 BAZ.bar BAZ bar
6 6 FOO.baz FOO baz
7 7 BAR.bar BAR bar
8 8 BAZ.baz BAZ baz
9 9 FOO.bar FOO bar
10 10 BAR.foo BAR foo
Setting remove=TRUE
will remove column y
This answer applies here as well; the following approach is both tidyverse-idiomatic and more performant than separate()
(as of 2020):
set.seed(310366)
tst = data.frame(x=1:10,y=paste(sample(c("FOO","BAR","BAZ"),10,TRUE),".",sample(c("foo","bar","baz"),10,TRUE),sep=""))
library(dplyr)
library(purrr)
tst %>%
mutate(tmp_chunks = stringr::str_split(y, fixed("."), n = 2)) %>%
mutate(y1 = map_chr(tmp_chunks, 1),
y2 = map_chr(tmp_chunks, 2)) %>%
select(-tmp_chunks)
... Or if you don't want y
anymore after splitting it, you can change the last line to
select(-tmp_chunks, -y)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With