I have a sequence of variables in a dataframe (over 100) and I would like to create an indicator variable for whether particular text patterns are present in any of the variables. Below is an example with three variables. One solution I've found is using tidyr::unite() followed by dplyr::mutate(), but I'm interested in a solution where I do not have to unite the variables.
c1<-c("T1", "X1", "T6", "R5")
c2<-c("R4", "C6", "C7", "X3")
c3<-c("C5", "C2", "X4", "T2")
df<-data.frame(c1, c2, c3)
c1 c2 c3
1 T1 R4 C5
2 X1 C6 C2
3 T6 C7 X4
4 R5 X3 T2
code.vec<-c("T1", "T2", "T3", "T4") #Text patterns of interest
code_regex<-paste(code.vec, collapse="|")
library(dplyr)
library(tidyr)
new<-df %>%
  unite(all_c, c1:c3, remove=FALSE) %>%
  mutate(indicator=if_else(grepl(code_regex, all_c), 1, 0)) %>%
  select(-(all_c))
c1 c2 c3 indicator
1 T1 R4 C5 1
2 X1 C6 C2 0
3 T6 C7 X4 0
4 R5 X3 T2 1
Above is an example that produces the desired result; however, I feel there should be a way of doing this in the tidyverse without having to unite the variables. This is something that SAS handles very easily using an ARRAY statement and a DO loop, and I'm hoping R has a good way of handling this.
The real dataframe has many additional variables besides the "c" fields to search, so a solution that searches every column would require first subsetting the dataframe to only the variables I want to search, and then joining the result back with the other variables.
Using base R, we can use sapply with grepl to look for the pattern in every column and assign 1 to rows that have at least one match.
df$indicator <- as.integer(rowSums(sapply(df, grepl, pattern = code_regex)) > 0)
df
# c1 c2 c3 indicator
#1 T1 R4 C5 1
#2 X1 C6 C2 0
#3 T6 C7 X4 0
#4 R5 X3 T2 1
If there are other columns and we want to apply this only to the columns whose names start with "c", we can use grep to select them.
cols <- grep("^c", names(df))
as.integer(rowSums(sapply(df[cols], grepl, pattern = code_regex)) > 0)
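To make the column-subset variant concrete, here is a minimal self-contained sketch. It assumes the data frame carries one extra, unrelated column; the name `id` is hypothetical and only stands in for the "many additional variables" mentioned in the question.

```r
# Example data: the three "c" columns plus an unrelated column.
# `id` is a hypothetical name used only for illustration.
df <- data.frame(
  id = 1:4,
  c1 = c("T1", "X1", "T6", "R5"),
  c2 = c("R4", "C6", "C7", "X3"),
  c3 = c("C5", "C2", "X4", "T2"),
  stringsAsFactors = FALSE
)
code_regex <- paste(c("T1", "T2", "T3", "T4"), collapse = "|")

# Positions of the columns whose names start with "c".
cols <- grep("^c", names(df))

# One logical column per searched variable; a row is flagged if any is TRUE.
hits <- sapply(df[cols], grepl, pattern = code_regex)
df$indicator <- as.integer(rowSums(hits) > 0)

df$indicator
# [1] 1 0 0 1
```

Because only `df[cols]` is searched, the other columns never need to be split off and joined back; the indicator is simply assigned onto the full data frame.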
Using dplyr, we can do:
library(dplyr)
df$indicator <- as.integer(df %>%
  select(c1:c3) %>%                       # keep only the columns to search
  mutate_all(~grepl(code_regex, .)) %>%   # TRUE where a column matches
  rowSums() > 0)
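As a possible alternative in more recent dplyr, the same row-wise "any column matches" test can be written with if_any() inside mutate(). This is a sketch that assumes a dplyr version that provides if_any() (it was added in the 1.0.x series); it is not part of the original answer, and the `id` column is again hypothetical.

```r
library(dplyr)

df <- data.frame(
  id = 1:4,  # hypothetical extra column that should not be searched
  c1 = c("T1", "X1", "T6", "R5"),
  c2 = c("R4", "C6", "C7", "X3"),
  c3 = c("C5", "C2", "X4", "T2"),
  stringsAsFactors = FALSE
)
code_regex <- paste(c("T1", "T2", "T3", "T4"), collapse = "|")

# if_any() applies the test to each selected column and returns TRUE
# for a row when at least one of those columns matches.
new <- df %>%
  mutate(indicator = as.integer(if_any(c1:c3, ~ grepl(code_regex, .x))))

new$indicator
# [1] 1 0 0 1
```

The selection `c1:c3` could be swapped for a tidyselect helper such as `starts_with("c")` when there are many columns to search.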
We can also use the tidyverse:
library(tidyverse)
df %>%
  mutate_all(str_detect, pattern = code_regex) %>%
  reduce(`+`) %>%
  mutate(df, indicator = .)
# c1 c2 c3 indicator
#1 T1 R4 C5 1
#2 X1 C6 C2 0
#3 T6 C7 X4 0
#4 R5 X3 T2 1
Or using base R
Reduce(`+`, lapply(df, grepl, pattern = code_regex))
#[1] 1 0 0 1
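One point worth noting about the `+` reduction: it counts matching columns per row, so it only equals a 0/1 indicator when no row matches more than once. The sketch below uses slightly modified example data (row 1 of c3 changed to "T3", purely for illustration) to show the difference, and uses `|` to get a strict indicator.

```r
# Modified example data: row 1 now matches in two columns (c1 and c3).
df <- data.frame(
  c1 = c("T1", "X1", "T6", "R5"),
  c2 = c("R4", "C6", "C7", "X3"),
  c3 = c("T3", "C2", "X4", "T2"),
  stringsAsFactors = FALSE
)
code_regex <- paste(c("T1", "T2", "T3", "T4"), collapse = "|")

# One logical vector per column.
matches <- lapply(df, grepl, pattern = code_regex)

# Summing gives the number of matching columns in each row.
counts <- Reduce(`+`, matches)
counts
# [1] 2 0 0 1

# Combining with logical OR gives a strict 0/1 indicator.
flag <- as.integer(Reduce(`|`, matches))
flag
# [1] 1 0 0 1
```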