My question is about how to create a new variable in a data frame in R based on the result of a regular expression. Below is a minimal example of the data:
df <- data.frame(model=c("Legacy 2.0 BG5 B4 AUTO","Legacy 2.0 BH5 AT","Legacy 2.0i CVT Non Leather","Legacy 2.0i CVT","Legacy 2.0 BL5 AUTO B4",
"Legacy 2.0 BP5 AUTO","Legacy 2.0 BM5 AUTO CVT"), CRSP=c(3450000,3365000,4950000,5250000,4787526,3550000,5235000))
df
                        model    CRSP
1      Legacy 2.0 BG5 B4 AUTO 3450000
2           Legacy 2.0 BH5 AT 3365000
3 Legacy 2.0i CVT Non Leather 4950000
4             Legacy 2.0i CVT 5250000
5      Legacy 2.0 BL5 AUTO B4 4787526
6         Legacy 2.0 BP5 AUTO 3550000
7     Legacy 2.0 BM5 AUTO CVT 5235000
I would like to create a new variable 'chassis' whose value is the third whitespace-separated word of the corresponding 'model' string, ending up with:
df
                        model    CRSP chassis
1      Legacy 2.0 BG5 B4 AUTO 3450000     BG5
2           Legacy 2.0 BH5 AT 3365000     BH5
3 Legacy 2.0i CVT Non Leather 4950000     CVT
4             Legacy 2.0i CVT 5250000     CVT
5      Legacy 2.0 BL5 AUTO B4 4787526     BL5
6         Legacy 2.0 BP5 AUTO 3550000     BP5
7     Legacy 2.0 BM5 AUTO CVT 5235000     BM5
I need to find a way of extracting the appropriate element in each row and placing it in the new variable. Any assistance would be greatly appreciated.
Regular expressions act like any other value, and can be assigned to variables and used in function arguments.
$ means "Match the end of the string" (the position after the last character in the string).
The fundamental building blocks of a regex are patterns that match a single character. Most characters, including all letters (a-z and A-Z) and digits (0-9), match themselves. For example, the regex x matches the substring "x"; z matches "z"; and 9 matches "9".
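The points above can be checked quickly in base R. This is only a small sketch; the variable names (pattern, has_x, ends_with_cvt) are made up for illustration.

```r
# A regex is just a character string, so it can be stored in a variable
# and passed to a function like any other value.
pattern <- "x"

# A plain letter or digit matches itself anywhere in the string.
has_x <- grepl(pattern, c("box", "bag"))

# `$` anchors the match to the end of the string.
ends_with_cvt <- grepl("CVT$", c("Legacy 2.0i CVT", "Legacy 2.0i CVT Non Leather"))
```

Here has_x is TRUE only for "box", and ends_with_cvt is TRUE only for the string that actually ends in "CVT".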
Here's a possible solution using stringi:
library(stringi)
df$chassis <- stri_extract_all_words(df$model, simplify = TRUE)[, 3]
df
#                         model    CRSP chassis
# 1      Legacy 2.0 BG5 B4 AUTO 3450000     BG5
# 2           Legacy 2.0 BH5 AT 3365000     BH5
# 3 Legacy 2.0i CVT Non Leather 4950000     CVT
# 4             Legacy 2.0i CVT 5250000     CVT
# 5      Legacy 2.0 BL5 AUTO B4 4787526     BL5
# 6         Legacy 2.0 BP5 AUTO 3550000     BP5
# 7     Legacy 2.0 BM5 AUTO CVT 5235000     BM5
Or, similarly:
df$chassis <- sapply(stri_extract_all_words(df$model), `[`, 3)
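If you'd rather not add a dependency, roughly the same idea can be sketched in base R by splitting on whitespace. Note that this is not what stringi does internally (it splits on word boundaries rather than whitespace), and the model vector below is just a subset of the question's data used for illustration.

```r
# Illustrative subset of the model strings from the question.
model <- c("Legacy 2.0 BG5 B4 AUTO", "Legacy 2.0i CVT Non Leather", "Legacy 2.0i CVT")

# Split each string on runs of whitespace; as.character() guards against
# the column being a factor.
words <- strsplit(as.character(model), "\\s+")

# Take the third word of each string (NA if a string has fewer than three words).
chassis <- vapply(words, function(w) w[3], character(1))
chassis
```

On the full data frame this would become df$chassis <- vapply(strsplit(as.character(df$model), "\\s+"), function(w) w[3], character(1)).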
I'm a big fan of tidyr for this sort of task, since it can extract all the pieces into separate columns:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr)
regx <- "(^[A-Za-z]+\\s+[0-9.a-z]+)\\s+([A-Z0-9]+)\\s*(.*)"
df %>%
  extract(model, c("a", "chassis", "b"), regx, remove = FALSE)
##                         model           a chassis           b    CRSP
## 1      Legacy 2.0 BG5 B4 AUTO  Legacy 2.0     BG5     B4 AUTO 3450000
## 2           Legacy 2.0 BH5 AT  Legacy 2.0     BH5          AT 3365000
## 3 Legacy 2.0i CVT Non Leather Legacy 2.0i     CVT Non Leather 4950000
## 4             Legacy 2.0i CVT Legacy 2.0i     CVT             5250000
## 5      Legacy 2.0 BL5 AUTO B4  Legacy 2.0     BL5     AUTO B4 4787526
## 6         Legacy 2.0 BP5 AUTO  Legacy 2.0     BP5        AUTO 3550000
## 7     Legacy 2.0 BM5 AUTO CVT  Legacy 2.0     BM5    AUTO CVT 5235000
You could get a bit more generic with this regex:
regx <- "(^[^ ]+\\s+[^ ]+)\\s+([^ ]+)\\s*(.*)"
Also note that you can use extract to get just the column you're after by dropping the capturing parentheses from the first and last groups, as follows:
regx <- "^[A-Za-z]+\\s+[0-9.a-z]+\\s+([A-Z0-9]+)\\s*.*"
df %>%
  extract(model, "chassis", regx, remove = FALSE)
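For comparison, the same single-group regex can be tried in base R with sub(), replacing the whole match with the captured group. This is only a sketch on a subset of the question's data, not part of the tidyr approach. One behavioural difference to be aware of: sub() returns the input unchanged when the pattern doesn't match, whereas extract gives NA.

```r
# Illustrative subset of the model strings from the question.
model <- c("Legacy 2.0 BG5 B4 AUTO", "Legacy 2.0i CVT Non Leather", "Legacy 2.0 BM5 AUTO CVT")

# Same pattern as above: only the third word is captured.
regx <- "^[A-Za-z]+\\s+[0-9.a-z]+\\s+([A-Z0-9]+)\\s*.*"

# The pattern spans the whole string, so replacing the match with "\\1"
# leaves just the captured chassis code.
chassis <- sub(regx, "\\1", model)
chassis
```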