Create a new variable based on a regular expression

Tags:

regex

r

How can I create a new variable in an R data frame based on the result of a regular expression? Below is a minimal example of the data:

df <- data.frame(model=c("Legacy 2.0  BG5 B4 AUTO","Legacy 2.0 BH5 AT","Legacy 2.0i CVT Non Leather","Legacy 2.0i CVT","Legacy 2.0 BL5 AUTO B4",
                 "Legacy 2.0 BP5 AUTO","Legacy 2.0 BM5 AUTO CVT"), CRSP=c(3450000,3365000,4950000,5250000,4787526,3550000,5235000))

df
                        model    CRSP
1     Legacy 2.0  BG5 B4 AUTO 3450000
2           Legacy 2.0 BH5 AT 3365000
3 Legacy 2.0i CVT Non Leather 4950000
4             Legacy 2.0i CVT 5250000
5      Legacy 2.0 BL5 AUTO B4 4787526
6         Legacy 2.0 BP5 AUTO 3550000
7     Legacy 2.0 BM5 AUTO CVT 5235000

I would like to create a new variable 'chassis' whose value is the third whitespace-separated element of the corresponding 'model' string, so that I end up with:

df
                        model    CRSP chassis
1     Legacy 2.0  BG5 B4 AUTO 3450000     BG5
2           Legacy 2.0 BH5 AT 3365000     BH5
3 Legacy 2.0i CVT Non Leather 4950000     CVT
4             Legacy 2.0i CVT 5250000     CVT
5      Legacy 2.0 BL5 AUTO B4 4787526     BL5
6         Legacy 2.0 BP5 AUTO 3550000     BP5
7     Legacy 2.0 BM5 AUTO CVT 5235000     BM5

I need a way to extract the appropriate element from each row and place it in the new variable. Any assistance would be greatly appreciated.


asked Apr 21 '15 12:04

amo

2 Answers

Here's a possible solution using the stringi package:

library(stringi)
df$chassis <- stri_extract_all_words(df$model, simplify = TRUE)[, 3]
df
#                         model    CRSP chassis
# 1     Legacy 2.0  BG5 B4 AUTO 3450000     BG5
# 2           Legacy 2.0 BH5 AT 3365000     BH5
# 3 Legacy 2.0i CVT Non Leather 4950000     CVT
# 4             Legacy 2.0i CVT 5250000     CVT
# 5      Legacy 2.0 BL5 AUTO B4 4787526     BL5
# 6         Legacy 2.0 BP5 AUTO 3550000     BP5
# 7     Legacy 2.0 BM5 AUTO CVT 5235000     BM5

Or similarly

df$chassis <- sapply(stri_extract_all_words(df$model), `[`, 3)
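If adding a package is not an option, the same idea can be sketched in base R by splitting on runs of whitespace and taking the third token. This is a swapped-in technique, not part of the stringi approach above: stri_extract_all_words relies on ICU word boundaries, while this sketch treats anything between whitespace as a token, so the two may differ on strings containing punctuation such as hyphens. The data frame is rebuilt here with stringsAsFactors = FALSE (an assumption, so that model is a character column on older R versions too).

```r
# Rebuild the example data; stringsAsFactors = FALSE keeps 'model' as character
df <- data.frame(
  model = c("Legacy 2.0  BG5 B4 AUTO", "Legacy 2.0 BH5 AT",
            "Legacy 2.0i CVT Non Leather", "Legacy 2.0i CVT",
            "Legacy 2.0 BL5 AUTO B4", "Legacy 2.0 BP5 AUTO",
            "Legacy 2.0 BM5 AUTO CVT"),
  CRSP = c(3450000, 3365000, 4950000, 5250000, 4787526, 3550000, 5235000),
  stringsAsFactors = FALSE
)

# Split each string on one or more whitespace characters
tokens <- strsplit(trimws(df$model), "\\s+")

# Take the third token; rows with fewer than three tokens become NA
df$chassis <- vapply(tokens, function(x) x[3], character(1))
```

Using vapply rather than sapply guarantees a character vector of the same length as the data frame, even when some rows have too few tokens.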


answered Oct 09 '22 18:10

David Arenburg

I'm a big fan of tidyr for this sort of task; its extract function pulls all the pieces into separate columns:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr)

regx <- "(^[A-Za-z]+\\s+[0-9.a-z]+)\\s+([A-Z0-9]+)\\s*(.*)"

df %>%
    extract(model, c("a", "chassis", "b"), regx, remove=FALSE)

##                         model           a chassis           b    CRSP
## 1     Legacy 2.0  BG5 B4 AUTO  Legacy 2.0     BG5     B4 AUTO 3450000
## 2           Legacy 2.0 BH5 AT  Legacy 2.0     BH5          AT 3365000
## 3 Legacy 2.0i CVT Non Leather Legacy 2.0i     CVT Non Leather 4950000
## 4             Legacy 2.0i CVT Legacy 2.0i     CVT             5250000
## 5      Legacy 2.0 BL5 AUTO B4  Legacy 2.0     BL5     AUTO B4 4787526
## 6         Legacy 2.0 BP5 AUTO  Legacy 2.0     BP5        AUTO 3550000
## 7     Legacy 2.0 BM5 AUTO CVT  Legacy 2.0     BM5    AUTO CVT 5235000

You could get a bit more generic with this regex:

regx <- "(^[^ ]+\\s+[^ ]+)\\s+([^ ]+)\\s*(.*)"
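To see what the more generic pattern captures without loading tidyr, one option is a base R check with regexec and regmatches. This is only a sketch for inspecting the groups, not how extract itself is implemented; the perl = TRUE argument is my choice here so the pattern is interpreted by PCRE.

```r
model <- c("Legacy 2.0  BG5 B4 AUTO", "Legacy 2.0 BH5 AT",
           "Legacy 2.0i CVT Non Leather", "Legacy 2.0i CVT",
           "Legacy 2.0 BL5 AUTO B4", "Legacy 2.0 BP5 AUTO",
           "Legacy 2.0 BM5 AUTO CVT")

regx <- "(^[^ ]+\\s+[^ ]+)\\s+([^ ]+)\\s*(.*)"

# Each list element holds the full match followed by the three capture groups
m <- regmatches(model, regexec(regx, model, perl = TRUE))

# The chassis is the second capture group, i.e. the third element of each match
chassis <- vapply(m, function(x) x[3], character(1))
```

Strings that do not match the pattern yield an empty match, so the corresponding chassis value is NA.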

Also note that you can use extract to get just the column you're after by dropping the grouping parentheses on the first and last groups, as follows:

regx <- "^[A-Za-z]+\\s+[0-9.a-z]+\\s+([A-Z0-9]+)\\s*.*"

df %>% 
    extract(model, "chassis", regx, remove=FALSE)
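The single-group pattern also works with base R's sub, replacing the whole string with the captured group via the backreference "\\1". This is a hedged sketch of an alternative, not part of the tidyr approach; note that sub returns the input unchanged when the pattern does not match, so the sketch sets those rows to NA explicitly. The perl = TRUE argument is again my choice.

```r
model <- c("Legacy 2.0  BG5 B4 AUTO", "Legacy 2.0 BH5 AT",
           "Legacy 2.0i CVT Non Leather", "Legacy 2.0i CVT",
           "Legacy 2.0 BL5 AUTO B4", "Legacy 2.0 BP5 AUTO",
           "Legacy 2.0 BM5 AUTO CVT")

regx <- "^[A-Za-z]+\\s+[0-9.a-z]+\\s+([A-Z0-9]+)\\s*.*"

# Replace the entire matched string with the first capture group
chassis <- sub(regx, "\\1", model, perl = TRUE)

# sub leaves non-matching strings untouched, so mark them as missing
chassis[!grepl(regx, model, perl = TRUE)] <- NA_character_
```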

answered Oct 09 '22 18:10

Tyler Rinker
