
Create a new variable based on a regular expression

Tags: regex, r

How can I create a new variable in an R data frame based on the result of a regular expression? Below is a minimal example of the data:

df <- data.frame(model=c("Legacy 2.0  BG5 B4 AUTO","Legacy 2.0 BH5 AT","Legacy 2.0i CVT Non Leather","Legacy 2.0i CVT","Legacy 2.0 BL5 AUTO B4",
                 "Legacy 2.0 BP5 AUTO","Legacy 2.0 BM5 AUTO CVT"), CRSP=c(3450000,3365000,4950000,5250000,4787526,3550000,5235000))

df
                        model    CRSP
1     Legacy 2.0  BG5 B4 AUTO 3450000
2           Legacy 2.0 BH5 AT 3365000
3 Legacy 2.0i CVT Non Leather 4950000
4             Legacy 2.0i CVT 5250000
5      Legacy 2.0 BL5 AUTO B4 4787526
6         Legacy 2.0 BP5 AUTO 3550000
7     Legacy 2.0 BM5 AUTO CVT 5235000

I would like to create a new variable 'chassis' whose value is the third whitespace-separated element of the corresponding 'model' string, thus ending up with:

df
                        model    CRSP chassis
1     Legacy 2.0  BG5 B4 AUTO 3450000     BG5
2           Legacy 2.0 BH5 AT 3365000     BH5
3 Legacy 2.0i CVT Non Leather 4950000     CVT
4             Legacy 2.0i CVT 5250000     CVT
5      Legacy 2.0 BL5 AUTO B4 4787526     BL5
6         Legacy 2.0 BP5 AUTO 3550000     BP5
7     Legacy 2.0 BM5 AUTO CVT 5235000     BM5

I need a way to extract the appropriate element from each row and place it in the new variable. Any assistance would be greatly appreciated.

amo asked Apr 21 '15 12:04


2 Answers

Here's a possible solution using the stringi package:

library(stringi)
df$chassis <- stri_extract_all_words(df$model, simplify = TRUE)[, 3]
df
#                         model    CRSP chassis
# 1     Legacy 2.0  BG5 B4 AUTO 3450000     BG5
# 2           Legacy 2.0 BH5 AT 3365000     BH5
# 3 Legacy 2.0i CVT Non Leather 4950000     CVT
# 4             Legacy 2.0i CVT 5250000     CVT
# 5      Legacy 2.0 BL5 AUTO B4 4787526     BL5
# 6         Legacy 2.0 BP5 AUTO 3550000     BP5
# 7     Legacy 2.0 BM5 AUTO CVT 5235000     BM5

Or, similarly:

df$chassis <- sapply(stri_extract_all_words(df$model), `[`, 3)
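For comparison, the same "take the third token" idea can be sketched in base R without stringi. This is not part of the original answer. It assumes the tokens are separated by runs of whitespace, which is not quite the same as the word-boundary rules that stri_extract_all_words applies. The name `tokens` is only illustrative.

```r
# Rebuild the example data from the question
df <- data.frame(
  model = c("Legacy 2.0  BG5 B4 AUTO", "Legacy 2.0 BH5 AT",
            "Legacy 2.0i CVT Non Leather", "Legacy 2.0i CVT",
            "Legacy 2.0 BL5 AUTO B4", "Legacy 2.0 BP5 AUTO",
            "Legacy 2.0 BM5 AUTO CVT"),
  CRSP = c(3450000, 3365000, 4950000, 5250000, 4787526, 3550000, 5235000),
  stringsAsFactors = FALSE
)

# Split each model string on one or more whitespace characters,
# so the double space in row 1 does not produce an empty token
tokens <- strsplit(trimws(df$model), "\\s+")

# Take the third token of each row; rows with fewer than three
# tokens would get NA rather than raising an error
df$chassis <- vapply(tokens, function(x) x[3], character(1))
```

Using vapply rather than sapply here is a design choice: it guarantees a character vector of the same length as the data frame, even if some rows are short.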

David Arenburg answered Oct 09 '22 18:10


I'm a big fan of tidyr for this sort of task, since it can extract all the pieces into separate columns:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr)

regx <- "(^[A-Za-z]+\\s+[0-9.a-z]+)\\s+([A-Z0-9]+)\\s*(.*)"

df %>%
    extract(model, c("a", "chassis", "b"), regx, remove=FALSE)

##                         model           a chassis           b    CRSP
## 1     Legacy 2.0  BG5 B4 AUTO  Legacy 2.0     BG5     B4 AUTO 3450000
## 2           Legacy 2.0 BH5 AT  Legacy 2.0     BH5          AT 3365000
## 3 Legacy 2.0i CVT Non Leather Legacy 2.0i     CVT Non Leather 4950000
## 4             Legacy 2.0i CVT Legacy 2.0i     CVT             5250000
## 5      Legacy 2.0 BL5 AUTO B4  Legacy 2.0     BL5     AUTO B4 4787526
## 6         Legacy 2.0 BP5 AUTO  Legacy 2.0     BP5        AUTO 3550000
## 7     Legacy 2.0 BM5 AUTO CVT  Legacy 2.0     BM5    AUTO CVT 5235000

You could get a bit more generic with this regex:

regx <- "(^[^ ]+\\s+[^ ]+)\\s+([^ ]+)\\s*(.*)"
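As a rough check of what this more generic pattern captures, here is a small base R sketch that applies it to three of the example strings with regexec and regmatches. The vector `model` and the matrix `parts` are illustrative names, not part of the original answer, and perl = TRUE is an assumption made to keep the regex flavour predictable.

```r
# Three of the model strings from the question
model <- c("Legacy 2.0  BG5 B4 AUTO",
           "Legacy 2.0i CVT Non Leather",
           "Legacy 2.0i CVT")

# The generic pattern: two non-space tokens, then the chassis token,
# then whatever is left
regx <- "(^[^ ]+\\s+[^ ]+)\\s+([^ ]+)\\s*(.*)"

# regexec returns the full match plus one entry per capture group;
# regmatches turns those positions into the matched substrings
m <- regmatches(model, regexec(regx, model, perl = TRUE))

# One row per input string: full match, group 1, group 2, group 3
parts <- do.call(rbind, m)

# Column 3 holds the second capture group, i.e. the chassis
chassis <- unname(parts[, 3])
```

With these inputs, `chassis` comes out as "BG5", "CVT", "CVT", and the last column of `parts` is an empty string for the row that has nothing after the chassis token.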

Also note that you can use extract to get just the column you're after by dropping the grouping parentheses from the first and last groups, as follows:

regx <- "^[A-Za-z]+\\s+[0-9.a-z]+\\s+([A-Z0-9]+)\\s*.*"

df %>% 
    extract(model, "chassis", regx, remove=FALSE)

Tyler Rinker answered Oct 09 '22 18:10