Is there a more "r" way to substring two meaningful characters out of a longer string from a column in a data.table?
I have a data.table that has a column with "degree strings"... shorthand code for the degree someone got and the year they graduated.
> srcDT<- data.table(
alum=c("Paul Lennon","Stevadora Nicks","Fred Murcury"),
degree=c("W72","WG95","W88")
)
> srcDT
alum degree
1: Paul Lennon W72
2: Stevadora Nicks WG95
3: Fred Murcury W88
I need to extract the digits of the year from the degree, and put it in a new column called "degree_year"
No problem:
> srcDT[,degree_year:=substr(degree,nchar(degree)-1,nchar(degree))]
> srcDT
alum degree degree_year
1: Paul Lennon W72 72
2: Stevadora Nicks WG95 95
3: Fred Murcury W88 88
If only it were always that simple. The problem is, the degree strings only sometimes look like the above. More often, they look like this:
srcDT<- data.table(
alum=c("Ringo Harrison","Brian Wilson","Mike Jackson"),
degree=c("W72 C73","WG95 L95","W88 WG90")
)
I am only interested in the 2 numbers next to the characters I care about: W & WG (and if both W and WG are there, I only care about WG)
Here's how I solved it:
x <-srcDT$degree ##grab just the degree column
z <-character() ## create an empty character vector
degree.grep.pattern <-c("WG[0-9][0-9]","W[0-9][0-9]")
## define a vector of regex's, in the order
## I want them
for(i in 1:length(x)){ ## loop thru all elements in degree column
matched=F ## at the start of the loop, reset flag to F
for(j in 1:length(degree.grep.pattern)){
## loop thru all elements of the pattern vector
if(length(grep(degree.grep.pattern[j],x[i]))>0){
## see if you get a match
m <- regexpr(degree.grep.pattern[j],x[i])
## if you do, great! grab the index of the match
y<-regmatches(x[i],m) ## then subset down. y will equal "WG95"
matched=T ## set the flag to T
break ## stop looping
}
## if no match, go on to next element in pattern vector
}
if(matched){ ## after finishing the loop, check if you got a match
yr <- substr(y,nchar(y)-1,nchar(y))
## if yes, then grab the last 2 characters of it
}else{
#if you run thru the whole list and don't match any pattern at all, just
# take the last two characters from the affilitation
yr <- substr(x[i],nchar(as.character(x[i]))-1,nchar(as.character(x[i])))
}
z<-c(z,yr) ## add this result (95) to the character vector
}
srcDT$degree_year<-z ## set the column to the results.
> srcDT
alum degree degree_year
1: Ringo Harrison W72 C73 72
2: Brian Wilson WG95 L95 95
3: Mike Jackson W88 WG90 90
This works. 100% of the time. No errors, no mis-matches. The problem is: it doesn't scale. Given a data table with 10k rows, or 100k rows, it really slows down.
Is there a smarter, better way to do this? This solution is very "C" to me. Not very "R."
Thoughts on improvement?
Note: I gave a simplified example. In the actual data, there are about 30 different possible combinations of degrees, and combined with different years, there are something like 540 unique combinations of degree strings. Also, I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.
Extracting Substrings from a Character Vector in R Programming – substring() Function. substring() function in R Programming Language is used to extract substrings in a character vector. You can easily extract the required substring or character from the given string.
In R, we use the grepl() function to check if characters are present in a string or not. And the method returns a Boolean value, TRUE - if the specified sequence of characters are present in the string. FALSE - if the specified sequence of characters are not present in the string.
As it seem (per OPs) comments, there is no situation of "WG W"
, then a simple regex solution should do the job
srcDT[ , degree_year := gsub(".*WG?(\\d+).*", "\\1", degree)]
srcDT
# alum degree degree_year
# 1: Ringo Harrison W72 C73 72
# 2: Brian Wilson WG95 L95 95
# 3: Mike Jackson W88 WG90 90
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With