
Substring characters from a column in a data.table in R

Is there a more "R" way to extract two meaningful characters from a longer string in a data.table column?

I have a data.table that has a column with "degree strings"... shorthand code for the degree someone got and the year they graduated.

> library(data.table)
> srcDT<- data.table(
    alum=c("Paul Lennon","Stevadora Nicks","Fred Murcury"),
    degree=c("W72","WG95","W88")
    )

> srcDT
               alum degree
1:      Paul Lennon    W72
2:  Stevadora Nicks   WG95
3:     Fred Murcury    W88

I need to extract the digits of the year from the degree and put them in a new column called "degree_year".

No problem:

> srcDT[,degree_year:=substr(degree,nchar(degree)-1,nchar(degree))]

> srcDT
                alum degree degree_year
 1:      Paul Lennon    W72          72
 2:  Stevadora Nicks   WG95          95
 3:     Fred Murcury    W88          88

If only it were always that simple. The problem is, the degree strings only sometimes look like the above. More often, they look like this:

srcDT<- data.table(
  alum=c("Ringo Harrison","Brian Wilson","Mike Jackson"),
  degree=c("W72 C73","WG95 L95","W88 WG90")
)

I am only interested in the two digits next to the codes I care about: W and WG (and if both W and WG are present, I only care about WG).

Here's how I solved it:

x <-srcDT$degree                     ##grab just the degree column
z <-character()                       ## create an empty character vector
degree.grep.pattern <-c("WG[0-9][0-9]","W[0-9][0-9]")
                                     ## define a vector of regex's, in the order
                                     ## I want them

for(i in 1:length(x)){               ## loop thru all elements in degree column
  matched=F                          ## at the start of the loop, reset flag to F
  for(j in 1:length(degree.grep.pattern)){
                                     ## loop thru all elements of the pattern vector

    if(length(grep(degree.grep.pattern[j],x[i]))>0){
                                     ## see if you get a match

      m <- regexpr(degree.grep.pattern[j],x[i])
                                     ## if you do, great! grab the index of the match
      y<-regmatches(x[i],m)          ## then subset down.  y will equal "WG95"
      matched=T                      ## set the flag to T
      break                          ## stop looping
    }
                                     ## if no match, go on to next element in pattern vector
  }

  if(matched){                       ## after finishing the loop, check if you got a match
    yr <- substr(y,nchar(y)-1,nchar(y))
                                     ## if yes, then grab the last 2 characters of it
  }else{
    # if you run thru the whole list and don't match any pattern at all, just
    # take the last two characters from the affiliation
    yr <- substr(x[i],nchar(as.character(x[i]))-1,nchar(as.character(x[i])))
  }
  z<-c(z,yr)                         ## add this result (95) to the character vector
}
srcDT$degree_year<-z                ## set the column to the results.

> srcDT
             alum   degree degree_year
1: Ringo Harrison  W72 C73          72
2:   Brian Wilson WG95 L95          95
3:   Mike Jackson W88 WG90          90

This works. 100% of the time. No errors, no mis-matches. The problem is: it doesn't scale. Given a data table with 10k rows, or 100k rows, it really slows down.

Is there a smarter, better way to do this? This solution is very "C" to me. Not very "R."

Thoughts on improvement?

Note: I gave a simplified example. In the actual data, there are about 30 different possible combinations of degrees, and combined with different years, there are something like 540 unique combinations of degree strings. Also, I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.

Ben Adams asked Jan 26 '16 00:01



1 Answer

Since it seems (per the OP's comments) that a "WG" degree is never followed by a "W" degree in the same string, a simple regex solution should do the job:

srcDT[ , degree_year := gsub(".*WG?(\\d+).*", "\\1", degree)]
srcDT
#              alum   degree degree_year
# 1: Ringo Harrison  W72 C73          72
# 2:   Brian Wilson WG95 L95          95
# 3:   Mike Jackson W88 WG90          90
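
A quick check of how this pattern behaves (this is an illustration, not part of the original answer; the last two input strings are made up). The leading `.*` is greedy, so the captured digits come from the last W or WG code in the string, and a string with no W code at all comes back unchanged:

```r
# Pattern from the answer above: greedy prefix, then W or WG, then the digits.
pattern <- ".*WG?(\\d+).*"

# The three degree strings from the question.
sample_years <- gsub(pattern, "\\1", c("W72 C73", "WG95 L95", "W88 WG90"))
print(sample_years)   # "72" "95" "90"

# Hypothetical inputs, not in the question's data:
reversed_year <- gsub(pattern, "\\1", "WG90 W88")  # WG listed before W
print(reversed_year)  # "88" (the last W code wins, not the WG one)

no_w_code <- gsub(pattern, "\\1", "L95")           # no W or WG at all
print(no_w_code)      # "L95" (no match, so the input is returned as-is)
```

So the one-liner relies on the "no WG before W" assumption, and rows without a W or WG code would need separate handling.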
David Arenburg answered Oct 05 '22 20:10
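
For the situation described in the question's note (7 or 8 patterns with a fixed priority order), one possible sketch is to keep the ordered pattern vector but loop over the patterns instead of over the rows. `extract_degree_year` is a hypothetical helper name, the last two sample strings are made up, and the sketch assumes the degree column contains no NA values and that every pattern ends in the two year digits:

```r
# Patterns in priority order, highest priority first (as in the question).
degree_patterns <- c("WG[0-9][0-9]", "W[0-9][0-9]")

# Hypothetical helper: vectorized over x, loops only over the patterns.
extract_degree_year <- function(x, patterns) {
  x <- as.character(x)
  n <- nchar(x)
  yr <- substr(x, n - 1L, n)         # fallback: last two characters
  done <- rep(FALSE, length(x))      # rows already settled by an earlier pattern
  for (p in patterns) {
    m <- regexpr(p, x)               # first match position per row, -1 if none
    found <- as.integer(m) != -1L
    tok <- character(length(x))
    tok[found] <- regmatches(x, m)   # matched text, e.g. "WG95"
    hit <- found & !done             # only rows no earlier pattern has claimed
    yr[hit] <- substr(tok[hit], nchar(tok[hit]) - 1L, nchar(tok[hit]))
    done <- done | hit
  }
  yr
}

# First three strings are from the question; the last two are made up.
degrees <- c("W72 C73", "WG95 L95", "W88 WG90", "WG90 W88", "L01")
years <- extract_degree_year(degrees, degree_patterns)
print(years)   # "72" "95" "90" "90" "01"
```

With the question's table this would be called as `srcDT[, degree_year := extract_degree_year(degree, degree_patterns)]`. The loop runs once per pattern rather than once per row, so each `regexpr()` call works on the whole column at once.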