substring characters from a column in a data.table in R

Is there a more "R" way to extract two meaningful characters from a longer string in a data.table column?

I have a data.table that has a column with "degree strings"... shorthand code for the degree someone got and the year they graduated.

> srcDT<- data.table(
    alum=c("Paul Lennon","Stevadora Nicks","Fred Murcury"),
    degree=c("W72","WG95","W88")
    )

> srcDT
               alum degree
1:      Paul Lennon    W72
2:  Stevadora Nicks   WG95
3:     Fred Murcury    W88

I need to extract the digits of the year from the degree and put them in a new column called "degree_year".

No problem:

> srcDT[,degree_year:=substr(degree,nchar(degree)-1,nchar(degree))]

> srcDT
                alum degree degree_year
 1:      Paul Lennon    W72          72
 2:  Stevadora Nicks   WG95          95
 3:     Fred Murcury    W88          88

If only it were always that simple. The problem is, the degree strings only sometimes look like the above. More often, they look like this:

srcDT<- data.table(
  alum=c("Ringo Harrison","Brian Wilson","Mike Jackson"),
  degree=c("W72 C73","WG95 L95","W88 WG90")
)

I am only interested in the two digits next to the codes I care about: W and WG. If both W and WG are present, I only care about WG.

Here's how I solved it:

x <-srcDT$degree                     ##grab just the degree column
z <-character()                       ## create an empty character vector
degree.grep.pattern <-c("WG[0-9][0-9]","W[0-9][0-9]")
                                     ## define a vector of regex's, in the order
                                     ## I want them

for(i in 1:length(x)){               ## loop thru all elements in degree column
  matched=F                          ## at the start of the loop, reset flag to F
  for(j in 1:length(degree.grep.pattern)){
                                     ## loop thru all elements of the pattern vector

    if(length(grep(degree.grep.pattern[j],x[i]))>0){
                                     ## see if you get a match

      m <- regexpr(degree.grep.pattern[j],x[i])
                                     ## if you do, great! grab the index of the match
      y<-regmatches(x[i],m)          ## then subset down.  y will equal "WG95"
      matched=T                      ## set the flag to T
      break                          ## stop looping
    }
                                     ## if no match, go on to next element in pattern vector
  }

  if(matched){                       ## after finishing the loop, check if you got a match
    yr <- substr(y,nchar(y)-1,nchar(y))
                                     ## if yes, then grab the last 2 characters of it
  }else{
    # if you run thru the whole list and don't match any pattern at all, just
    # take the last two characters from the affiliation
    yr <- substr(x[i],nchar(as.character(x[i]))-1,nchar(as.character(x[i])))
  }
  z<-c(z,yr)                         ## add this result (95) to the character vector
}
srcDT$degree_year<-z                ## set the column to the results.

> srcDT
             alum   degree degree_year
1: Ringo Harrison  W72 C73          72
2:   Brian Wilson WG95 L95          95
3:   Mike Jackson W88 WG90          90

This works. 100% of the time. No errors, no mismatches. The problem is that it doesn't scale: given a data table with 10k or 100k rows, it really slows down.

Is there a smarter, better way to do this? This solution is very "C" to me. Not very "R."

Thoughts on improvement?

Note: I gave a simplified example. In the actual data, there are about 30 different possible combinations of degrees, and combined with different years, there are something like 540 unique combinations of degree strings. Also, I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.

asked Jan 26 '16 00:01

Ben Adams

1 Answer

Since it seems (per the OP's comments) that the "WG W" order never occurs, a simple regex solution should do the job:

srcDT[ , degree_year := gsub(".*WG?(\\d+).*", "\\1", degree)]
srcDT
#              alum   degree degree_year
# 1: Ringo Harrison  W72 C73          72
# 2:   Brian Wilson WG95 L95          95
# 3:   Mike Jackson W88 WG90          90
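
The one-liner above works because the leading `.*` is greedy, so the capture group ends up holding the digits after the last `W` or `WG` in the string. That is why it depends on WG never appearing before W: for a string such as "WG90 W88" it would return "88". If that ordering assumption cannot be relied on, or if there are 7 or 8 prioritised patterns as the question mentions, one possible alternative is to loop over the patterns (not the rows) and let `regexpr` work on the whole column at once. The following is only a sketch in base R; the function name `extract_degree_year` and its `patterns` argument are hypothetical, and the fallback to the last two characters mirrors the loop in the question.

```r
# Hypothetical helper (not part of the original answer): return the two
# characters that end the match of the highest-priority pattern in each string.
extract_degree_year <- function(degree,
                                patterns = c("WG[0-9][0-9]", "W[0-9][0-9]")) {
  degree <- as.character(degree)

  # Fallback, as in the question's loop: the last two characters of the string.
  n <- nchar(degree)
  year <- substr(degree, n - 1, n)

  # Tracks which rows have already been resolved by a higher-priority pattern.
  found <- rep(FALSE, length(degree))

  # Patterns are tried in priority order; each pass is vectorised over all rows.
  for (p in patterns) {
    m <- regexpr(p, degree)
    start <- as.integer(m)              # -1 where there is no match
    len <- attr(m, "match.length")
    hit <- !found & !is.na(start) & start > 0
    end <- start + len - 1              # position of the last matched character
    year[hit] <- substr(degree[hit], end[hit] - 1, end[hit])
    found <- found | hit
  }
  year
}

extract_degree_year(c("W72 C73", "WG95 L95", "W88 WG90", "WG90 W88", "C73"))
# [1] "72" "95" "90" "90" "73"
```

With the data.table from the question, this could be applied as `srcDT[, degree_year := extract_degree_year(degree)]`. The cost is one vectorised `regexpr` call per pattern rather than one call per row, which should matter more as the row count grows.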

answered Oct 05 '22 20:10

David Arenburg

Tags: regex, r, data.table
