Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Lookaround regular expression pattern in R

I am stuck on creating the right regular expression pattern that will split the content of my data frame columns without making me loose any of the elements. I have to use the separate() function from the tidyr package as this is part of a longer processing pipeline. Since I don't want to loose any of the elements in the string, I am developing a lookahead/lookbehind expression.

The strings that need to be split can follow one of the following patterns:

  • only letters (e.g. 'abcd')
  • letters-dash-numbers (e.g. 'abcd-123')
  • letters-numbers (e.g. 'abcd1234')
    The column content should be split into 3 columns max, one column per group.

I would like to split every time the element changes, so after the letters and after the dash. There can be one or more letters, one or more numbers, but only ever one dash. Strings that only contain letters, don't need to be split.

Here is what I have tried:

library(tidyr) 
myDat = data.frame(drugName = c("ab-1234", 'ab-1234', 'ab-1234',
                                'placebo', 'anotherdrug', 'andanother',
                                'xyz123', 'xyz123', 'placebo', 'another',
                                'omega-3', 'omega-3', 'another', 'placebo'))
drugColNames = paste0("X", 1:3) 

# This pattern doesn't split strings that only consist of number and letters, e.g. "xyz123" is not split after the letters.
pat = '(?=-[0-9+])|(?<=[a-z+]-)'

# This pattern splits at all the right places, but the last group (the numbers), is separated and not kept together.
# pat = '(?=-[0-9+]|[0-9+])|(?<=[a-z+]-)'

splitDat = separate(myDat, drugName,
         into = drugColNames,
         sep = pat)

The output from the splitting should be:

"ab-1234" --> "ab" "-" "123"
"xyz123" --> "xyz" "123"
"omega-3" --> "omega" "-" "3"

Thanks a lot for helping out in this. :)

like image 790
damico Avatar asked Nov 17 '20 09:11

damico


2 Answers

It would be easier to use extract here since we don't have a fixed separator which will also avoid using regex lookarounds.

tidyr::extract(myDat, drugName, drugColNames, '([a-z]+)(-)?(\\d+)?', remove = FALSE)

#      drugName          X1 X2   X3
#1      ab-1234          ab  - 1234
#2      ab-1234          ab  - 1234
#3      ab-1234          ab  - 1234
#4      placebo     placebo        
#5  anotherdrug anotherdrug        
#6   andanother  andanother        
#7       xyz123         xyz     123
#8       xyz123         xyz     123
#9      placebo     placebo        
#10     another     another        
#11     omega-3       omega  -    3
#12     omega-3       omega  -    3
#13     another     another        
#14     placebo     placebo        
like image 131
Ronak Shah Avatar answered Oct 18 '22 13:10

Ronak Shah


You may use

> extract(myDat, "drugName",drugColNames, "^([[:alpha:]]+)(\\W*)(\\d*)$", remove=FALSE)
      drugName          X1 X2   X3
1      ab-1234          ab  - 1234
2      ab-1234          ab  - 1234
3      ab-1234          ab  - 1234
4      placebo     placebo        
5  anotherdrug anotherdrug        
6   andanother  andanother        
7       xyz123         xyz     123
8       xyz123         xyz     123
9      placebo     placebo        
10     another     another        
11     omega-3       omega  -    3
12     omega-3       omega  -    3
13     another     another        
14     placebo     placebo        
> 

The regex used to extract data is

^([[:alpha:]]+)(\W*)(\d*)$

See the regex demo.

Details

  • ^ - start of string
  • ([[:alpha:]]+) - Group 1 (Column X1): one or more letters
  • (\W*) - Group 2 (Column X2): one or more non-word chars
  • (\d*) - Group 3 (Column X3): one or more digits
  • $ - end of string.

To remove the original column, remove remove=FALSE.

like image 27
Wiktor Stribiżew Avatar answered Oct 18 '22 12:10

Wiktor Stribiżew