
How to extract substring between patterns "_" and "." in R [duplicate]

Tags: r, gsub

I have many filenames which look like:

txt= "MA0051_IRF2.xml"

I want to extract IRF2 which is between "_" and ".". How do I do this in R?

Paul.j asked May 07 '14 12:05


2 Answers

To achieve this, you need a regexp that

  • matches an (optional) arbitrary string in front of the _ : .*
  • matches a literal _ : [_]
  • matches everything up to (but not including) the next . and stores it in capturing group no. 1 : ([^.]+)
  • matches a literal . : [.]
  • matches an (optional) arbitrary string after the . : .*

In your call to gsub, you then

  • use the regular expression we built in the previous step
  • replace the whole string with the contents of the first capturing group: \\1 (we need to escape the backslash, hence the double backslash)

Example:

gsub(".*[_]([^.]+)[.].*", "\\1", "MA0051_IRF2.xml")
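The same pattern can be applied to a whole vector of filenames at once. The following is a minimal sketch, not part of the original answer: the helper name extract_id and every filename except the first are made up for illustration. It uses sub() instead of gsub(), which is sufficient here because the pattern already spans the whole string. It also marks names that do not match as NA, since sub() returns non-matching input unchanged.

```r
# Hypothetical helper: pull out the part between "_" and the following "."
extract_id <- function(files) {
  pattern <- ".*[_]([^.]+)[.].*"
  # Replace the whole string with capturing group 1
  ids <- sub(pattern, "\\1", files)
  # sub() leaves non-matching strings untouched, so flag them explicitly
  ids[!grepl(pattern, files)] <- NA_character_
  ids
}

# Only the first filename is from the question; the others are invented
files <- c("MA0051_IRF2.xml", "MA0052_MEF2A.xml", "README")
ids <- extract_id(files)
print(ids)
# "IRF2" "MEF2A" NA
```

Whether non-matching names should become NA or be kept as they are depends on the data; dropping the grepl() line restores the plain sub() behaviour.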
Frank Schmitt answered Oct 23 '22 12:10


Another possibility, using the stringr package (where x is a character vector of filenames):

 str_extract(x, perl("(?<=_)(.+)(?=\\.)"))
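If stringr is not installed (or the perl() helper is unavailable in your version), the same lookaround idea can be sketched in base R with regexpr() and regmatches(). This is an assumption-laden variant rather than the code above: it swaps .+ for [^.]+ so that the match stops at the first dot instead of running to the last one, and it reuses the txt variable from the question.

```r
txt <- "MA0051_IRF2.xml"

# perl = TRUE enables lookbehind (?<=_) and lookahead (?=[.])
m <- regexpr("(?<=_)[^.]+(?=[.])", txt, perl = TRUE)

# regmatches() returns only the matched text; elements without a match are dropped
id <- regmatches(txt, m)
print(id)
# "IRF2"
```

Because regmatches() silently drops elements that do not match, the result can be shorter than the input when this is applied to a vector of filenames.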
droopy answered Oct 23 '22 12:10