
Use RegEx in R to retrieve string before second occurrence of a period ('.')

Tags:

regex

r

What regular expression can retrieve (e.g. with sub()) the characters before the second period? Given a character vector like:

v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")

I would like to have returned this:

[1] "m_s.E1" "m_xs.P1"

user3375672 asked Mar 19 '23 06:03


2 Answers

> sub( "(^[^.]+[.][^.]+)(.+$)", "\\1", v)
[1] "m_s.E1"  "m_xs.P1"

Now to explain it: The symbols inside the first and third pairs of "[ ]" match any character except a period ("character classes"), and the "+" that follows each lets that be one or more such characters. The [.] between them matches only a literal period, here the first one, and because [^.]+ cannot match a period, the second period terminates the first section's match. Pairs of parentheses let you specify partial sections of the matched characters (capture groups), and there are two of them. The second section is any character (the unbracketed period symbol) repeated one or more times until the end of the string, $. The "\\1" specifies that only the first partial match is returned as the value.
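To make the two parenthesized sections concrete, here is a minimal sketch that reuses the same pattern and returns each section in turn. The variable names (pattern, first_part, rest_part) are only illustrative and are not part of the original answer.

```r
# Same input vector as in the question
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")

# Two capture groups: everything up to (not including) the second period,
# then the remainder of the string
pattern <- "(^[^.]+[.][^.]+)(.+$)"

# "\\1" keeps only the first group
first_part <- sub(pattern, "\\1", v)

# "\\2" keeps only the second group, which starts at the second period
rest_part <- sub(pattern, "\\2", v)

print(first_part)
print(rest_part)
```

Swapping "\\1" for "\\2" is a quick way to check which part of the string each group actually consumed.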

The ^ operator means different things inside and outside square brackets. Outside, it anchors the match at the zero-length position at the beginning of the string. Inside, at the beginning of a character class specification, it negates the class.
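The two roles of ^ can be seen side by side in a small sketch. The variable names (starts_with_m, only_periods) are made up for illustration.

```r
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")

# Outside brackets: ^ anchors the pattern at the start of the string,
# so this asks "does the string begin with 'm'?"
starts_with_m <- grepl("^m", v)

# Inside brackets: ^ negates the class, so [^.] is "any character that
# is not a period"; deleting those leaves only the periods
only_periods <- gsub("[^.]", "", v)

print(starts_with_m)
print(only_periods)
```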

This is a good use case for "character classes" which are described in the help page found by typing:

?regex
IRTFM answered Mar 26 '23 04:03


This is not a regex solution, but the qdap package has the beg2char function (from the beginning of the string to the nth occurrence of a character) to handle this:

library(qdap)
beg2char(v, ".", 2)

## [1] "m_s.E1"  "m_xs.P1"
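If installing qdap is not an option, a similar non-regex result can be sketched in base R by splitting on the literal period and re-joining the first two pieces. This is an alternative approach, not how beg2char is implemented, and the helper name before_nth is hypothetical.

```r
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")

# Hypothetical helper: keep everything before the n-th occurrence of `char`
before_nth <- function(x, char, n) {
  # fixed = TRUE treats `char` as a literal string, not a regex
  parts <- strsplit(x, char, fixed = TRUE)
  vapply(
    parts,
    function(p) paste(head(p, n), collapse = char),
    character(1),
    USE.NAMES = FALSE
  )
}

result <- before_nth(v, ".", 2)
print(result)
```

Strings with fewer than n occurrences of the character come back unchanged, apart from a trailing occurrence, which strsplit drops.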
Tyler Rinker answered Mar 26 '23 04:03