
R regex: grep excluding hyphen/dash as boundary

Tags:

regex

r

I am trying to match an exact word in a vector of variable strings, and I am using word boundaries to do so. However, I would like a hyphen/dash not to be treated as a word boundary. Here is an example:

vector<-c(    
"ARNT",
"ACF, ASP, ACF64",
"BID",
"KTN1, KTN",
"NCRNA00181, A1BGAS, A1BG-AS",
"KTN1-AS1")

To match strings that contain "KTN1" I am using:

grep("(?i)(?=.*\\bKTN1\\b)", vector, perl=T) 

But this matches both "KTN1" and "KTN1-AS1".

Is there a way I could treat the dash as a character so that "KTN1-AS1" is considered a whole word?

user4451922 asked Mar 17 '23 15:03


2 Answers

To extract a particular word from a vector element, you need a function such as regmatches or str_extract_all (from the stringr package) rather than grep, since grep returns only the index of each element in which a match is found.

> vector<-c(    
+     "ARNT",
+     "ACF, ASP, ACF64",
+     "BID",
+     "KTN1, KTN",
+     "NCRNA00181, A1BGAS, A1BG-AS",
+     "KTN1-AS1")
> regmatches(vector, regexpr("(?i)\\bKTN1[-\\w]*\\b", vector, perl=T))
[1] "KTN1"     "KTN1-AS1"

OR

> library(stringr)
> unlist(str_extract_all(vector[grep("(?i)\\bKTN1[-\\w]*\\b", vector, perl=T)], regex("(?i).*\\bKTN1[-\\w]*\\b")))
[1] "KTN1"     "KTN1-AS1"
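To make the difference between these functions concrete, here is a minimal sketch on the same data. The variable name `genes` is only an illustrative choice (it avoids shadowing base R's `vector` function); the pattern is the one used above.

```r
# Same data as in the question; the name `genes` is illustrative only.
genes <- c("ARNT", "ACF, ASP, ACF64", "BID", "KTN1, KTN",
           "NCRNA00181, A1BGAS, A1BG-AS", "KTN1-AS1")

pattern <- "(?i)\\bKTN1[-\\w]*\\b"

# grep() alone: positions of the elements that contain a match
idx <- grep(pattern, genes, perl = TRUE)

# grep(value = TRUE): the full elements that contain a match
elems <- grep(pattern, genes, perl = TRUE, value = TRUE)

# regmatches() + regexpr(): only the matched text (first match per element)
hits <- regmatches(genes, regexpr(pattern, genes, perl = TRUE))

print(idx)    # 4 6
print(elems)  # "KTN1, KTN" "KTN1-AS1"
print(hits)   # "KTN1" "KTN1-AS1"
```

Note that this pattern deliberately lets the match run across hyphens, so it extracts "KTN1-AS1" as one word rather than excluding it.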

Update:

> grep("\\bKTN1(?=$|,)", vector, perl=T, value=T)
[1] "KTN1, KTN"

This returns the elements that contain the string KTN1 followed by a comma or the end of the string.

OR

> grep("\\bKTN1\\b(?!-)", vector, perl=T, value=T)
[1] "KTN1, KTN"

This returns the elements that contain the string KTN1 not followed by a hyphen.
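Both patterns above only inspect what follows KTN1. If a hyphen could also come before the word, one possible extension (not part of the original answer) is to use a negative lookbehind and a negative lookahead together. A minimal sketch, where the test strings in `x` are made up for illustration:

```r
# Hypothetical test strings; only the first two come from the question's data.
x <- c("KTN1, KTN", "KTN1-AS1", "AS-KTN1", "KTN10", "ACF, ktn1")

# KTN1 must be neither preceded nor followed by a hyphen or a word character.
pattern <- "(?i)(?<![-\\w])KTN1(?![-\\w])"

res <- grepl(pattern, x, perl = TRUE)
print(res)  # TRUE FALSE FALSE FALSE TRUE
```

Lookarounds require perl = TRUE; they assert context without consuming it, so the match itself is still just the word.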

Avinash Raj answered Mar 19 '23 03:03


I would keep this simple and create a DIY Boundary.

grep('(^|[^-\\w])KTN1([^-\\w]|$)', vector, ignore.case = TRUE, perl = TRUE)

We use capture groups to define the boundaries. On each side, we match either a character that is neither a hyphen nor a word character, or the beginning or end of the string. This is closer to the intent of the \b boundary, with the hyphen counted as part of the word.
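As a rough sketch of how this could be reused for other words, the pattern can be built inside a small helper. The function name `has_token` and the variable `genes` are hypothetical, and the sketch assumes `word` contains no regex metacharacters; perl = TRUE is used so that \w inside the brackets is read as the PCRE word-character shorthand.

```r
# Hypothetical helper: TRUE where `word` occurs as a whole token,
# with "-" treated as part of a token rather than as a boundary.
# Assumes `word` contains no regex metacharacters.
has_token <- function(word, x) {
  pattern <- paste0("(^|[^-\\w])", word, "([^-\\w]|$)")
  grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
}

genes <- c("ARNT", "ACF, ASP, ACF64", "BID", "KTN1, KTN",
           "NCRNA00181, A1BGAS, A1BG-AS", "KTN1-AS1")

ktn1_idx   <- which(has_token("KTN1", genes))      # only "KTN1, KTN"
ktn1as_idx <- which(has_token("KTN1-AS1", genes))  # only "KTN1-AS1"
a1bg_any   <- any(has_token("A1BG", genes))        # "A1BGAS" and "A1BG-AS" do not count

print(ktn1_idx)    # 4
print(ktn1as_idx)  # 6
print(a1bg_any)    # FALSE
```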

hwnd answered Mar 19 '23 03:03