
Could someone please explain these gsub arguments precisely?

Tags: regex, r

I have this code for truncating a string at the first underscore "_", but I don't understand the operators and arguments passed to gsub that make this work. In particular, why do I have to use "\\1" as the replacement instead of ""? I do note that replacing with "" removes the entire string. I am also a bit confused by how the operators are being used, particularly the parentheses and brackets:

AAA <- "ATGAS_1121"
(aa <- gsub("([^_]*).*", "\\1", AAA))
## [1] "ATGAS"

Please note, this post draws heavily from: R remove part of string

Thanks, I appreciate it.

Edward Tyler asked Feb 10 '23 20:02



1 Answer

In regex, (...) is called a capturing group: it captures all the characters matched by the pattern inside the group. You can refer to those characters in the replacement by back-referencing the group's index number (\\1 for the first group).

gsub("([^_]*).*", "\\1", AAA)

[^_] is a negated character class that matches any single character except _, so ([^_]*) captures zero or more non-underscore characters from the start of the string. The following .* matches all the remaining characters. gsub replaces everything that was matched with the replacement string. If your code were

gsub("([^_]*).*", "", AAA)

it would remove all the characters, because the pattern matches the whole string even though only the leading non-underscore characters are captured. Replacing the whole match with the contents of group 1 therefore leaves just the part before the _ symbol.
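To make the difference between "matched" and "captured" concrete, the whole match and group 1 can be inspected separately. This is a minimal sketch using base R's regexec and regmatches; it is not part of the original answer, and the variable names are only illustrative:

```r
AAA <- "ATGAS_1121"

# regexec returns the full match followed by each capturing group
m <- regmatches(AAA, regexec("([^_]*).*", AAA))[[1]]
whole_match <- m[1]   # everything the pattern matched: "ATGAS_1121"
group_1     <- m[2]   # only what ([^_]*) captured:     "ATGAS"

# Replacing the whole match with group 1 keeps the prefix
with_backref <- gsub("([^_]*).*", "\\1", AAA)

# Replacing the whole match with "" leaves nothing
with_empty <- gsub("([^_]*).*", "", AAA)

print(c(whole_match, group_1, with_backref, with_empty))
```

Since the whole match covers the entire string, the replacement is all that survives, which is why "\\1" is needed to keep the prefix.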

You could achieve the same result using \K:

> gsub("[^_]*\\K.*", "", AAA, perl = TRUE)
[1] "ATGAS"

Since \K is a PCRE feature, you need to set perl = TRUE. \K keeps the text matched so far out of the overall regex match, so only the part after it (here, the underscore and everything following) is replaced.
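As a quick check that the two approaches agree, both patterns can be applied to a small vector, since gsub is vectorised over its input. The second input string below is a made-up example and does not come from the question:

```r
# Hypothetical inputs: the question's string plus one with several underscores
x <- c("ATGAS_1121", "A_B_C")

# Capture-group version: replace the whole match with group 1
by_group <- gsub("([^_]*).*", "\\1", x)

# \K version: the prefix is excluded from the match, the rest is deleted
by_keep <- gsub("[^_]*\\K.*", "", x, perl = TRUE)

print(by_group)
print(by_keep)
```

Both calls keep only the text before the first underscore, because [^_]* stops at the first _ it meets.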

Avinash Raj answered May 20 '23 15:05
