What is a correct regular expression to extract the string "(procedure)" -or in general text from inside the parenthesis - from the strings below
input string examples are
Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)
another example
Urinary tract infection prophylaxis (procedure)
Possible approaches are:
Go to end of the text, and look for first opening parenthesis and take subset from that position to the end of the text
from beginning of text, identify last '(' char and do that position to end as substring
Other strings can be (different "tag" is extracted)
[1] "Xanthoma of eyelid (disorder)" "Ventricular tachyarrhythmia (disorder)"
[3] "Abnormal urine odor (finding)" "Coloboma of iris (disorder)"
[5] "Macroencephaly (disorder)" "Right main coronary artery thrombosis (disorder)"
(general regex is sought) (or a solution in R is even better)
If it is the last part of the string then this regex will do it:
/\(([^()]*)\)$/
Explaination: Look for an open (
and match everything in between it that isn't (
or )
and then has a )
at the end of the string.
https://regex101.com/r/cEsQtf/1
sub can do that with the right regex
Text = c("Positron emission tomography using flutemetamol (18F)
with computed tomography of brain (procedure)",
"Urinary tract infection prophylaxis (procedure)",
"Xanthoma of eyelid (disorder)",
"Ventricular tachyarrhythmia (disorder)",
"Abnormal urine odor (finding)",
"Coloboma of iris (disorder)",
"Macroencephaly (disorder)",
"Right main coronary artery thrombosis (disorder)")
sub(".*\\((.*)\\).*", "\\1", Text)
[1] "procedure" "procedure" "disorder" "disorder" "finding" "disorder"
[7] "disorder" "disorder"
Addendum: Detailed explanation of the regex
The question asks to find the content of the final set of parentheses in the strings. This expression is slightly confusing because it includes two different uses of parentheses, One is to represent parentheses in the string being processed and the other is to set up a "capturing group", the way that we specify what part should be returned by the expression. The expression is made up of five basic units:
1. Initial .* - matches everything up to the final open parenthesis.
Note that this is relying on "greedy matching"
2. \\( ... \\) - matches the final set of parentheses.
Because ( by itself means something else, we need to "escape" the
parentheses by preceding them with \. That is we want the regular
expression to say \( ... \). However, the way R interprets strings,
if we just typed \( and \), R would interpret the \ as escaping the (
and so interpret this as just ( ... ). So we escape the backslash.
R will interpret \\( ... \\) as \( ... \) meaning the literal
characters ( & ).
3. ( ... ) Inside the pair in part 2
This is making use of the special meaning of parentheses. When we
enclose an expression in parentheses, whatever value is inside them
will be stored in a variable for later use. That variable is called
\1, which is what was used in the substitution pattern. Again, is
we just wrote \1, R would interpret it as if we were trying to escape
the 1. Writing \\1 is interpreted as the character \ followed by 1,
i.e. \1.
4. Central .* Inside the pair in part 3
This is what we are looking for, all characters inside the parentheses.
5. Final .*
This is in the expression to match any characters that may follow the
final set of parentheses.
The sub function will use this to replace the matched pattern (in this case, all characters in the string) with the substitution pattern \1 i.e. the contents of the variable containing whatever was in the first (in our case only) capturing group - the stuff inside the final parentheses.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With