
Regex: How to extract text from last parenthesis

Tags:

regex

r

What is a correct regular expression to extract the string "(procedure)" (or, in general, the text inside the last pair of parentheses) from the strings below?

Example input strings:

Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)

Another example:

Urinary tract infection prophylaxis (procedure)

Possible approaches are:

  • Go to the end of the text, search backwards for the first opening parenthesis, and take the substring from that position to the end of the text

  • From the beginning of the text, find the last '(' character and take the substring from that position to the end
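
The second approach can be sketched in base R without a capturing regex. This is only an illustration: the helper name last_paren_text is hypothetical, and it assumes the final parenthesized group runs to the end of the string, as in the examples here.

```r
# Hypothetical helper: return the text from the last "(" to the end of each string.
last_paren_text <- function(x) {
  # gregexpr() with fixed = TRUE finds every literal "(";
  # keep only the last position (it is -1 when there is no "(").
  open_pos <- vapply(gregexpr("(", x, fixed = TRUE),
                     function(p) p[length(p)], integer(1))
  out <- substring(x, open_pos, nchar(x))
  out[open_pos < 0] <- NA_character_  # no parenthesis found
  out
}

last_paren_text("Urinary tract infection prophylaxis (procedure)")
```

The result still includes the surrounding parentheses; they can be dropped afterwards, for example with gsub("[()]", "", ...).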

Other strings look like this (a different "tag" is extracted from each):

[1] "Xanthoma of eyelid (disorder)"                    "Ventricular tachyarrhythmia (disorder)"          
[3] "Abnormal urine odor (finding)"                    "Coloboma of iris (disorder)"                     
[5] "Macroencephaly (disorder)"                        "Right main coronary artery thrombosis (disorder)"

A general regex is sought; a solution in R would be even better.

userJT asked Oct 16 '25 05:10


2 Answers

If it is the last part of the string then this regex will do it:

/\(([^()]*)\)$/

Explanation: match an opening (, capture everything after it that is neither ( nor ), and then require a closing ) at the very end of the string.

https://regex101.com/r/cEsQtf/1
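
In R the same anchored pattern could be used roughly as follows. This is a sketch, not part of the answer above: the / delimiters are dropped, each backslash is doubled inside the R string, and the variable names are made up for illustration.

```r
x <- c("Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
       "Abnormal urine odor (finding)")

# Same pattern as above, written as an R string literal.
pattern <- "\\(([^()]*)\\)$"

# regexec() records the full match and the capture group;
# regmatches() extracts them as one character vector per input string.
m <- regmatches(x, regexec(pattern, x))

# Element 2 of each result is the capture group; NA if the pattern did not match.
tags <- vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
               character(1))
tags
```

Because of the $ anchor, a string that has any text after its final closing parenthesis does not match and yields NA here.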

Andy answered Oct 18 '25 19:10


sub can do that with the right regex:

Text = c("Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
    "Urinary tract infection prophylaxis (procedure)", 
    "Xanthoma of eyelid (disorder)",                    
    "Ventricular tachyarrhythmia (disorder)",          
    "Abnormal urine odor (finding)",                    
    "Coloboma of iris (disorder)",                   
    "Macroencephaly (disorder)",                        
    "Right main coronary artery thrombosis (disorder)")
sub(".*\\((.*)\\).*", "\\1", Text)
[1] "procedure" "procedure" "disorder"  "disorder"  "finding"   "disorder" 
[7] "disorder"  "disorder"

Addendum: Detailed explanation of the regex
The question asks for the content of the final set of parentheses in each string. The expression is slightly confusing because it uses parentheses in two different ways. One is to represent literal parentheses in the string being processed; the other is to set up a "capturing group", which is how we specify what part of the match should be returned. The expression is made up of five basic units:

1. Initial .*   - matches everything up to the final open parenthesis. 
   Note that this is relying on "greedy matching"
2. \\(   ...    \\)   - matches the final set of parentheses. 
   Because ( by itself means something else,  we need to "escape" the 
   parentheses by preceding them with \.  That is, we want the regular
   expression to say   \(  ...  \).  However, \ is also the escape
   character inside R string literals, and \( is not an escape sequence
   that R recognizes, so typing "\(" in R code is an error.  So we escape
   the backslash.  R will interpret   \\(  ... \\)      as \( ... \)
   meaning the literal characters ( & ). 
3. ( ... )       Inside the pair in part 2
   This is making use of the special meaning of parentheses.  When we
   enclose an expression in parentheses, whatever text it matches 
   will be stored for later use. That stored value is referred to as 
   \1,  which is what was used in the substitution pattern. Again, if 
   we just wrote \1, R would treat the backslash as the start of an
   escape sequence in the string literal. Writing \\1 is interpreted as
   the character \ followed by 1, i.e. \1.
4. Central .*    Inside the pair in part 3
   This is what we are looking for,  all characters inside the parentheses.
5. Final   .*
   This is in the expression to match any characters that may follow the 
   final set of parentheses. 
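
To see what the regex engine actually receives after R has processed the string literal, something like the following may help. It is a small illustration only; the raw-string line assumes R 4.0.0 or later.

```r
pattern <- ".*\\((.*)\\).*"

# writeLines() prints the string's real contents, with single backslashes.
writeLines(pattern)

# "\\(" is two characters: a backslash and an opening parenthesis.
nchar("\\(")

# Raw strings (R >= 4.0.0) avoid the doubled backslashes entirely.
raw_pattern <- r"[.*\((.*)\).*]"
identical(pattern, raw_pattern)
```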

The sub function uses this to replace the matched text (in this case, all characters in the string) with the substitution pattern \1, i.e. the contents of the first (in our case, only) capturing group: the text inside the final parentheses.
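
Two behaviours are worth checking on small inputs: the greedy leading .* picks the last pair of parentheses even when text follows it, and sub returns a string unchanged when the pattern does not match. The inputs below are made up for illustration.

```r
x <- c("Coloboma of iris (disorder)",
       "a (one) b (two) c",
       "no parentheses here")

res <- sub(".*\\((.*)\\).*", "\\1", x)
res

# If unmatched strings should become NA rather than pass through unchanged,
# one option is to test for a match first.
res_na <- ifelse(grepl("\\(.*\\)", x), res, NA_character_)
res_na
```

The first call gives "disorder", "two" and the untouched third string; the second replaces that third element with NA.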

G5W answered Oct 18 '25 19:10


