
How to use a regex with Awk to extract the substring between parentheses?

In the following Bash command line, I am able to obtain the index for the substring, when the substring is between double quotes.

text='123ABCabc((XYZabc((((((abc123(((123'

echo $text | awk '{ print index($0, "((((a" )}'  # 20 is the result.

However, in my application, I will not know which character will appear where the "a" is in this example. I therefore thought I could replace the "a" with a regex that matches any character other than "(", and that /[^(]/ would be what I needed. However, I have been unable to get the Awk index function to work with any form of regex in place of the "((((a" in the example.

UPDATE: It was pointed out by William Pursell that the index operation does not accept a regex as the second operand.
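
For reference, match() is the regex-aware counterpart to index(). The following is a minimal sketch, assuming a POSIX awk and the same sample text as above. The variable name pos is only illustrative.

```shell
text='123ABCabc((XYZabc((((((abc123(((123'

# match() takes a regex and sets RSTART to the 1-based position of the
# leftmost match; [^(] stands in for the unknown character after "((((".
pos=$(printf '%s\n' "$text" | awk '{ match($0, /\(\(\(\([^(]/); print RSTART }')
echo "$pos"   # prints 20, the same position index() reported for "((((a"
```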

Ultimately, what I was trying to accomplish was to extract the substring that was located after four or more "(", followed by one or more ")". Dennis Williamson provided the solution with the following code:

echo 'dksjfkdj(((((((I-WANT-THIS-SUBSTRING)askdjflsdjf' | 
mawk '{match($0,/\(\(\(\([^()]*\)/); s = substr($0,RSTART, RLENGTH); gsub(/[()]/, "", s); print s}'

Thanks to all for their help!

asked May 31 '12 at 15:05 by GaryH.



2 Answers

To get the position of the first non-open-parenthesis after a sequence of them:

$ echo "$text" | awk '{ print match($0, /\(\(\(\(([^(])/, arr); print arr[1, "start"]}'
20
24

This shows the position of the substring matching "(((([^(]" (20) and the position of the character after the parentheses (24).

The ability to do this with match(), by passing an array as a third argument, is a GNU (gawk) extension.
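
Without gawk, the same two numbers can probably be derived from RSTART and RLENGTH alone. This is a sketch under the assumption that the pattern ends in exactly one non-parenthesis character, so that character is the last position of the match. The variable name out is only illustrative.

```shell
text='123ABCabc((XYZabc((((((abc123(((123'

# RSTART is where the match begins; the match ends on the single [^(]
# character, so its position is RSTART + RLENGTH - 1.
out=$(printf '%s\n' "$text" |
  awk 'match($0, /\(\(\(\([^(]/) { print RSTART; print RSTART + RLENGTH - 1 }')
echo "$out"   # prints 20, then 24
```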

Edit:

echo 'dksjfkdj(((((((I-WANT-THIS-SUBSTRING)askdjflsdjf' | 
    mawk '{match($0,/\(\(\(\([^()]*\)/); s = substr($0,RSTART, RLENGTH); gsub(/[()]/, "", s); print s}'
answered Oct 22 '22 at 17:10 by Dennis Williamson


If you want to match four or more open-parentheses in order to find the start of yet another substring that follows the match, you have to calculate the index yourself from RSTART and RLENGTH.

# Use GNU AWK to index the character after the end of a substring.
echo "$text" |
awk --re-interval 'match( $0, /\({4,}/ ) { print RSTART + RLENGTH }'

This should give you the correct starting index of the character following the sequence of parentheses, which in this case is 24.
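
If the awk at hand does not support interval expressions such as {4,}, the same "four or more" pattern can be spelled out by hand. This is a minimal sketch, assuming only POSIX awk features; the variable name next_pos is illustrative.

```shell
text='123ABCabc((XYZabc((((((abc123(((123'

# \(\(\(\(+ reads as three literal "(" followed by one or more "(",
# i.e. four or more in total, without needing {4,}.
next_pos=$(printf '%s\n' "$text" |
  awk 'match($0, /\(\(\(\(+/) { print RSTART + RLENGTH }')
echo "$next_pos"   # prints 24
```

Here the match covers the whole run of six parentheses starting at position 18, so RSTART + RLENGTH lands on the first character after them.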

answered Oct 22 '22 at 19:10 by Todd A. Jacobs