In the following Bash command line, I am able to obtain the index for the substring, when the substring is between double quotes.
text='123ABCabc((XYZabc((((((abc123(((123'
echo $text | awk '{ print index($0, "((((a" )}' # 20 is the result.
However, in my application, I will not know what character will be where the "a" is in this example. Therefore, I thought I could replace the "a" with a regex that accepts any character other than "(". I thought that /[^(]/ would be what I needed. However, I have been unable to get the Awk index function to work with any form of regex in place of the "((((a" in the example.
UPDATE: It was pointed out by William Pursell that the index operation does not accept a regex as the second operand.
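Since index() only takes a fixed string, one option is match(), which accepts a regular expression and returns the 1-based start of the leftmost match (0 if there is none). A minimal sketch, assuming a POSIX awk and reusing the sample string from the question; the variable name pos is only for illustration:

```shell
text='123ABCabc((XYZabc((((((abc123(((123'

# match() accepts a regex, unlike index(); it returns the 1-based
# position where the leftmost match begins, or 0 when nothing matches.
pos=$(printf '%s\n' "$text" | awk '{ print match($0, /\(\(\(\([^(]/) }')
echo "$pos"   # prints 20
```

The result is the same 20 that index() gave for "((((a", but the character after the four parentheses no longer has to be known in advance.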
Ultimately, what I was trying to accomplish was to extract the substring that was located after four or more "(", followed by one or more ")". Dennis Williamson provided the solution with the following code:
echo 'dksjfkdj(((((((I-WANT-THIS-SUBSTRING)askdjflsdjf' |
mawk '{match($0,/\(\(\(\([^()]*\)/); s = substr($0,RSTART, RLENGTH); gsub(/[()]/, "", s); print s}'
Thanks to all for their help!
Awk has a built-in function called substr that selects a substring from the input. Its syntax is substr(s, a, b): it returns b characters from string s, starting at position a (positions are counted from 1). The parameter b is optional; if it is omitted, the result runs to the end of the string.
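As a quick illustration of both forms of substr, here is a small sketch assuming any POSIX awk; the sample string and the variable names three and rest are made up for the example:

```shell
# substr(s, a, b): b characters of s starting at 1-based position a.
three=$(awk 'BEGIN { print substr("123ABCabc", 4, 3) }')   # ABC
# Omitting b returns everything from position a to the end of the string.
rest=$(awk 'BEGIN { print substr("123ABCabc", 7) }')       # abc
echo "$three $rest"
```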
In awk, regular expressions (regex) allow for dynamic and complex pattern definitions. You are not limited to searching for fixed strings; you can also match classes of characters and repetitions.
Any awk expression is valid as an awk pattern. The pattern matches if the expression's value is nonzero (if a number) or non-null (if a string). The expression is reevaluated each time the rule is tested against a new input record.
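To make the rule about expressions as patterns concrete, the sketch below uses a bare index() call as the pattern. It assumes a POSIX awk; the three input lines and the variable name hits are invented for the example:

```shell
# index() returns 0 when "((((" is absent, so the bare expression works as
# a pattern; with no action given, awk prints each record that matches.
hits=$(printf '%s\n' 'no parens here' 'abc((((def' 'x((y' | awk 'index($0, "((((")')
echo "$hits"   # prints abc((((def
```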
To get the position of the first non-open-parenthesis after a sequence of them:
$ echo "$text" | awk '{ print match($0, /\(\(\(\(([^(])/, arr); print arr[1, "start"]}'
20
24
This shows the position of the substring matching "(((([^(]" (20) and the position of the character after the parentheses (24).
The ability to do this with match() is a GNU (gawk) extension.
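Where gawk is not available, the same position can be derived from RSTART and RLENGTH, which POSIX awk sets on every match() call. A sketch, assuming the pattern ends with exactly one non-parenthesis character so that the last matched character is the one of interest; the variable names out and p are illustrative:

```shell
text='123ABCabc((XYZabc((((((abc123(((123'

# The regex consumes four "(" plus one non-"(" character, so the last
# character of the match sits at RSTART + RLENGTH - 1.
out=$(printf '%s\n' "$text" |
  awk 'match($0, /\(\(\(\([^(]/) { p = RSTART + RLENGTH - 1; print RSTART, p, substr($0, p, 1) }')
echo "$out"   # prints 20 24 a
```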
Edit:
echo 'dksjfkdj(((((((I-WANT-THIS-SUBSTRING)askdjflsdjf' |
mawk '{match($0,/\(\(\(\([^()]*\)/); s = substr($0,RSTART, RLENGTH); gsub(/[()]/, "", s); print s}'
If you want to match four or more open-parentheses in order to find the start of yet another substring within the match, you actually have to calculate the value.
# Use GNU AWK to index the character after the end of a substring.
echo "$text" |
awk --re-interval 'match( $0, /\({4,}/ ) { print RSTART + RLENGTH }'
This should give you the correct starting index of the character following the sequence of parentheses, which in this case is 24.