I'm trying to find a string pattern in field $5, composed of the word CONCLUSIONS followed by the values of fields $2 and $3 from the same record.
For example, my_file.txt is separated by "|":
1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|
3|substance5|substance6|red|Substance5 interacts with substance6...|
So in this example I only want the first record to be printed, because it has the word "CONCLUSIONS" followed by substance1 followed by substance2.
This is what I'm trying but it's not working:
awk 'BEGIN{FS="|";IGNORECASE=1}{if ($5 ~ /CONCLUSIONS.*$2.*$3/) {print $0}}' my_file.txt
Any help is much appreciated
$ awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*" $2 ".*" $3' my_file.txt
1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
BEGIN{FS="|";IGNORECASE=1}
This part is unchanged from the code in the question.
$5 ~ "conclusions.*" $2 ".*" $3
This is a condition: it is true if $5 matches a regex composed of four strings concatenated together: "conclusions.*", $2, ".*", and $3.
We have specified no action for this condition. Consequently, if the condition is true, awk performs the default action, which is to print the line.
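IGNORECASE is a GNU awk extension; other awk implementations treat it as an ordinary variable and keep matching case-sensitively. As a minimal sketch, assuming gawk may not be available, the same dynamic regex can be built after lower-casing both sides with tolower(). The temporary file and the variable names tmp, re, and result are illustrative only, and the action is written out explicitly to show what the default action does:

```shell
# Recreate the sample data from the question in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|
3|substance5|substance6|red|Substance5 interacts with substance6...|
EOF

# Lower-case both sides instead of relying on IGNORECASE (gawk only).
result=$(awk -F'|' '
{
    # Dynamic regex: string constants concatenated with field values.
    re = "conclusions.*" tolower($2) ".*" tolower($3)
    # Explicit form of the default action: print the unmodified record.
    if (tolower($5) ~ re) print
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Only the first record is printed, and it is printed in its original case because tolower() returns a copy and does not modify $5.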
Consider:
$ echo "aa aa" | awk '$2 ~ /$1/'
This line prints nothing because awk does not substitute fields or variables inside a regex literal (the text between slashes).
Observe that no match is found here either:
$ echo '$1' | awk '$0 ~ /$1/'
There is no match here because, inside a regex, $ matches only at the end of a line. So, /$1/ would only match the end of a line followed by a 1. If we want to get a match here, we need to escape the dollar sign:
$ echo '$1' | awk '$0 ~ /\$1/'
$1
To get a regex that uses awk variables, we can, as is the basis for this answer, do the following:
$ echo "aa aa" | awk '$2 ~ $1'
aa aa
This does successfully yield a match.
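One consequence worth keeping in mind: when a field is used as a dynamic regex, any regex metacharacters it contains (such as . or +) keep their special meaning. The following sketch, using made-up input, contrasts a regex match with a literal substring test via index(); the variable names as_regex and as_literal are illustrative only:

```shell
# $1 is used as a regex, so "." matches any character and "a.c" matches "abc".
as_regex=$(echo 'a.c abc' | awk '$2 ~ $1 { print "match" }')

# index() compares literally, so "a.c" is not found inside "abc".
as_literal=$(echo 'a.c abc' | awk 'index($2, $1) { print "match" }')

printf 'regex: [%s] literal: [%s]\n' "$as_regex" "$as_literal"
```

If the substance names could contain such characters, a literal comparison may be the safer choice.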
As Ed Morton suggests in the comments, it might be important to insist that the substances match only on whole words. In that case, we can use \\<...\\> to limit substance matches to whole words. Thus:
awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*\\<" $2 "\\>.*\\<" $3 "\\>"' my_file.txt
In this way, substance1 will not match substance10.
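The \< and \> word-boundary operators are GNU awk extensions. As a rough sketch for other awks, assuming substance names contain only letters, digits, and underscores, whole-word matching can be approximated by turning every other character into two spaces and then requiring a space on each side of each name. The two input records below are hypothetical, chosen so that the first mentions substance10 rather than substance1:

```shell
result=$(printf '%s\n' \
  '1|substance1|substance2|red|CONCLUSIONS: SUBSTANCE10 and SUBSTANCE2 in humans|' \
  '2|substance1|substance2|red|CONCLUSIONS: SUBSTANCE1 and SUBSTANCE2 in humans|' |
awk -F'|' '
{
    text = tolower($5)
    # Each non-word character becomes two spaces, so adjacent words
    # never have to share a single separator.
    gsub(/[^a-z0-9_]/, "  ", text)
    text = " " text " "
    # A space on both sides of each name forces a whole-word match.
    if (text ~ "conclusions.* " tolower($2) " .* " tolower($3) " ") print
}')

printf '%s\n' "$result"
```

Only the second record is printed: in the first, substance1 is immediately followed by 0, so the required trailing space is missing.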