
AWK use value of field in regex

Tags: regex, awk

I'm trying to find records in which field $5 contains the word CONCLUSIONS, followed by the value of field $2 and then the value of field $3 from the same record.

For example, my_file.txt is separated by "|":

1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|
3|substance5|substance6|red|Substance5 interacts with substance6...|

So in this example I only want the first record to be printed because it has the word "CONCLUSIONS" followed by substance1 followed by substance2.

This is what I'm trying but it's not working:

awk 'BEGIN{FS="|";IGNORECASE=1}{if ($5 ~ /CONCLUSIONS.*$2.*$3/) {print $0}}' my_file.txt

Any help is much appreciated

Hallucigeniak asked Feb 20 '15 02:02




1 Answer

$ awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*" $2 ".*" $3' my_file.txt
1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|

How It Works

  • BEGIN{FS="|";IGNORECASE=1}

    This part is unchanged from the code in the question.

  • $5 ~ "conclusions.*" $2 ".*" $3

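    This is a condition: it is true if $5 matches a regex built by concatenating four strings: "conclusions.*", $2, ".*", and $3. Because the right-hand side of ~ is a string expression rather than a /.../ literal, awk evaluates it first and then uses the resulting string as the regex.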

    We have specified no action for this condition. Consequently, if the condition is true, awk performs the default action which is to print the line.
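
IGNORECASE is specific to GNU awk. As a minimal sketch, assuming an awk without that variable, the same dynamic-regex idea can be written by lowercasing both the text and the pattern with tolower(). The temporary file and the shell variable names sample and result are illustrative only and are not part of the original answer.

```shell
# Recreate the sample data from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|
3|substance5|substance6|red|Substance5 interacts with substance6...|
EOF

# Lowercase both sides instead of relying on IGNORECASE.
# The parentheses make it explicit that the whole concatenation is the regex.
result=$(awk -F'|' '
  tolower($5) ~ ("conclusions.*" tolower($2) ".*" tolower($3))
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Only the first record should be printed, as with the gawk version. The matched line keeps its original case because only the comparison is lowercased, not $0.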

Simpler Examples

Consider:

$ echo "aa aa" | awk '$2 ~ /$1/'

This prints nothing because awk does not expand fields or variables inside a /.../ regex literal: $1 is not replaced by the value of the first field.

Observe that no match is found here either:

$ echo '$1' | awk '$0 ~ /$1/'

There is no match here because, inside a regex, an unescaped $ matches only at the end of the string. So /$1/ would only match an end of string followed by a 1, which can never occur. To match a literal dollar sign, we need to escape it:

$ echo '$1' | awk '$0 ~ /\$1/'
$1

To build a regex from awk fields or variables, drop the slashes and put a plain expression on the right-hand side of ~. This is the basis for this answer:

$ echo "aa aa" | awk '$2 ~ $1'
aa aa

This does successfully yield a match.
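
As a small sketch of the same idea, the pattern can also be assembled into a variable first. The input lines and the names re and result below are made up for illustration. The third input line shows a side effect worth keeping in mind: the field's content is interpreted as a regex, so a dot in the field matches any character.

```shell
# Each input line has two fields; print the line when $2 matches $1 exactly,
# with $1 interpreted as a regex anchored at both ends.
result=$(printf '%s\n' 'aa aa' 'aa aab' 'a.c abc' |
  awk '{ re = "^" $1 "$" } $2 ~ re')

printf '%s\n' "$result"
```

The lines "aa aa" and "a.c abc" are printed. "aa aab" is not, because the anchors require the whole of $2 to match.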

A Further Improvement

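As Ed Morton suggests in the comments, it might be important to insist that the substances match only on whole words. In that case, we can use the word-boundary operators \< and \>, which are GNU awk extensions. Inside a string constant each backslash must be doubled, so they are written \\< and \\>. Thus: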

awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*\\<" $2 "\\>.*\\<" $3 "\\>"' my_file.txt

In this way, substance1 will not match substance10.
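
For an awk that lacks \< and \>, the following is a rough sketch of one possible workaround, not part of the original answer. It assumes that a "word" consists of letters, digits, and underscores, pads the text with spaces so that every word has a neighbour on both sides, and requires a non-word character around each substance. The sample records and the names B, gap, re, text, and result are hypothetical.

```shell
result=$(printf '%s\n' \
  '1|substance1|substance2|red|CONCLUSIONS: SUBSTANCE10 and SUBSTANCE2 differ...|' \
  '2|substance1|substance2|red|CONCLUSIONS: SUBSTANCE1 and SUBSTANCE2 differ...|' \
  '3|substance1|substance2|red|Conclusions: substance1/substance2 ratio|' |
  awk -F'|' '
    BEGIN { B = "[^a-z0-9_]" }          # exactly one non-word character
    {
      text = " " tolower($5) " "        # pad so string edges count as boundaries
      gap  = "(" B "|" B ".*" B ")"     # a single separator, or separators around any text
      re   = "conclusions.*" B tolower($2) gap tolower($3) B
    }
    text ~ re                           # default action: print the record
  ')

printf '%s\n' "$result"
```

Records 2 and 3 are printed. Record 1 is not, because substance1 is immediately followed by 0, which is a word character. As with the original answer, the field values are still treated as regexes, so any regex metacharacters in $2 or $3 keep their special meaning.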

John1024 answered Sep 20 '22 18:09