AWK use value of field in regex

Tags:

awk

I'm trying to match, within field $5, a pattern made up of the word CONCLUSIONS followed by the value of field $2 and then the value of field $3 from the same record.

For example, my_file.txt is separated by "|":

1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|
2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|
3|substance5|substance6|red|Substance5 interacts with substance6...|

So in this example I only want the first record to be printed because it has the word "CONCLUSIONS" followed by substance1 followed by substance2.

This is what I'm trying but it's not working:

awk 'BEGIN{FS="|";IGNORECASE=1}{if ($5 ~ /CONCLUSIONS.*$2.*$3/) {print $0}}' my_file.txt

Any help is much appreciated.


asked Feb 20 '15 02:02

1 Answer

$ awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*" $2 ".*" $3' my_file.txt
1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|

How It Works

BEGIN{FS="|";IGNORECASE=1}

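This part is unchanged from the code in the question. It sets the field separator to | and makes regex matching case-insensitive. Note that IGNORECASE is a GNU awk (gawk) extension.

$5 ~ "conclusions.*" $2 ".*" $3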

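This is a condition: it is true if $5 matches a regex built by concatenating four strings: "conclusions.*", the value of $2, ".*", and the value of $3. Because the right-hand side of ~ is a string expression rather than a /.../ literal, awk evaluates it first and then uses the result as the regex.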

We have specified no action for this condition. Consequently, if the condition is true, awk performs the default action, which is to print the line.
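
To make the concatenation concrete, here is a minimal sketch that prints the regex string built for each record of the sample file and whether $5 matches it. It swaps in tolower() for IGNORECASE so that it also runs on awks other than gawk; that substitution, the temporary file, and the variable names re, verdict, and report are assumptions made for this illustration, not part of the answer above.

```shell
# Recreate the sample data from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  '1|substance1|substance2|red|CONCLUSIONS: the effect of SUBSTANCE1 and SUBSTANCE2 in humans...|' \
  '2|substance3|substance4|red|Conclusions: Substance4 is not harmful...|' \
  '3|substance5|substance6|red|Substance5 interacts with substance6...|' > "$tmp"

# Build the regex as a string from $2 and $3, lowercase both sides
# (a portable stand-in for IGNORECASE), and report the result per record.
report=$(awk -F'|' '{
    re = tolower("conclusions.*" $2 ".*" $3)   # the string that ~ will use as a regex
    verdict = (tolower($5) ~ re) ? "match" : "no match"
    print $1 ": " re " -> " verdict
}' "$tmp")
printf '%s\n' "$report"
rm -f "$tmp"
```

Only the first record should be reported as a match, since it is the only one whose fifth field contains "conclusions" followed by both substance names in order.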

Simpler Examples

Consider:

$ echo "aa aa" | awk '$2 ~ /$1/'

This line prints nothing because awk does not expand variables or field references inside a /.../ regex literal.

Observe that no match is found here either:

$ echo '$1' | awk '$0 ~ /$1/'

There is no match here because, inside a regex, $ matches only at the end of a line. So /$1/ could only match an end of line followed by a 1, which never occurs. To get a match here, we need to escape the dollar sign:

$ echo '$1' | awk '$0 ~ /\$1/'
$1

To build a regex from awk variables or fields, put them on the right-hand side of ~ without slashes. This is the basis for this answer:

$ echo "aa aa" | awk '$2 ~ $1'
aa aa

This does yield a match, because awk uses the value of $1 as the regex.
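
The same distinction applies to ordinary variables, not only to fields. As a small sketch (the variable name pat and the two input words are made up for this illustration), the first command treats /pat/ as the literal letters p, a, t, while the second uses the value of the variable as the regex:

```shell
# /pat/ is a regex literal: it looks for the characters "pat", not the variable.
literal=$(printf '%s\n' 'apple' 'patio' | awk -v pat='^app' '$0 ~ /pat/')

# A bare variable on the right of ~ is used as a dynamic regex.
dynamic=$(printf '%s\n' 'apple' 'patio' | awk -v pat='^app' '$0 ~ pat')

printf 'literal: %s\ndynamic: %s\n' "$literal" "$dynamic"
```

The literal form selects patio, because that line contains the text "pat"; the dynamic form selects apple, because that line matches ^app.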

A Further Improvement

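As Ed Morton suggests in the comments, it may be important to require that the substances match only as whole words. In that case, we can wrap each substance in the word-boundary operators \< and \>, which are GNU awk extensions. Inside a string they are written \\< and \\> because awk processes the string's escape sequences once before using the result as a regex. Thus: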

awk 'BEGIN{FS="|";IGNORECASE=1} $5 ~ "conclusions.*\\<" $2 "\\>.*\\<" $3 "\\>"' my_file.txt

In this way, substance1 will not match substance10.
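
For awks that lack the \< and \> operators, a similar effect can be approximated with an explicit character class. The following is only a sketch under stated assumptions: the two-record sample, the three-field layout, and the names b, re, text, and whole are hypothetical, and "word character" is taken to mean a lowercase letter, digit, or underscore after lowercasing.

```shell
# Hypothetical sample: the value of $2 appears as a whole word only in record 1.
data='1|substance1|CONCLUSIONS: substance1 is safe
2|substance1|CONCLUSIONS: substance10 is safe'

whole=$(printf '%s\n' "$data" | awk -F'|' '{
    b    = "[^a-z0-9_]"                       # any single non-word character
    re   = "conclusions.*" b tolower($2) b    # value of $2 with a boundary on each side
    text = " " tolower($3) " "                # padding lets a value at either edge match
    if (text ~ re) print $1
}')
printf '%s\n' "$whole"
```

Padding the text with a space on each side avoids needing ^ or $ inside the regex, at the cost of treating the start and end of the field as word boundaries by construction.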


answered Sep 20 '22 18:09

John1024



Hallucigeniak
