
awk match multiple pattern in column

Tags:

awk

What is the proper awk syntax to match multiple patterns in one column? Having a columnar file like this:

c11 c21 c31
c12 c22 c32
c13 c23 c33

How do I exclude lines whose second column matches c21 or c22?

With grep, one can do something like this (but it doesn't specify to match in the second column only):

> egrep -w -v "c21|c22" bar.txt 
c13 c23 c33

I tried playing with awk but to no avail:

> awk '$2 != /c21|c22/' bar.txt 
c11 c21 c31
c12 c22 c32
c13 c23 c33

> awk '$2 != "c21" || $2 != "c22"' bar.txt 
c11 c21 c31
c12 c22 c32
c13 c23 c33

So, what is the proper awk syntax to get this right?

PedroA asked Dec 14 '22 20:12



2 Answers

$2 != /c21|c22/

is shorthand for

$2 != ($0 ~ /c21|c22/)

This compares $2 with the result of matching $0 against c21 or c22. That result is either 1 or 0, so the condition only tests whether $2 differs from 1 or 0, which is true for every line in the sample file.
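To make the values involved visible, here is a minimal sketch that prints them for the question's three rows. It assumes a POSIX awk and feeds the rows in with printf rather than reading bar.txt; the names out and m are only for illustration.

```shell
# For each sample row print: $2, the value of the bare regexp test
# ($0 ~ /c21|c22/, which is what /c21|c22/ means on its own), and the
# value of the original condition, i.e. $2 compared with that 1 or 0.
out=$(printf '%s\n' 'c11 c21 c31' 'c12 c22 c32' 'c13 c23 c33' |
  awk '{ m = ($0 ~ /c21|c22/); print $2, m, ($2 != m) }')
printf '%s\n' "$out"
# c21 1 1
# c22 1 1
# c23 0 1
```

The last column is 1 on every row, which is why the original command printed every line.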

$2 != "c21" || $2 != "c22"

tests whether $2 is not equal to c21 or $2 is not equal to c22, a condition that is always true. Think about it: if $2 is c21 then the first condition ($2 != "c21") is false but the second condition ($2 != "c22") is true, and vice versa, so the or is true for any value of $2.
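The same reasoning can be checked row by row. This is a small sketch under the assumption of a POSIX awk, using the question's sample rows; the variable names a, b and out are hypothetical.

```shell
# For each sample row print $2 followed by the truth values (1 or 0) of:
#   a = ($2 != "c21"), b = ($2 != "c22"), a || b, a && b
out=$(printf '%s\n' 'c11 c21 c31' 'c12 c22 c32' 'c13 c23 c33' |
  awk '{ a = ($2 != "c21"); b = ($2 != "c22"); print $2, a, b, (a || b), (a && b) }')
printf '%s\n' "$out"
# c21 0 1 1 0
# c22 1 0 1 0
# c23 1 1 1 1
```

The a || b column is 1 everywhere, while the a && b column is 1 only for the row that should be kept.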

What you're trying to write is:

awk '$2 !~ /c21|c22/'

or more robustly:

awk '$2 !~ /^(c21|c22)$/'

and more briefly (plus just as robustly) the way to REALLY write that condition is:

awk '$2 !~ /^c2[12]$/'
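To illustrate why the anchored forms are the more robust ones, here is a sketch that runs the unanchored and anchored conditions over rows where column 2 merely contains c21 or c22. The rows with c210 and xc22 are made up for this example and are not from the question; a POSIX awk is assumed.

```shell
# Hypothetical rows: two of them only contain c21/c22 as a substring of $2.
rows=$(printf '%s\n' 'c11 c21 c31' 'c14 c210 c34' 'c15 xc22 c35' 'c13 c23 c33')

# Unanchored: any $2 containing c21 or c22 is excluded.
loose=$(printf '%s\n' "$rows" | awk '$2 !~ /c21|c22/')

# Anchored: only a $2 that is exactly c21 or c22 is excluded.
strict=$(printf '%s\n' "$rows" | awk '$2 !~ /^c2[12]$/')

printf 'unanchored:\n%s\n' "$loose"
printf 'anchored:\n%s\n' "$strict"
```

With these rows the unanchored condition keeps only the c23 line, whereas the anchored one also keeps the c210 and xc22 lines.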

If you want a string rather than regexp comparison, use either of these in a throwaway script (I favor the first because it has fewer negation signs, which IMHO makes it clearer):

awk '!($2 == "c21" || $2 == "c22")'
awk '$2 != "c21" && $2 != "c22"'

and this otherwise:

awk 'BEGIN{split("c21 c22",t); for (i in t) vals[t[i]]} !($2 in vals)'

That last is best since you only specify $2 once, and to test more values you just add them to the string being split, which means you can't break the comparison logic later in the script.
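For readers unfamiliar with the idiom, the lookup-array version can be spelled out as below. This is only a sketch: the extra value c24 and the fourth row are added for illustration and are not part of the question, and a POSIX awk is assumed.

```shell
# Same technique as the one-liner, expanded and commented.
out=$(printf '%s\n' 'c11 c21 c31' 'c12 c22 c32' 'c13 c23 c33' 'c14 c24 c34' |
  awk '
    BEGIN {
      # split() stores the words of the list in t[1], t[2], ... and
      # returns how many there are.
      n = split("c21 c22 c24", t)
      # Referencing vals[key] creates that key; its value is irrelevant.
      for (i = 1; i <= n; i++) vals[t[i]]
    }
    # "in" tests key membership without creating an entry;
    # a pattern with no action prints the line.
    !($2 in vals)
  ')
printf '%s\n' "$out"
# c13 c23 c33
```

Excluding another value only requires adding it to the string passed to split(); the condition itself stays untouched.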

Ed Morton answered Jan 24 '23 08:01



Use and (&&) instead of or (||):

awk '$2 != "c21" && $2 != "c22"' bar.txt 

Prints:

c13 c23 c33

Since c21 doesn't equal c22, the || version prints lines with c21 in column 2 because $2 doesn't equal c22, and vice versa for lines with c22. In fact, the || version can never skip a line, because column 2 cannot equal both c21 and c22 at once.

Algorithmic Canary answered Jan 24 '23 08:01
