What is the proper awk syntax to match multiple patterns in one column? Given a columnar file like this:
c11 c21 c31
c12 c22 c32
c13 c23 c33
how can I exclude lines that match c21 or c22 in the second column?
With grep, one can do something like this (but it doesn't specify to match in the second column only):
> egrep -w -v "c21|c22" bar.txt
c13 c23 c33
I tried playing with awk but to no avail:
> awk '$2 != /c21|c22/' bar.txt
c11 c21 c31
c12 c22 c32
c13 c23 c33
> awk '$2 != "c21" || $2 != "c22"' bar.txt
c11 c21 c31
c12 c22 c32
c13 c23 c33
So, what is the proper awk syntax to get this right?
$2 != /c21|c22/
is shorthand for
$2 != ($0 ~ /c21|c22/)
which is comparing $2 to the result of matching $0 against c21 or c22. That result is either 1 (match) or 0 (no match), so it's testing for $2 having a value other than 1 or 0, which is true for every line of your sample data.
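To see that comparison against 1 or 0 in action, here is a small sketch. The input lines are made up for illustration (they are not from the question): the second field is deliberately 1 or 0 so that the comparison can actually come out false.

```shell
# Hypothetical input: field 2 is 1 or 0; the first two lines contain c21, the last two do not.
out=$(printf '%s\n' 'c21 1 x' 'c21 0 x' 'zzz 1 x' 'zzz 0 x' |
  awk '$2 != /c21|c22/')
printf '%s\n' "$out"
# A line is dropped only when $2 equals the match result (1 if $0 matched, else 0),
# so this should print:
#   c21 0 x
#   zzz 1 x
```

With the question's data, $2 is never 1 or 0, which is why every line came through.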
$2 != "c21" || $2 != "c22"
is testing for $2 not equal to c21 or $2 not equal to c22, which is a condition that is always true. Think about it: if $2 is c21 then the first condition ($2 != "c21") is false but the second condition ($2 != "c22") is true, and so on, so the or is always true for any value of $2.
What you're trying to write is:
awk '$2 !~ /c21|c22/'
or more robustly:
awk '$2 !~ /^(c21|c22)$/'
and more briefly (plus just as robustly) the way to REALLY write that condition is:
awk '$2 !~ /^c2[12]$/'
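As a rough illustration of why the anchored forms are more robust, the sketch below runs the unanchored and anchored conditions over a few made-up rows (not from the question) in which some second-column values merely contain c21 or c22.

```shell
# Hypothetical rows: "c210" and "xc22" contain the patterns but are different values.
rows='a c21 x
b c210 x
c xc22 x
d c23 x'

# Unanchored: any $2 containing c21 or c22 is excluded, including c210 and xc22.
loose=$(printf '%s\n' "$rows" | awk '$2 !~ /c21|c22/')

# Anchored: only $2 values that are exactly c21 or c22 are excluded.
strict=$(printf '%s\n' "$rows" | awk '$2 !~ /^c2[12]$/')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The unanchored version should keep only the `d` row, while the anchored version should also keep the `b` and `c` rows.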
If you wanted to do a string rather than regexp comparison, then you'd do either of these if it's a throwaway script (I favor the first because it has fewer negation signs, which IMHO makes it clearer):
awk '!($2 == "c21" || $2 == "c22")'
awk '$2 != "c21" && $2 != "c22"'
and this otherwise:
awk 'BEGIN{split("c21 c22",t); for (i in t) vals[t[i]]} !($2 in vals)'
That last is best since you only specify $2 once, and if you need to test more values you can just add them to the string being split, which means you can't break the comparison logic later in the script.
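One possible way to make that list easy to extend is to pass it in from the shell. This is only a sketch: the `exclude` function and the `list` variable are names invented here for illustration, and the data is the sample from the question.

```shell
data='c11 c21 c31
c12 c22 c32
c13 c23 c33'

# exclude and list are hypothetical names; list holds the space-separated values to drop.
exclude() {
  awk -v list="$1" '
    BEGIN { n = split(list, t); for (i = 1; i <= n; i++) vals[t[i]] }  # build the lookup set
    !($2 in vals)                                                      # print lines whose $2 is not in it
  '
}

two=$(printf '%s\n' "$data" | exclude 'c21 c22')        # should leave: c13 c23 c33
three=$(printf '%s\n' "$data" | exclude 'c21 c22 c23')  # should leave nothing
printf '%s\n' "$two"
```

Adding a value means editing only the string passed to the function; the condition itself never changes.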
Use and (&&) instead of or (||):
awk '$2 != "c21" && $2 != "c22"' bar.txt
Prints:
c13 c23 c33
Since c21 doesn't equal c22, lines with c21 in column 2 will be printed in the version with || because $2 doesn't equal c22, and vice versa for lines with c22. In fact, it is impossible for any line not to be printed, because in no line can column 2 equal both c21 and c22.