My data (tab-separated):
1 0 0 1 0 1 1 0 1
1 1 0 1 0 1 0 1 1
1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0
...
How can I grep the lines with exactly 5 '1's, for example? Ideal output:
1 0 0 1 0 1 1 0 1
Also, how can I grep the lines with 5 or more (>=) '1's? Ideal output:
1 0 0 1 0 1 1 0 1
1 1 0 1 0 1 0 1 1
1 1 1 1 1 1 1 1 1
I tried:
grep 1$'\t'1$'\t'1$'\t'1$'\t'1
However, this only matches lines with 5 consecutive '1's, which is not all I want.
Is there a simple way to achieve this? Thank you!
John Bollinger's helpful answer and anishane's answer show that it can be done with grep, but, as has been noted, that is quite cumbersome, given that regular expressions aren't designed for counting.
awk, by contrast, is built for field-based parsing and counting (often combined with regular expressions to identify field separators, or, as below, the fields themselves).
Assuming you have GNU awk, you can use the following:
Exactly 5 1s:
awk -v FPAT='\\<1\\>' 'NF==5' file
5 or more 1s:
awk -v FPAT='\\<1\\>' 'NF>=5' file
Special variable FPAT is a GNU awk extension that allows you to identify fields via a regex that describes the fields themselves, in contrast with the standard approach of using a regex to define the separators between fields (via special variable FS or option -F):
'\\<1\\>' identifies any "isolated" 1 (bounded by non-word characters or by the start or end of the line) as a field, based on word-boundary assertions \< and \>; the \ must be doubled here so that the initial string parsing performed by awk doesn't "eat" single \s.
Standard variable NF contains the count of input fields in the line at hand, which allows easy numerical comparison. If the conditional evaluates to true, the input line at hand is implicitly printed (in other words: NF==5 is implicitly the same as NF==5 { print } and, more verbosely, NF==5 { print $0 }).
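As a minimal sketch of the implicit-print behavior described above, the following compares the three equivalent spellings. It deliberately uses awk's default whitespace splitting rather than FPAT, so it should run with any POSIX awk. The sample text and the variable names are made up for illustration.

```shell
# Made-up sample: lines with 3, 2, and 3 whitespace-separated fields.
lines=$(printf 'a b c\nd e\nf g h\n')

# Pattern only: the default action prints the matching line.
short=$(printf '%s\n' "$lines" | awk 'NF==3')
# Explicit print with no argument prints the current line.
plain=$(printf '%s\n' "$lines" | awk 'NF==3 { print }')
# Fully spelled out: $0 is the current line.
verbose=$(printf '%s\n' "$lines" | awk 'NF==3 { print $0 }')

printf '%s\n' "$short"
[ "$short" = "$plain" ] && [ "$plain" = "$verbose" ] && echo "all three forms agree"
```

Only the two three-field lines should be selected, and all three variables should hold the same text.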
A POSIX-compliant awk solution is a little more complicated:
Exactly 5 1s:
awk '{ l=$0; gsub("[\t0]", "") }; length($0)==5 { print l }' file
5 or more 1s:
awk '{ l=$0; gsub("[\t0]", "") }; length($0)>=5 { print l }' file
l=$0 saves the input line ($0) in its original form in variable l.
gsub("[\t0]", "") replaces all \t and 0 chars. in the input line with the empty string, i.e., effectively removes them, and only leaves (directly concatenated) 1 instances (if any).
length($0)==5 { print l } then prints the original input line (l) only if the length of the resulting string of 1s (i.e., the count of 1s now stored in the modified input line ($0)) matches the specified count.
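To make the stripping step above easier to follow, here is a small sketch that rebuilds the four sample rows from the question and prints, for each row, the length and content of $0 after the gsub() call, followed by the "exactly 5" filter itself. The variable names data, trace, and exact are illustrative only, and the sketch assumes the input contains nothing but 0, 1, and tabs.

```shell
# Rebuild the four sample rows from the question, turning spaces into tabs.
data=$(printf '%s\n' \
  '1 0 0 1 0 1 1 0 1' \
  '1 1 0 1 0 1 0 1 1' \
  '1 1 1 1 1 1 1 1 1' \
  '0 0 0 0 0 0 0 0 0' | tr ' ' '\t')

# Show what remains of each row once tabs and 0s are removed,
# prefixed by its length (which equals the number of 1s).
trace=$(printf '%s\n' "$data" |
  awk '{ gsub("[\t0]", ""); print length($0) ": [" $0 "]" }')
printf '%s\n' "$trace"

# The "exactly 5" command from above, applied to the same rows.
exact=$(printf '%s\n' "$data" |
  awk '{ l=$0; gsub("[\t0]", "") }; length($0)==5 { print l }')
printf '%s\n' "$exact"
```

The trace should report lengths 5, 6, 9, and 0 for the four rows, so only the first row passes the ==5 test, while the first three would pass >=5.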