
Find files having more than one occurrence of a pattern on the same line

Tags: unix, awk

I have a file in FASTA format, as in the example below. I would like to extract the entries from that file in which the sequence 'CGTACG' occurs more than once on the same line.

>seq1
AAATTCCGTACGGGCCTCT
>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG
>seq3
AATGCCAAACGTACGAACAT

In the example the output would be (as the sequence 'CGTACG' occurs twice):

>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG

fattel asked Jan 27 '26 01:01


2 Answers

All you need is:

awk '/^>/{seq=$0} gsub(/CGTACG/,"&") > 1{print seq ORS $0}' file
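
To see why this works, the one-liner can be run against the sample data from the question. The following is only a sketch: the temporary file and the variable names (sample, header, result) are assumptions made for illustration, and the added "next" simply skips the pattern test on header lines.

```shell
#!/usr/bin/env bash
# Recreate the sample FASTA data from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>seq1
AAATTCCGTACGGGCCTCT
>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG
>seq3
AATGCCAAACGTACGAACAT
EOF

# /^>/ remembers the most recent header line.
# gsub() replaces every match with itself ("&") and returns the number of
# replacements made, i.e. the count of non-overlapping matches on the line.
result=$(awk '
    /^>/ { header = $0; next }
    gsub(/CGTACG/, "&") > 1 { print header ORS $0 }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because gsub() counts non-overlapping matches, a stretch such as CGTACGTACG is counted as one occurrence, not two.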

Ed Morton answered Jan 31 '26 09:01


You can use awk for this:

for file in *; do
    [[ -f "$file" ]] || continue # skip if not a regular file
    if ! awk -v seq=CGTACG '$0 ~ seq".*"seq {exit(1)}' "$file"; then
        # the file has two or more occurrences of the string on the same line, process it
        : # no-op placeholder (bash rejects an empty "then" branch); put your processing code here
    fi
done

awk reads each file and exits with status 1 as soon as it finds a line that contains two or more occurrences of the string; if no such line exists, it reaches the end of the input and exits with status 0. The "if !" test inverts that status, so we pick up the file only when awk exits with a non-zero status.
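
The exit-status logic can be checked in isolation. This is a minimal sketch, not part of the original answer: the helper name has_repeat, the file names single.fa and double.fa, and the temporary directory are all hypothetical.

```shell
#!/usr/bin/env bash
# has_repeat FILE: hypothetical helper; returns 0 (success) when some line
# of FILE contains the string CGTACG at least twice.
has_repeat() {
    # awk exits 1 on the first line with two occurrences; otherwise it
    # reaches the end of the input and exits 0. The leading "!" inverts
    # that status so the function can be used directly in an "if".
    ! awk -v seq=CGTACG '$0 ~ seq ".*" seq { exit 1 }' "$1"
}

# Two throwaway test files built from the question's sample entries.
tmp=$(mktemp -d)
printf '%s\n' '>seq1' 'AAATTCCGTACGGGCCTCT' > "$tmp/single.fa"
printf '%s\n' '>seq2' 'TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG' > "$tmp/double.fa"

matched=""
for file in "$tmp"/*; do
    [[ -f "$file" ]] || continue   # skip anything that is not a regular file
    if has_repeat "$file"; then
        matched+="${file##*/} "    # record the base name of each hit
    fi
done

echo "files with a repeated match: $matched"
rm -rf "$tmp"
```

One caveat of inverting the status this way: any non-zero exit from awk, including an error such as an unreadable file, is also treated as a match.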

If we are looking for files in which the string matches on more than one line, then:

for file in *; do
    [[ -f "$file" ]] || continue # skip if not a regular file
    if ! awk -v seq=CGTACG '$0 ~ seq {x++; if(x>1) exit(1)}' "$file"; then
        # the file has two or more occurrences of the string on different lines, process it
        : # no-op placeholder (bash rejects an empty "then" branch); put your processing code here
    fi
done

codeforester answered Jan 31 '26 08:01


