Find files having more than one occurrence of a pattern on the same line

Question

I have a file in FASTA format, as in the example below. I would like to extract the entries in which the sequence 'CGTACG' occurs more than once.

>seq1
AAATTCCGTACGGGCCTCT
>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG
>seq3
AATGCCAAACGTACGAACAT

In this example, the output would be the following, because 'CGTACG' occurs twice in seq2:

>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG

Ed Morton · Accepted Answer

All you need is:

awk '/^>/{seq=$0} gsub(/CGTACG/,"&") > 1{print seq ORS $0}' file
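
For readers unfamiliar with the idiom, here is the same command spelled out with comments, as a sketch run against the sample data from the question. gsub() returns the number of substitutions it made, and replacing each match with "&" (the matched text itself) leaves the line unchanged, so the return value is simply a count of matches. The file name sample.fa is an assumption made for illustration.

```shell
# Recreate the sample input from the question (sample.fa is a hypothetical name).
cat > sample.fa <<'EOF'
>seq1
AAATTCCGTACGGGCCTCT
>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG
>seq3
AATGCCAAACGTACGAACAT
EOF

result=$(awk '
    /^>/ { seq = $0 }              # remember the most recent header line
    gsub(/CGTACG/, "&") > 1 {      # gsub returns how many matches it replaced
        print seq ORS $0           # header, newline, then the sequence line
    }
' sample.fa)

printf '%s\n' "$result"
```

Note that gsub() counts non-overlapping matches only, so a line such as CGTACGTACG, where two copies of the pattern share the middle CG, is counted once. This also assumes each sequence sits on a single line, as in the question's example.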

codeforester · Answer

You can use awk for this:

for file in *; do
    [[ -f "$file" ]] || continue # skip if not a regular file
    if ! awk -v seq=CGTACG '$0 ~ seq".*"seq {exit(1)}' "$file"; then
        # the file has two or more occurrences of the string on the same line, process it
        # more code
    fi
done

awk scans each file and exits with status 1 as soon as it finds a line that contains two or more occurrences of the string. Because of the ! in the if test, the body runs only when awk exits with a non-zero status, which here means a matching line was found.
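
To see the exit-status convention in isolation, the following sketch wraps the awk call in a small function and tries it on two throwaway files. The function name has_repeat_on_line and the file names twice.txt and once.txt are hypothetical, chosen only for this illustration.

```shell
# has_repeat_on_line PATTERN FILE
# Hypothetical helper: succeeds (status 0) if some line of FILE contains
# PATTERN at least twice, and fails (status 1) otherwise.
has_repeat_on_line() {
    # awk exits 1 on the first line matching PATTERN.*PATTERN, or 0 if no
    # line matches; the leading "!" inverts that so success means "found".
    ! awk -v seq="$1" '$0 ~ seq".*"seq {exit(1)}' "$2"
}

printf '%s\n' 'AAACGTACGTTTCGTACG' > twice.txt      # two occurrences on one line
printf '%s\n' 'AAACGTACGTTT' 'CGTACG' > once.txt    # one occurrence per line

if has_repeat_on_line CGTACG twice.txt; then echo "twice.txt: match"; fi
if has_repeat_on_line CGTACG once.txt;  then echo "once.txt: match";  fi
```

One caveat with this design: any non-zero status from awk counts as "found", so a file that awk cannot read would also be treated as a match. The [[ -f "$file" ]] guard in the loop reduces that risk but does not check readability.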

If we are looking for more than one match on different lines, then:

for file in *; do
    [[ -f "$file" ]] || continue # skip if not a regular file
    if ! awk -v seq=CGTACG '$0 ~ seq {x++; if(x>1) exit(1)}' "$file"; then
        # the file has two or more occurrences of the string on different lines, process it
        # more code
    fi
done
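
The loop above counts matching lines, so a line holding several occurrences still counts once. If the total number of occurrences in a file is what matters, the gsub() return value from the accepted answer can be summed instead. A minimal sketch, assuming a hypothetical helper name count_occurrences and a throwaway file mixed.txt:

```shell
# count_occurrences PATTERN FILE
# Hypothetical helper: prints the total number of non-overlapping matches
# of PATTERN across all lines of FILE.
count_occurrences() {
    # gsub returns the number of replacements on each line; "&" puts the
    # matched text back, so the input is left unchanged. "n + 0" forces a
    # numeric 0 when the file is empty.
    awk -v seq="$1" '{ n += gsub(seq, "&") } END { print n + 0 }' "$2"
}

printf '%s\n' 'CGTACGAACGTACG' 'TTCGTACG' > mixed.txt   # 2 matches, then 1

total=$(count_occurrences CGTACG mixed.txt)
if [ "$total" -gt 1 ]; then
    echo "mixed.txt has $total occurrences"
fi
```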

Tags:

unix

awk

fattel
