I have a file in FASTA format, as in the example below. I would like to extract the entries from that file in which the sequence 'CGTACG' occurs more than once.
>seq1
AAATTCCGTACGGGCCTCT
>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG
>seq3
AATGCCAAACGTACGAACAT
In the example, the output would be (as the sequence 'CGTACG' occurs twice in seq2):
>seq2
TGGAATCACAGCGGCGTACGCAGCGGCGGCTGCGGCCGTACGGCG
All you need is:
awk '/^>/{seq=$0} gsub(/CGTACG/,"&") > 1{print seq ORS $0}' file
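The one-liner relies on gsub() returning the number of substitutions it made, and it assumes each sequence sits on a single line. FASTA files often wrap sequences over several lines, in which case the motif can straddle a line break and be missed. A minimal sketch for that case, assuming a POSIX shell and awk; the names multi_hit_records, emit, and motif are illustrative and not part of the original answer:

```shell
# Sketch: print FASTA records whose whole sequence contains the motif more than once.
# Usage: multi_hit_records MOTIF [FILE...]   (reads stdin when no file is given)
multi_hit_records() {
    motif=$1
    shift
    awk -v motif="$motif" '
        # Print the buffered record if its joined sequence has 2+ matches.
        function emit(   n) {
            if (hdr == "") return
            n = gsub(motif, "&", seq)   # "&" re-inserts the match, so seq is unchanged
            if (n > 1) print hdr ORS seq
        }
        /^>/ { emit(); hdr = $0; seq = ""; next }   # new header: finish previous record
        { seq = seq $0 }                            # join wrapped sequence lines
        END { emit() }                              # do not forget the last record
    ' "$@"
}
```

Calling multi_hit_records CGTACG file prints each matching header followed by its sequence joined onto one line. Note that gsub() counts non-overlapping matches only, so two overlapping copies of the motif count as one.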
You can use awk for this:
for file in *; do
    [[ -f "$file" ]] || continue # skip if not a regular file
    if ! awk -v seq=CGTACG '$0 ~ seq".*"seq {exit(1)}' "$file"; then
        # the file has two or more occurrences of the string on the same line, process it
        # more code
    fi
done
awk looks for the string in each file and exits with status 1 as soon as it finds a line that has two or more occurrences of the string. The if ! test makes sure that we pick up the file only when awk exits with a non-zero status, which here means the exit(1) was reached.
If we are looking for matches on two or more different lines instead, then:
for file in *; do
    [[ -f "$file" ]] || continue # skip if not a regular file
    if ! awk -v seq=CGTACG '$0 ~ seq {x++; if(x>1) exit(1)}' "$file"; then
        # the file has two or more occurrences of the string on different lines, process it
        # more code
    fi
done
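The two loops count different things: the first needs both occurrences on one line, the second counts matching lines. A variant that adds up the matches across all lines covers both cases at once. This is a sketch under assumptions, not the original answer's method: the function name has_repeated_motif is made up here, and it returns 0 when the file qualifies (the opposite of the exit(1) convention above), so it is used without the leading !.

```shell
# Sketch: return 0 when MOTIF occurs at least twice anywhere in FILE, 1 otherwise.
# Usage: has_repeated_motif MOTIF FILE
has_repeated_motif() {
    awk -v motif="$1" '
        { total += gsub(motif, "&") }      # add this line'"'"'s non-overlapping match count
        total > 1 { found = 1; exit }      # enough matches: stop reading, jump to END
        END { exit (found ? 0 : 1) }       # translate the flag into the exit status
    ' "$2"
}
```

Inside the loop this would read: if has_repeated_motif CGTACG "$file"; then ... fi. As with the line-based versions, a motif split across a wrapped line is not seen.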