Remove duplicated sequences in FASTA with Python

Question

I apologize if the question has been asked before, but I have been searching for days and could not find a solution in Python.

I have a large fasta file, containing headers and sequences.

>cavPor3_rmsk_tRNA-Leu-TTA(m) range=chrM:2643-2717 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTTAAGGTGGCAGAGCCGGTAATTGCATAAAATTTAAGACTTTACTCTCA
GAGGTTCAACTCCTCTCCTTAACAC

>cavPor3_rmsk_tRNA-Gln-CAA_ range=chrM:3745-3815 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

>cavPor3_rmsk_tRNA-Ser-TCA(m) range=chrM:6875-6940 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

This is a very small fragment of what the file looks like. I want to keep only the first entry (header and sequence) if, as you can see for the last two entries, the sequences are the same.

The output would look like this:

>cavPor3_rmsk_tRNA-Leu-TTA(m) range=chrM:2643-2717 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTTAAGGTGGCAGAGCCGGTAATTGCATAAAATTTAAGACTTTACTCTCA
GAGGTTCAACTCCTCTCCTTAACAC

>cavPor3_rmsk_tRNA-Gln-CAA_ range=chrM:3745-3815 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

The problem is that the FASTA file is over one gigabyte in size. I have found ways of solving this by removing duplicates based on duplicate IDs or by using bash, but sadly I can't do this on my computer. This task is for a research project, not a homework or task.

Thank you in advance for your help!

pippo1980 · Accepted Answer

this copied from here: Remove Redundant Sequences from FASTA file in Python

uses Biopython, but works with fasta file where headers are of the:

'> header' type see FAsta Format Wiki

from Bio import SeqIO
import time

start = time.time() 

seen = []
records = []

for record in SeqIO.parse("INPUT-FILE", "fasta"):  
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)


#writing to a fasta file
SeqIO.write(records, "OUTPUT-FILE", "fasta")
end = time.time()

print(f"Run time is {(end- start)/60}")

faster as suggested by MattMDo using a set istead of a list:

seen = set()
records = []

for record in SeqIO.parse("b4r2.fasta", "fasta"):  
    if record.seq not in seen:
        seen.add(record.seq)
        records.append(record)

I've got a longer one that uses argparser but its slower because counts the sequences can post it if needed

Remove duplicated sequences in FASTA with Python

Tags:

python

duplicates

biopython

fasta

Marco Badici

1 Answers

pippo1980

Recent Activity

Donate For Us

Remove duplicated sequences in FASTA with Python

Tags:

python

duplicates

biopython

fasta

Marco Badici

1 Answers

pippo1980

Related questions

Recent Activity

Donate For Us