I have a DNA sequence and would like to get reverse complement of it using Python. It is in one of the columns of a CSV file and I'd like to write the reverse complement to another column in the same file. The tricky part is, there are a few cells with something other than A, T, G and C. I was able to get reverse complement with this piece of code:
def complement(seq):
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
bases = list(seq)
bases = [complement[base] for base in bases]
return ''.join(bases)
def reverse_complement(s):
return complement(s[::-1])
print "Reverse Complement:"
print(reverse_complement("TCGGGCCC"))
However, when I try to find the item which is not present in the complement dictionary, using the code below, I just get the complement of the last base. It doesn't iterate. I'd like to know how I can fix it.
def complement(seq):
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
bases = list(seq)
for element in bases:
if element not in complement:
print element
letters = [complement[base] for base in element]
return ''.join(letters)
def reverse_complement(seq):
return complement(seq[::-1])
print "Reverse Complement:"
print(reverse_complement("TCGGGCCCCX"))
Normally, DNA occurs as a double strand where each A is paired with a T and vice versa, and each C is paired with a G and vice versa. The reverse complement of a DNA sequence is formed by reversing the letters, interchanging A and T and interchanging C and G. Thus the reverse complement of ACCTGAG is CTCAGGT.
reverse_complement() computes the reverse complement (duh).
Reverse Complement. Reverse Complement converts a DNA sequence into its reverse, complement, or reverse-complement counterpart. The entire IUPAC DNA alphabet is supported, and the case of each input sequence character is maintained.
The other answers are perfectly fine, but if you plan to deal with real DNA sequences I suggest using Biopython. What if you encounter a character like "-", "*" or indefinitions? What if you want to do further manipulations of your sequences? Do you want to create a parser for each file format out there?
The code you ask for is as easy as:
from Bio.Seq import Seq
seq = Seq("TCGGGCCC")
print seq.reverse_complement()
# GGGCCCGA
Now if you want to do another transformations:
print seq.complement()
print seq.transcribe()
print seq.translate()
Outputs
AGCCCGGG
UCGGGCCC
SG
And if you run into strange chars, no need to keep adding code to your program. Biopython deals with it:
seq = Seq("TCGGGCCCX")
print seq.reverse_complement()
# XGGGCCCGA
The fastest one liner for reverse complement is the following:
def rev_compl(st):
nn = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
return "".join(nn[n] for n in reversed(st))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With