
How to find first non-uppercase letter in the string using python

Tags: python

I'm a Python newbie but have programmed for a while in other languages. I have a long string of DNA (lower case) and AA sequences (upper case). In addition, at the start of the file I have a protein name, all in upper case. So my file looks like this:

PROTEINNAMEatcgatcg... JFENVKDFDFLK

I need to find the first non-uppercase letter in the string so I can then cut out the protein name. Thus, what I would want from the above is:

atcgatcg... JFENVKDFDFLK

I can do this with a loop, but that seems like overkill and inefficient. Is there a simple Python way to do it?

I can get all the uppercase letters using re.findall("[A-Z]",mystring) but then I would need to do a comparison to see where the result differs from the original string.

Thanks!

user1357015 asked Apr 25 '12 19:04



3 Answers

You are almost there with your regex, but the re module has other functions besides findall, such as re.sub:

http://docs.python.org/library/re.html#re.sub

>>> import re
>>> protein_regex = re.compile('^[A-Z]+')
>>> dna = 'PROTEINNAMEatcgatcg... JFENVKDFDFLK'
>>> protein_regex.sub('', dna)
'atcgatcg... JFENVKDFDFLK'

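Not sure about performance, but you could also use str.lstrip. (string.uppercase only exists on Python 2; string.ascii_uppercase works on both Python 2 and 3.)

>>> import string
>>> dna.lstrip(string.ascii_uppercase)
'atcgatcg... JFENVKDFDFLK'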

And there you have it (timed on Python 2, hence string.uppercase):

python -m timeit -n 10000 -s 'import re' -s 'protein_regex = re.compile("^[A-Z]+")' -s 'dna = "PROTEINNAMEatcgatcg... JFENVKDFDFLK"' 'protein_regex.sub("", dna)'
10000 loops, best of 3: 1.36 usec per loop

python -m timeit -n 10000 -s 'import string' -s 'dna = "PROTEINNAMEatcgatcg... JFENVKDFDFLK"' 'dna.lstrip(string.uppercase)'
10000 loops, best of 3: 0.444 usec per loop

The lstrip version looks to be ~3 times faster.
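For anyone on Python 3, here is a rough sketch of the same comparison done from inside a script with the timeit module. The helper names (strip_with_regex, strip_with_lstrip) are made up for illustration, and string.ascii_uppercase is used in place of string.uppercase. Absolute timings will vary with machine and Python version, so treat the numbers as indicative only.

```python
import re
import string
import timeit

# Hypothetical names for illustration; the sample string is the one from the question.
PROTEIN_PREFIX = re.compile(r"^[A-Z]+")
dna = "PROTEINNAMEatcgatcg... JFENVKDFDFLK"


def strip_with_regex(s):
    """Remove a leading run of ASCII uppercase letters using re.sub."""
    return PROTEIN_PREFIX.sub("", s)


def strip_with_lstrip(s):
    """Remove a leading run of ASCII uppercase letters using str.lstrip."""
    return s.lstrip(string.ascii_uppercase)


# Both approaches should agree before their speed is compared.
assert strip_with_regex(dna) == strip_with_lstrip(dna)

# Time 10000 calls of each; results depend on machine and Python version.
regex_seconds = timeit.timeit(lambda: strip_with_regex(dna), number=10000)
lstrip_seconds = timeit.timeit(lambda: strip_with_lstrip(dna), number=10000)
print("regex : %.4f s" % regex_seconds)
print("lstrip: %.4f s" % lstrip_seconds)
```

Note that lstrip treats its argument as a set of characters to remove, not as a literal prefix, which is exactly what is wanted here.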

sberry answered Nov 15 '22 16:11



Use re.search():

import re
s1 = "ASDFASDFASDFasdfasdfasdfasdfasdf"
# Search for the first lowercase ASCII letter anywhere in the string
m = re.search("[a-z]", s1)
if m:
    print("Lowercase letter found at position %d" % m.start())
else:
    print("No lowercase letter in that string")
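If you would rather avoid regex entirely, the same position can be found with a generator expression. This is only a sketch: first_non_upper_index is a made-up name, and it assumes that "non-uppercase" should include digits, spaces and punctuation as well as lowercase letters, because str.isupper() is False for all of those.

```python
def first_non_upper_index(s):
    """Return the index of the first character that is not uppercase,
    or -1 if every character is uppercase (or the string is empty)."""
    return next((i for i, ch in enumerate(s) if not ch.isupper()), -1)


s1 = "ASDFASDFASDFasdfasdfasdfasdfasdf"
i = first_non_upper_index(s1)
print(i)
```

The generator stops at the first hit, so it does not scan the whole string when the uppercase prefix is short.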
Nicholas DiPiazza answered Nov 15 '22 18:11



Try this, it's as short as it can get:

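import re
s = 'PROTEINNAMEatcgatcg... JFENVKDFDFLK'
# Index of the first lowercase letter; re.search returns None if there is none,
# in which case .start() raises AttributeError
i = re.search('[a-z]', s).start()
protein, sequences = s[:i], s[i:]

print(protein)
> PROTEINNAME

print(sequences)
> atcgatcg... JFENVKDFDFLK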
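One caveat with the snippet above is that it fails when the string has no lowercase letter at all. A possible variant, sketched here with a made-up helper name (split_protein), captures the leading uppercase run and the remainder in one pattern that always matches, so there is no None to check for.

```python
import re

# Hypothetical helper for illustration: group 1 is the leading run of A-Z
# (possibly empty), group 2 is everything after it. DOTALL lets "." also
# match newlines, so multi-line input is kept whole.
_SPLIT = re.compile(r"([A-Z]*)(.*)", re.DOTALL)


def split_protein(s):
    """Return (protein_name, remainder) for the given string."""
    m = _SPLIT.match(s)  # always matches, even for an empty string
    return m.group(1), m.group(2)


protein, sequences = split_protein("PROTEINNAMEatcgatcg... JFENVKDFDFLK")
print(protein)
print(sequences)
```

For an all-uppercase input this returns the whole string as the name and an empty remainder, which may or may not be what you want for your files.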
Óscar López answered Nov 15 '22 17:11
