you can get the data here! 2shared the bottom download I'm analyzing biological data with python. I've written down a code to find matching substrings within a list of lists of long strings. The substrings are in a list and have length of 7 nucleotides. So in the list, from AAAAAAA to TTTTTTT, 16384 motifs(substrings) are present, permutating A,C,G,T. This code has a for loop for the list of substrings and the list of lists of long strings nested inside. It works ok, but because of the list of lists with 12000lines, the code processes very slowly. In other words, to give info about AAAAAAA and to the next which is AAAAAAC takes 2 mins. so it will take 16384 motifs to go through 12000 lines for 2 mins, it will take (16384*2 == 32768 min -> 546 hours -> 22days...) I'm using scipy and numpy to get Pvalues. What I want to is counting number of presence and absence of substrings in list of sequences The list of long strings and the code are like this: <pre class="prettyprint"><code>list_of_lists_long = [ [BGN, -0.054, AGGCAGCTGCAGCCACCGCGGGGCCTCAGTGGGGGTCTCTGG....] [ABCB7, 0.109, GTCACATAAGACATTTTCTTTTTTTGTTGTTTTGGACTACAT....] [GPR143, -0.137, AGGGGATGTGCTGGGGGTCCAGACCCCATATTCCTCAGACTC....] [PLP2, -0.535, GCGAACTTCCCTCATTTCTCTCTGCAATCTGCAAATAACTCC....] [VSIG4, 0.13, AAATGCCCCATTAGGCCAGGATCTGCTGACATAATTGCCTAG....] [CCNB3, -0.071, CAGCAGCCACAGGGCTAAGCATGCATGTTAACAGGATCGGGA....] [TCEAL3, 0.189, TGCCTTTGGCCTTCCATTCTGATTTCTCTGATGAGAATACGA....] ....] #12000 lines </code></pre> Is there any faster logic to proceed the code much faster?? I need your help! Thank you in advance. ===================================== Is there any easier method, without implementing any other things? I think the iteration for pattern match is the problem... what I'm trying to find is BOTH how many times a length 7 motif occurs in the whole list of sequences AND NOT OCCURS ALSO!!!. So if a motif is present in a string, which is TRUE as bool, then increment a value AND FALSE, then increment an other value. NOT the number of motifs within a string.

Excellent question. This is a classic computer science problem. Yes, there is indeed a better algorithm. Yours processes each long string 16384 times. A better way is to process each long string only once. Rather than searching for each motif within each long string, instead you should just record which motifs appear in each long string. For example, if you were searching for length 2 motifs in the following string: <pre class="prettyprint"><code>s = 'ACGTAC' </code></pre> then you could run a loop over the length 2 substrings and record which ones are present in a <code>dict</code>: <pre class="prettyprint"><code>motifAppearances = {} for i in range(len(s)-1): motif = s[i:i+2] # grab a length=2 substring if motif not in motifAppearances: motifAppearances[motif] = 0 # initialize the count motifAppearances[motif] += 1 # increment the count </code></pre> Now you've processed the entire string exactly once and found all the motifs present in it. In this case the resulting dict would look like: <pre class="prettyprint"><code>motifAppearances = {'AC':2, 'CG':1, 'GT':1, 'TA':1} </code></pre> Doing something similar for your case should cut down your run time by exactly 16384-fold.

A clean and very fast way (~15s with OP's data) would be to use the CountVectorizer of scikits-learn as it uses numpy under the hood, for example: <pre class="prettyprint"><code>from sklearn.feature_extraction.text import CountVectorizer def make_chunks(s): width = 2 return [s[i:i+width] for i in range(len(s)-width+1)] l = ['ATTGCGGCTCACGAA', 'ACCTAGATACGACGG', 'CCCCTGTCCATGGTA'] vectorizer = CountVectorizer(tokenizer=make_chunks) X = vectorizer.fit_transform(l) </code></pre> Now <code>X</code> is a sparse matrix having all possible chunks as columns and sequence as rows, where every value is the number of occurrences of the given chunk in each sequence: <pre class="prettyprint"><code>>>> X.toarray() # aa ac ag at ca cc cg ... [[1 1 0 1 1 0 2 1 1 2 1 0 0 1 1 1] # ATTGCGGCTCACGAA [0 3 1 1 0 1 2 1 2 0 1 0 2 0 0 0] # ACCTAGATACGACGG [0 0 0 1 1 4 0 1 0 0 1 2 1 1 2 0]] # CCCCTGTCCATGGTA >>> (X.toarray()>0).astype(int) # the same but counting only once per sequence [[1 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1] [0 1 1 1 0 1 1 1 1 0 1 0 1 0 0 0] [0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 0]] >>> vectorizer.get_feature_names() # the columns(chunks) [u'aa', u'ac', u'ag', u'at', u'ca', u'cc', u'cg', u'ct', u'ga', u'gc', u'gg', u'gt', u'ta', u'tc', u'tg', u'tt'] </code></pre> Now you can sum along the columns, mask non-values or whatever manipulation you need to do, for example: <pre class="prettyprint"><code>>>> X.sum(axis=0) [[1 4 1 3 2 5 4 3 3 2 3 2 3 2 3 1]] </code></pre> Finally to find how many occurrences a given motif has, you must find the index of the corresponding motif/chunk and then evaluate in the previous sum: <pre class="prettyprint"><code>>>> index = vectorizer.vocabulary_.get('ag') # 'ag' is your motif 2 # this means third column </code></pre> In your case you would have to divide your list in two parts(positive and negative values) in order to include the down condition. I made a quick test with the list from DSM's answer and it takes around 3-4 seconds in my computer to run. If I use 12 000 length 4000 sequences, then it takes a little less than a minute. EDIT: The whole code using OP's data can be found here.

python counting number of presence and absence of substrings in list of sequences

Tags:

python

string

numpy

scipy

bioinformatics

you can get the data here! 2shared the bottom download

I'm analyzing biological data with python.

I've written down a code to find matching substrings within a list of lists of long strings.

The substrings are in a list and have length of 7 nucleotides.

So in the list, from AAAAAAA to TTTTTTT, 16384 motifs(substrings) are present, permutating A,C,G,T.

This code has a for loop for the list of substrings and the list of lists of long strings nested inside.

It works ok, but because of the list of lists with 12000lines, the code processes very slowly.

In other words, to give info about AAAAAAA and to the next which is AAAAAAC takes 2 mins.

so it will take 16384 motifs to go through 12000 lines for 2 mins, it will take (16384*2 == 32768 min -> 546 hours -> 22days...)

I'm using scipy and numpy to get Pvalues.

What I want to is counting number of presence and absence of substrings in list of sequences

The list of long strings and the code are like this:

list_of_lists_long  =  [
[BGN,    -0.054,     AGGCAGCTGCAGCCACCGCGGGGCCTCAGTGGGGGTCTCTGG....]
[ABCB7,  0.109,      GTCACATAAGACATTTTCTTTTTTTGTTGTTTTGGACTACAT....]
[GPR143, -0.137,     AGGGGATGTGCTGGGGGTCCAGACCCCATATTCCTCAGACTC....]
[PLP2,   -0.535,     GCGAACTTCCCTCATTTCTCTCTGCAATCTGCAAATAACTCC....]
[VSIG4,  0.13,       AAATGCCCCATTAGGCCAGGATCTGCTGACATAATTGCCTAG....]
[CCNB3,  -0.071,     CAGCAGCCACAGGGCTAAGCATGCATGTTAACAGGATCGGGA....]
[TCEAL3, 0.189,      TGCCTTTGGCCTTCCATTCTGATTTCTCTGATGAGAATACGA....]
....] #12000 lines

Is there any faster logic to proceed the code much faster??

I need your help!

Thank you in advance.

=====================================

Is there any easier method, without implementing any other things?

I think the iteration for pattern match is the problem...

what I'm trying to find is BOTH how many times a length 7 motif occurs in the whole list of sequences AND NOT OCCURS ALSO!!!. So if a motif is present in a string, which is TRUE as bool, then increment a value AND FALSE, then increment an other value.

NOT the number of motifs within a string.

560

asked Nov 16 '13 15:11

Karyo

2 Answers

Excellent question. This is a classic computer science problem. Yes, there is indeed a better algorithm. Yours processes each long string 16384 times. A better way is to process each long string only once.

Rather than searching for each motif within each long string, instead you should just record which motifs appear in each long string. For example, if you were searching for length 2 motifs in the following string:

s = 'ACGTAC'

then you could run a loop over the length 2 substrings and record which ones are present in a dict:

motifAppearances = {}
for i in range(len(s)-1):
    motif = s[i:i+2]                   # grab a length=2 substring
    if motif not in motifAppearances:
        motifAppearances[motif] = 0    # initialize the count
    motifAppearances[motif] += 1       # increment the count

Now you've processed the entire string exactly once and found all the motifs present in it. In this case the resulting dict would look like:

motifAppearances = {'AC':2, 'CG':1, 'GT':1, 'TA':1}

Doing something similar for your case should cut down your run time by exactly 16384-fold.

answered Sep 30 '22 21:09

dg99

A clean and very fast way (~15s with OP's data) would be to use the CountVectorizer of scikits-learn as it uses numpy under the hood, for example:

from sklearn.feature_extraction.text import CountVectorizer

def make_chunks(s):
    width = 2
    return [s[i:i+width] for i in range(len(s)-width+1)]

l = ['ATTGCGGCTCACGAA', 'ACCTAGATACGACGG', 'CCCCTGTCCATGGTA']

vectorizer = CountVectorizer(tokenizer=make_chunks)
X = vectorizer.fit_transform(l)

Now X is a sparse matrix having all possible chunks as columns and sequence as rows, where every value is the number of occurrences of the given chunk in each sequence:

>>> X.toarray()
# aa ac ag at ca cc cg ...
[[1  1  0  1  1  0  2  1 1 2 1 0 0 1 1 1]      # ATTGCGGCTCACGAA
 [0  3  1  1  0  1  2  1 2 0 1 0 2 0 0 0]      # ACCTAGATACGACGG
 [0  0  0  1  1  4  0  1 0 0 1 2 1 1 2 0]]     # CCCCTGTCCATGGTA

>>> (X.toarray()>0).astype(int)  # the same but counting only once per sequence
[[1 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1]
 [0 1 1 1 0 1 1 1 1 0 1 0 1 0 0 0]
 [0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 0]]

>>> vectorizer.get_feature_names()     # the columns(chunks)
[u'aa', u'ac', u'ag', u'at', u'ca', u'cc', u'cg', u'ct', u'ga', u'gc', u'gg', u'gt', u'ta', u'tc', u'tg', u'tt']

Now you can sum along the columns, mask non-values or whatever manipulation you need to do, for example:

>>> X.sum(axis=0)
[[1 4 1 3 2 5 4 3 3 2 3 2 3 2 3 1]]

Finally to find how many occurrences a given motif has, you must find the index of the corresponding motif/chunk and then evaluate in the previous sum:

>>> index = vectorizer.vocabulary_.get('ag')    # 'ag' is your motif
2   # this means third column

In your case you would have to divide your list in two parts(positive and negative values) in order to include the down condition. I made a quick test with the list from DSM's answer and it takes around 3-4 seconds in my computer to run. If I use 12 000 length 4000 sequences, then it takes a little less than a minute.

EDIT: The whole code using OP's data can be found here.

answered Sep 30 '22 21:09

elyase

Related questions
                            
                                Foreign Key in Django model
                            
                                Python -- Ancient Taxation
                            
                                How to see output of print statements when scrapy logger enabled
                            
                                understanding python twisted asynchronicity in terms of operating system
                            
                                Finding the intersection of a curve from polyfit
                            
                                RBF interpolation: LinAlgError: singular matrix
                            
                                Execute Shell Script from Python with multiple pipes
                            
                                How to log error messages with Flask and foreman (heroku)?
                            
                                How to use super to initialize all the parents when using multiple inheritance
                            
                                Plotting stochastic processes in Python
                            
                                How to Normalize Names
                            
                                In python, can one iterate through large text files using buffers and get the correct file position at the same time?
                            
                                How to get a 16bit Unsigned integer in python
                            
                                How to strip the string which contain forward slash?
                            
                                How can I uninstall Flask?
                            
                                Combining memoization and tail call optimization
                            
                                Cannot use Python select.poll in Mac OS?
                            
                                Installing Pyomo on Windows with Anaconda (Python)
                            
                                How do I compile Pyparsing with Cython on WIndows?
                            
                                Get start and stop indexes of overlapping matches?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With