Replacing nested for loops and value assignment for list comprehension

Tags:

python

optimization

for-loop

list-comprehension

I've written a function to count the occurrences of certain characters (A, C, G and T) at the same position within multiple strings and save the number of occurrences in a dictionary.

For example, with the two strings 'ACGG' and 'CAGT', it should return:

{'A': [1, 1, 0, 0], 'C': [1, 1, 0, 0], 'G': [0, 0, 2, 1], 'T': [0, 0, 0, 1]}

I want to convert the code below to a list comprehension to optimize it for speed. It uses two nested for loops, and the input Motifs is a list of strings containing A's, C's, G's and T's.

def CountWithPseudocounts(Motifs):
    count = {}
    k = len(Motifs[0])
    t = len(Motifs)
    for s in 'ACGT':
        count[s] = [0] * k
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count

I've tried replacing the nested for loops at the bottom of the function with this list comprehension:

count = [ [ count[Motifs[i][j]][j] += 1 ] for i in range(0, t) ] for j in range(0, k)]

It doesn't work, probably because I'm not allowed to do the value assignment of += 1 within the list comprehension. How can I work around this?

asked Mar 08 '17 13:03

DavidK11

2 Answers

You can use zip():

In [10]: a = 'ACGG'           

In [11]: b = 'CAGT'

In [12]: chars = ['A', 'C', 'G', 'T'] 

In [13]: [[(ch==i) + (ch==j) for i, j in zip(a, b)] for ch in chars]
Out[13]: [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 1], [0, 0, 0, 1]]

If you want a dictionary you can use a dict comprehension:

In [25]: {ch:[(ch==i) + (ch==j) for i, j in zip(a, b)] for ch in chars}
Out[25]: {'T': [0, 0, 0, 1], 'G': [0, 0, 2, 1], 'C': [1, 1, 0, 0], 'A': [1, 1, 0, 0]}

Or, if you want the result in the same order as your character list, you can use collections.OrderedDict:

In [26]: from collections import OrderedDict

In [27]: OrderedDict((ch, [(ch==i) + (ch==j) for i, j in zip(a, b)]) for ch in chars)
Out[27]: OrderedDict([('A', [1, 1, 0, 0]), ('C', [1, 1, 0, 0]), ('G', [0, 0, 2, 1]), ('T', [0, 0, 0, 1])])
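
On Python 3.7 and later, plain dicts preserve insertion order as a language guarantee, so on those versions the dict comprehension should already return the keys in the order of chars, and OrderedDict is not needed for that purpose. A quick check, assuming such a Python version:

```python
a = 'ACGG'
b = 'CAGT'
chars = ['A', 'C', 'G', 'T']

# Same dict comprehension as above; bool + bool adds up to an int.
counts = {ch: [(ch == i) + (ch == j) for i, j in zip(a, b)] for ch in chars}

# On Python 3.7+ the keys come back in insertion order, i.e. the order of chars.
print(list(counts))
```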

If you still need more performance and/or you're dealing with long strings and larger data sets, you can use NumPy to solve this problem through a vectorized approach.

In [59]: import numpy as np

In [60]: chars = np.array(['A', 'C', 'G', 'T'])

In [61]: pairs = np.array((list(a), list(b))).T

In [62]: chars
Out[62]: 
array(['A', 'C', 'G', 'T'], 
      dtype='<U1')

In [63]: (chars[:,None,None] == pairs).sum(2)
Out[63]: 
array([[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 2, 1],
       [0, 0, 0, 1]])
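
The session above handles exactly two strings. A minimal sketch of the same broadcasting idea for a list of any number of equal-length strings might look like this; the names motifs, grid and result are chosen for this example only, and the third string is added purely for illustration:

```python
import numpy as np

motifs = ['ACGG', 'CAGT', 'ACGT']  # any number of equal-length strings
chars = np.array(['A', 'C', 'G', 'T'])

# Shape (t, k): one row per string, one column per position.
grid = np.array([list(m) for m in motifs])

# Broadcasting (4, 1, 1) against (t, k) gives a boolean array of shape
# (4, t, k); summing over axis 1 counts, for each character, how many
# strings contain it at each position.
counts = (chars[:, None, None] == grid).sum(axis=1)

# Convert back to the dictionary layout used in the question.
result = {ch: row.tolist() for ch, row in zip(chars.tolist(), counts)}
print(result)
```

Note that the intermediate boolean array has 4 * t * k elements, so memory use grows with the size of the input.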

answered Sep 20 '22 15:09

Mazdak

You indeed cannot perform an assignment inside a list comprehension (although you can trigger side effects by calling functions). A list comprehension expects an expression, and += is a statement. Furthermore, it is odd to assign to count while at the same time updating the old count.
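
To illustrate the remark about side effects, here is a minimal sketch in which a helper function (the name bump is made up for this example) performs the += so that the body of the comprehension stays an expression. It runs, but it builds a throwaway list of None values and is generally considered poor style, so treat it as a demonstration rather than a recommendation:

```python
def bump(count, symbol, j):
    """Hypothetical helper: increment the counter for `symbol` at position j."""
    count[symbol][j] += 1


motifs = ['ACGG', 'CAGT']
k = len(motifs[0])
count = {s: [0] * k for s in 'ACGT'}

# The comprehension is evaluated only for its side effects on `count`;
# its own value is a list of None, one per call to bump().
discarded = [bump(count, motif[j], j) for motif in motifs for j in range(k)]

print(count)
```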

One way to do this with a dictionary comprehension and a list comprehension, although not a very efficient one, is:

chars = 'ACGT'

a = 'ACGG'
b = 'CAGT'

sequences = list(zip(a,b))

counts = {char:[seq.count(char) for seq in sequences] for char in chars}

(credits to @Chris_Rands for the seq.count(char) suggestion)

This produces:

{'G': [0, 0, 2, 1], 'A': [1, 1, 0, 0], 'C': [1, 1, 0, 0], 'T': [0, 0, 0, 1]}

You can easily generalize the solution to count more strings by calling zip(..) with more strings.
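
As a sketch of that generalization, assuming the strings are held in a list (called motifs here, as in the question), the list can be unpacked into zip with the * operator; the third string is added only for illustration:

```python
motifs = ['ACGG', 'CAGT', 'ACGT']
chars = 'ACGT'

# zip(*motifs) yields one tuple per position, e.g. ('A', 'C', 'A') for position 0.
columns = list(zip(*motifs))

# For each character, count how often it appears in each column.
counts = {char: [col.count(char) for col in columns] for char in chars}
print(counts)
```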

You can also optimize the algorithm itself. This will probably be more effective, since you then loop over the strings only once and can use a dictionary lookup for each symbol:

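def CountWithPseudocounts(sequences):
    k = len(sequences[0])
    # one list of k zeros per character
    count = {char: [0] * k for char in 'ACGT'}
    for sequence in sequences:
        # enumerate() supplies the position j alongside each symbol
        for j, symbol in enumerate(sequence):
            count[symbol][j] += 1
    return count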
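
Which variant is faster depends on the input size and the Python version, so it is worth measuring rather than guessing. The following is a rough timing sketch using the standard timeit module; the function names count_loop and count_comprehension, the random test data and the repetition count are all arbitrary choices for this example, and no particular outcome is implied:

```python
import random
import timeit


def count_loop(sequences):
    """Single pass over every string, incrementing counters in place."""
    k = len(sequences[0])
    count = {char: [0] * k for char in 'ACGT'}
    for sequence in sequences:
        for j, symbol in enumerate(sequence):
            count[symbol][j] += 1
    return count


def count_comprehension(sequences):
    """Dict/list comprehension that calls count() on each column per character."""
    columns = list(zip(*sequences))
    return {char: [col.count(char) for col in columns] for char in 'ACGT'}


# Arbitrary test data: 200 random strings of length 50.
random.seed(0)
motifs = [''.join(random.choice('ACGT') for _ in range(50)) for _ in range(200)]

# Both variants must agree before their timings are worth comparing.
assert count_loop(motifs) == count_comprehension(motifs)

for fn in (count_loop, count_comprehension):
    seconds = timeit.timeit(lambda: fn(motifs), number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```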

EDIT:

If you want to add one to all elements in the counts you can use:

counts = {char:[seq.count(char)+1 for seq in sequences] for char in chars}
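
The same adjustment can be made in the loop-based function by starting every counter at one instead of zero. A minimal sketch, where the function name count_with_pseudocounts and the pseudocount parameter are hypothetical additions for this example:

```python
def count_with_pseudocounts(sequences, pseudocount=1):
    """Count symbols per position, starting each counter at `pseudocount`."""
    k = len(sequences[0])
    # Initialise with the pseudocount rather than zero (hypothetical parameter).
    count = {char: [pseudocount] * k for char in 'ACGT'}
    for sequence in sequences:
        for j, symbol in enumerate(sequence):
            count[symbol][j] += 1
    return count


print(count_with_pseudocounts(['ACGG', 'CAGT']))
```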

answered Sep 17 '22 15:09

Willem Van Onsem
