Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

finding frequent string patterns using python

So I am trying to do this problem where I have to find the most frequent 6-letter string within some lines in python, so I realize one could do something like this:

>>> from collections import Counter
>>> x = Counter("ACGTGCA")
>>> x
Counter({'A': 2, 'C': 2, 'G': 2, 'T': 1})

Now then, the data I'm using is DNA files and the format of a file goes something like this:

> name of the protein
ACGTGCA ... < more sequences> 
ACGTGCA ... < more sequences> 
ACGTGCA ... < more sequences> 
ACGTGCA ... < more sequences> 

> another protein 
AGTTTCAGGAC ... <more sequences>
AGTTTCAGGAC ... <more sequences>
AGTTTCAGGAC ... <more sequences>
AGTTTCAGGAC ... <more sequences>

We can start by going one protein at a time, but then how does one modify the block of code above to search for the most frequent 6-character string pattern? Thanks.

like image 715
dhillonv10 Avatar asked Nov 26 '11 19:11

dhillonv10


People also ask

How do you find the maximum occurrence of a word in a string in python?

Method 2 : Using collections.Counter() + max() We find maximum occurring character by using max() on values.

What is pattern matching in Python?

Pattern matching involves providing a pattern and an associated action to be taken if the data fits the pattern. At its simplest, pattern matching works like the switch statement in C/ C++/ JavaScript or Java. Matching a subject value against one or more cases.


2 Answers

I think the simplest way is just do this:

>>> from collections import Counter
>>> protein = "AGTTTCAGGAC"
>>> Counter(protein[i:i+6] for i in range(len(protein)-5))
Counter({'TTCAGG': 1, 'AGTTTC': 1, 'CAGGAC': 1, 'TCAGGA': 1, 'GTTTCA': 1, 'TTTCAG': 1})
like image 151
Duncan Avatar answered Sep 21 '22 15:09

Duncan


The old itertools docs (via this answer) provide the window function, which is a generic version of @Duncan's answer.

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result    
    for elem in it:
        result = result[1:] + (elem,)
        yield result

Then you can just do

collections.Counter(window(x))

Personally I'd still go with string slicing, but this is a generic version if you want it.

like image 39
Katriel Avatar answered Sep 19 '22 15:09

Katriel