Motif search with Gibbs sampler

I am a beginner in both programming and bioinformatics, so I would appreciate your patience. I am trying to write a Python script for motif search using Gibbs sampling, as explained in the Coursera class "Finding Hidden Messages in DNA". The pseudocode provided in the course is:

GIBBSSAMPLER(Dna, k, t, N)
    randomly select k-mers Motifs = (Motif_1, …, Motif_t) in each string from Dna
    BestMotifs ← Motifs
    for j ← 1 to N
        i ← Random(t)
        Profile ← profile matrix constructed from all strings in Motifs except for Motif_i
        Motif_i ← Profile-randomly generated k-mer in the i-th sequence
        if Score(Motifs) < Score(BestMotifs)
            BestMotifs ← Motifs
    return BestMotifs
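
For readers unfamiliar with the phrase, the "Profile-randomly generated k-mer" step amounts to weighted random sampling: every k-mer in the i-th string gets a weight equal to its probability under the profile, and one k-mer is drawn in proportion to those weights. The following is only a minimal Python 3 sketch of that idea; the function name `profile_random_kmer`, the dictionary layout of the profile, and the toy profile values are assumptions made for illustration and do not come from the course.

```python
import random

def profile_random_kmer(text, k, profile):
    """Draw one k-mer from text, weighted by its probability under profile.

    profile is assumed to map each base to a list of k column probabilities.
    """
    kmers = [text[s:s + k] for s in range(len(text) - k + 1)]
    weights = []
    for kmer in kmers:
        p = 1.0
        for pos, base in enumerate(kmer):
            p *= profile[base][pos]
        weights.append(p)
    # random.choices accepts relative weights, so they need not sum to 1.
    return random.choices(kmers, weights=weights, k=1)[0]

# Hypothetical two-column profile that favours the k-mer "GT".
toy_profile = {"A": [0.1, 0.1], "C": [0.1, 0.1], "G": [0.7, 0.1], "T": [0.1, 0.7]}
picked = profile_random_kmer("ACGTAC", 2, toy_profile)
```

With this toy profile, "GT" is the most likely pick, but every other 2-mer in the string can still be drawn, which is what distinguishes this step from always taking the most probable k-mer.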

Problem description:

CODE CHALLENGE: Implement GIBBSSAMPLER.

Input: Integers k, t, and N, followed by a collection of strings Dna.

Output: The strings BestMotifs resulting from running GIBBSSAMPLER(Dna, k, t, N) with 20 random starts. Remember to use pseudocounts!

Sample Input:

 8 5 100
 CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA
 GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG
 TAGTACCGAGACCGAAAGAAGTATACAGGCGT
 TAGATCAAGTTTCAGGTGCACGTCGGTGAACC
 AATCCACCAGCTCCACGTGCAATGTTGGCCTA

Sample Output:

 TCTCGGGG
 CCAAGGTG
 TACAGGCG
 TTCAGGTG
 TCCACGTG
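
The reminder about pseudocounts in the problem statement matters because a base that never occurs in some column would otherwise get probability zero, which zeroes out every k-mer containing it. A minimal Python 3 sketch of a profile built with a pseudocount of 1 is shown here; the function name `profile_with_pseudocounts` and the dictionary layout are illustrative assumptions, not the course's reference code.

```python
def profile_with_pseudocounts(motifs, pseudocount=1):
    """Column-wise base probabilities, with every count started at pseudocount."""
    k = len(motifs[0])
    # Starting from the pseudocount guarantees that no probability is zero.
    counts = {base: [pseudocount] * k for base in "ACGT"}
    for motif in motifs:
        for pos, base in enumerate(motif):
            counts[base][pos] += 1
    # Every column has the same total: one count per motif plus four pseudocounts.
    total = len(motifs) + 4 * pseudocount
    return {base: [c / total for c in col] for base, col in counts.items()}

profile = profile_with_pseudocounts(["ACG", "ACT", "AGG"])
```

In this toy example the first column holds three A's, so `profile["A"][0]` is 4/7 while the three unseen bases each get 1/7 instead of 0.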

I followed the pseudocode to the best of my knowledge. Here is my code:

def BuildProfileMatrix(dnamatrix):
    ProfileMatrix = [[1 for x in xrange(len(dnamatrix[0]))] for x in xrange(4)]
    indices = {'A':0, 'C':1, 'G': 2, 'T':3}
    for seq in dnamatrix:
        for i in xrange(len(dnamatrix[0])):
            ProfileMatrix[indices[seq[i]]][i] += 1
    ProbMatrix = [[float(x)/sum(zip(*ProfileMatrix)[0]) for x in y] for y in ProfileMatrix]
    return ProbMatrix

def ProfileRandomGenerator(profile, dna, k, i):
    indices = {'A':0, 'C':1, 'G': 2, 'T':3}
    score_list = []
    for x in xrange(len(dna[i]) - k + 1):
        probability = 1
        window = dna[i][x : k + x]
        for y in xrange(k):
            probability *= profile[indices[window[y]]][y]
        score_list.append(probability)
    rnd = uniform(0, sum(score_list))
    current = 0
    for z, bias in enumerate(score_list):
        current += bias
        if rnd <= current:
            return dna[i][z : k + z]

def score(motifs):
    ProfileMatrix = [[0 for x in xrange(len(motifs[0]))] for x in xrange(4)]
    indices = {'A':0, 'C':1, 'G': 2, 'T':3}
    for seq in motifs:
        for i in xrange(len(motifs[0])):
            ProfileMatrix[indices[seq[i]]][i] += 1
    score = len(motifs)*len(motifs[0]) - sum([max(x) for x in zip(*ProfileMatrix)])
    return score

from random import randint, uniform

def GibbsSampler(k, t, N):
    dna = ['CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA',
           'GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG',
           'TAGTACCGAGACCGAAAGAAGTATACAGGCGT',
           'TAGATCAAGTTTCAGGTGCACGTCGGTGAACC',
           'AATCCACCAGCTCCACGTGCAATGTTGGCCTA']
    Motifs = []
    for i in [randint(0, len(dna[0])-k) for x in range(len(dna))]:
        j = 0
        kmer = dna[j][i : k+i]
        j += 1
        Motifs.append(kmer)
    BestMotifs = []
    s_best = float('inf')
    for i in xrange(N):
        x = randint(0, t-1)
        Motifs.pop(x)
        profile = BuildProfileMatrix(Motifs)
        Motif = ProfileRandomGenerator(profile, dna, k, x)
        Motifs.append(Motif)
        s_motifs = score(Motifs)
        if s_motifs < s_best:
            s_best = s_motifs
            BestMotifs = Motifs
    return [s_best, BestMotifs]

k, t, N = 8, 5, 100
best_motifs = [float('inf'), None]

# Repeat the Gibbs sampler search 20 times.
for repeat in xrange(20):
    current_motifs = GibbsSampler(k, t, N)
    if current_motifs[0] < best_motifs[0]:
        best_motifs = current_motifs
# Print and save the answer.
print '\n'.join(best_motifs[1])

Unfortunately, my code never gives the same output as the solved example. While debugging, I also found that the scores (the number of mismatches between the motifs) look wrong inside the sampler, even though the score function works perfectly when I run it on its own.

The output changes every time I run the script, but here is one example of the output for the input hard-coded in the script:

TATGTGTA
TATGTGTA
TATGTGTA
GGTGTTCA
TATACAGG

Could you please help me debug this code? I spent the whole day trying to find out what is wrong with it. It may well be a silly mistake, but I have not been able to spot it.

Thank you all!!

asked Feb 27 '16 23:02

Ali Elbehery

1 Answer

Finally, I found out what was wrong in my code! It was in line 54:

Motifs.append(Motif)

Each iteration removes one motif at random, builds a profile from the remaining motifs, and then randomly selects a new motif based on that profile. I should have put the selected motif back at the position of the one that was removed, NOT appended it to the end of the motif list.
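
The effect of the position is easy to see on a toy list. The sketch below (Python 3) uses made-up placeholder labels instead of real k-mers so that each entry shows which DNA string it was taken from; the labels and variable names are purely illustrative.

```python
# Placeholder motifs: each label records which DNA string it came from.
motifs = ["from_dna0", "from_dna1", "from_dna2", "from_dna3"]
x = 1  # index of the string being resampled

appended = list(motifs)
appended.pop(x)
appended.append("new_from_dna1")     # ends up at index 3, not index 1

inserted = list(motifs)
inserted.pop(x)
inserted.insert(x, "new_from_dna1")  # goes back to index 1

print(appended)  # ['from_dna0', 'from_dna2', 'from_dna3', 'new_from_dna1']
print(inserted)  # ['from_dna0', 'new_from_dna1', 'from_dna2', 'from_dna3']
```

After `append`, the entry at index `i` no longer belongs to the `i`-th DNA string, so later iterations remove a motif from one string while sampling its replacement from another. Some strings can then contribute several motifs and others none, which would be consistent with the repeated `TATGTGTA` in the example output of the question.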

Now, the correct code is:

Motifs.insert(x, Motif)

The new code worked as expected.
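
For reference, here is a minimal Python 3 sketch of the whole sampler with this fix applied (assigning to `motifs[i]` has the same effect as `pop` followed by `insert`). It is not the exact code from the question: all function names are illustrative, and beyond the fix it differs in two ways that are assumptions about the intended behaviour. First, the initial k-mers are taken one from each string, whereas the posted loop resets `j = 0` on every pass and so always reads `dna[0]`. Second, the best motifs are stored as a copy, because `BestMotifs = Motifs` makes both names refer to the same list, which keeps changing afterwards.

```python
import random

BASES = "ACGT"

def build_profile(motifs):
    """Column probabilities with a pseudocount of 1 for every base."""
    k = len(motifs[0])
    counts = {b: [1] * k for b in BASES}
    for motif in motifs:
        for pos, base in enumerate(motif):
            counts[base][pos] += 1
    total = len(motifs) + 4
    return {b: [c / total for c in counts[b]] for b in BASES}

def profile_random_kmer(text, k, profile):
    """Draw one k-mer from text, weighted by its probability under profile."""
    kmers = [text[s:s + k] for s in range(len(text) - k + 1)]
    weights = []
    for kmer in kmers:
        p = 1.0
        for pos, base in enumerate(kmer):
            p *= profile[base][pos]
        weights.append(p)
    return random.choices(kmers, weights=weights, k=1)[0]

def score(motifs):
    """Count letters that differ from the most common letter in each column."""
    return sum(len(col) - max(col.count(b) for b in BASES) for col in zip(*motifs))

def gibbs_sampler(dna, k, t, n_iter):
    # One random starting k-mer from each string (assumed intent).
    motifs = []
    for text in dna:
        start = random.randint(0, len(text) - k)
        motifs.append(text[start:start + k])
    best, best_score = list(motifs), score(motifs)
    for _ in range(n_iter):
        i = random.randrange(t)
        others = motifs[:i] + motifs[i + 1:]
        # Replace in place so motifs[i] always comes from dna[i].
        motifs[i] = profile_random_kmer(dna[i], k, build_profile(others))
        current = score(motifs)
        if current < best_score:
            best, best_score = list(motifs), current  # copy, not alias
    return best_score, best

def best_of_restarts(dna, k, t, n_iter, restarts=20):
    """Run the sampler several times and keep the lowest-scoring result."""
    runs = (gibbs_sampler(dna, k, t, n_iter) for _ in range(restarts))
    return min(runs, key=lambda run: run[0])

dna = ["CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA",
       "GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG",
       "TAGTACCGAGACCGAAAGAAGTATACAGGCGT",
       "TAGATCAAGTTTCAGGTGCACGTCGGTGAACC",
       "AATCCACCAGCTCCACGTGCAATGTTGGCCTA"]
best_score, best = best_of_restarts(dna, 8, 5, 100)
print("\n".join(best))
```

Because the algorithm is randomized, the printed motifs can differ from run to run and need not match the sample output exactly, even with 20 restarts.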

answered Nov 11 '22 10:11

Ali Elbehery

Tags: python, algorithm, bioinformatics
