Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?

INTRODUCTION: I'm a bioinformatician. In my analysis, which I perform on all human genes (about 20,000), I search for a particular short sequence motif and count how many times it occurs in each gene.

Genes are 'written' as a linear sequence over four letters (A, T, G, C), for example: CGTAGGGGGTTTAC... This four-letter alphabet is the genetic code, the language in which each cell's DNA stores information.

I suspect that frequent repetitions of a particular short motif sequence (AGTGGAC) in some genes are crucial to a specific biochemical process in the cell. Since the motif itself is very short, it is difficult for computational tools to distinguish true functional occurrences in genes from those that look similar by chance. To avoid this problem, I concatenate the sequences of all genes into a single string and shuffle it, storing the length of each original gene. Then, for each of the original sequence lengths, I construct a random sequence by repeatedly picking A, T, G, or C at random from the concatenated sequence and transferring it to the random sequence. In this way, the resulting set of randomized sequences has the same length distribution and the same overall A, T, G, C composition as the real genes. I then search for the motif in these randomized sequences. I performed this procedure 1000 times and averaged the results:

15000 genes that do not contain a given motif
5000 genes that contain 1 motif
3000 genes that contain 2 motifs
1000 genes that contain 3 motifs
...
1 gene that contains 6 motifs

So even after randomizing the true genetic code 1000 times, there are no genes with more than 6 motifs. But in the true genetic code, a few genes contain more than 20 occurrences of the motif, which suggests that these repetitions might be functional and that finding them in such abundance by pure chance is unlikely.
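The randomization described in the introduction could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `count_motif` and `randomized_counts` and the three toy "genes" are hypothetical, and letters are drawn from the pooled sequence with replacement, which preserves the overall composition only on average.

```python
import random

MOTIF = "AGTGGAC"

def count_motif(seq, motif=MOTIF):
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count

def randomized_counts(genes, rng):
    """Build one random sequence per gene (same length) and count the motif.

    Assumption: letters are sampled with replacement from the pool of all
    gene sequences, so the A/T/G/C frequencies match on average.
    """
    pool = "".join(genes)
    return [
        count_motif("".join(rng.choices(pool, k=len(gene))))
        for gene in genes
    ]

# Toy input standing in for the real gene sequences (hypothetical data).
genes = ["CGTAGGGGGTTTAC", "AGTGGACAGTGGAC", "TTTTAGTGGACCC"]
rng = random.Random(0)

observed_counts = [count_motif(gene) for gene in genes]
null_counts = randomized_counts(genes, rng)  # one randomization round
```

Repeating `randomized_counts` many times and tallying how many genes show 0, 1, 2, ... motifs gives a table like the one above.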

PROBLEM:

I would like to know the probability of finding a gene with, let's say, 20 occurrences of the motif under my distribution, that is, the probability of finding such a gene by chance. I would like to implement this in Python, but I don't know how.

Can I do such an analysis in Python?

Any help would be appreciated.

asked Jul 07 '11 18:07

s_sherly

1 Answer

The SciPy documentation lists all implemented continuous distributions. Each one has a fit() method, which returns the fitted parameters: the shape parameters (if any), followed by the location and scale.
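As a minimal sketch of what fit() returns and how the result can be turned into a tail probability, the example below fits a gamma distribution to synthetic data and evaluates the survival function sf(), which is P(X > x). The sample, the choice of gamma, and the threshold of 20 are assumptions for illustration only; real motif counts are discrete, so a continuous fit is at best an approximation.

```python
import scipy.stats

# Synthetic positive sample standing in for real data (assumption).
sample = scipy.stats.gamma.rvs(2.0, scale=1.5, size=5000, random_state=0)

# fit() returns shape parameter(s) first, then loc and scale.
# floc=0 fixes the location at zero instead of estimating it.
a, loc, scale = scipy.stats.gamma.fit(sample, floc=0)

# Survival function: probability of a value larger than 20
# under the fitted distribution.
p_tail = scipy.stats.gamma.sf(20, a, loc=loc, scale=scale)
print(f"shape={a:.3f}, loc={loc}, scale={scale:.3f}, P(X > 20)={p_tail:.3g}")
```

The same pattern applies to any distribution with a fit() method: unpack the parameters, then pass them to pdf(), cdf(), or sf().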

Even if you don't know which distribution to use, you can try several distributions and choose the one that fits your data best, like in the code below. Note that if you have no idea about the distribution, it may be difficult to fit the sample.

    import matplotlib.pyplot as plt
    import numpy as np
    import scipy.stats

    size = 20000
    x = np.arange(size)
    # creating the dummy sample (using beta distribution)
    y = np.round(scipy.stats.beta.rvs(6, 2, size=size) * 47).astype(int)
    # creating the histogram
    h = plt.hist(y, bins=range(48))

    dist_names = ['alpha', 'beta', 'arcsine',
                  'weibull_min', 'weibull_max', 'rayleigh']

    for dist_name in dist_names:
        dist = getattr(scipy.stats, dist_name)
        # fit() returns (shape parameters..., loc, scale)
        params = dist.fit(y)
        arg = params[:-2]
        loc = params[-2]
        scale = params[-1]
        # multiply the density by the sample size to match the histogram counts
        pdf_fitted = dist.pdf(x, *arg, loc=loc, scale=scale) * size
        plt.plot(pdf_fitted, label=dist_name)

    plt.xlim(0, 47)
    plt.legend(loc='upper left')
    plt.show()
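To compare the candidates numerically rather than by eye, one option is to rank them by the Kolmogorov-Smirnov statistic from scipy.stats.kstest. The sketch below is an assumption-laden illustration, not part of the original answer: it uses a smaller, unrounded sample (the KS test assumes continuous data) and only three of the candidate distributions. Because the parameters are estimated from the same sample, the reported p-values are optimistic and are better used for ranking than as strict tests.

```python
import scipy.stats

# Continuous dummy sample (unrounded beta draws scaled to 0..47).
sample = scipy.stats.beta.rvs(6, 2, size=2000, random_state=0) * 47

dist_names = ["beta", "rayleigh", "weibull_min"]
results = []
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    params = dist.fit(sample)
    # kstest accepts the distribution name plus its fitted parameters.
    statistic, pvalue = scipy.stats.kstest(sample, dist_name, args=params)
    results.append((statistic, pvalue, dist_name))

# A smaller KS statistic means the fitted CDF is closer to the empirical CDF.
results.sort()
for statistic, pvalue, dist_name in results:
    print(f"{dist_name:12s} D={statistic:.4f} p={pvalue:.3g}")

best_statistic, best_pvalue, best_name = results[0]
```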

References:

- Distribution fitting with Scipy

- Fitting empirical distribution to theoretical ones with Scipy (Python)?

answered Sep 24 '22 07:09

Saullo G. P. Castro
