Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?

Tags:

INTRODUCTION: I'm a bioinformatician. In my analysis which I perform on all human genes (about 20 000) I search for a particular short sequence motif to check how many times this motif occurs in each gene.

Genes are 'written' in a linear sequence in four letters (A,T,G,C). For example: CGTAGGGGGTTTAC... This is the four-letter alphabet of genetic code which is like the secret language of each cell, it’s how DNA actually stores information.

I suspect that frequent repetitions of a particular short motif sequence (AGTGGAC) in some genes are crucial in a specific biochemical process in the cell. Since the motif itself is very short it is difficult with computational tools to distinguish between true functional examples in genes and those that look similar by chance. To avoid this problem I get sequences of all genes and concatenated into a single string and shuffled. The length of each of the original genes was stored. Then for each of the original sequence lengths, a random sequence was constructed by repeatedly picking A or T or G or C at random from the concatenated sequence and transferring it to the random sequence. In this way, the resulting set of randomized sequences has the same length distribution, as well as the overall A,T,G,C composition. Then I search for the motif in these randomized sequences. I performed this procedure 1000 times and averaged the results.

  • 15000 genes that do not contain a given motif
  • 5000 genes that contain 1 motif
  • 3000 genes that contain 2 motifs
  • 1000 genes that contain 3 motifs
  • ...
  • 1 gene that contain 6 motifs

So even after 1000 times randomization of true genetic code, there aren't any genes which have more than 6 motifs. But in the true genetic code, there are a few genes which contain more then 20 occurrences of the motif, which suggest that these repetition might be functional and it's unlikely to find them in such an abundance by pure chance.

PROBLEM:

I would like to know the probability of finding a gene with let's say 20 occurrences of the motif in my distribution. So I want to know the probability to find such a gene by chance. I would like to implement this in Python, but I don't know how.

Can I do such an analysis in Python?

Any help would be appreciated.

like image 697
s_sherly Avatar asked Jul 07 '11 18:07

s_sherly


People also ask

What does Scipy fit return?

Return estimates of shape (if applicable), location, and scale parameters from data.

How do you fit a data distribution?

To fit a symmetrical distribution to data obeying a negatively skewed distribution (i.e. skewed to the left, with mean < mode, and with a right hand tail this is shorter than the left hand tail) one could use the squared values of the data to accomplish the fit.


1 Answers

In SciPy documentation you will find a list of all implemented continuous distribution functions. Each one has a fit() method, which returns the corresponding shape parameters.

Even if you don't know which distribution to use you can try many distrubutions simultaneously and choose the one that fits better to your data, like in the code below. Note that if you have no idea about the distribution it may be difficult to fit the sample.

enter image description here

import matplotlib.pyplot as plt import scipy import scipy.stats size = 20000 x = scipy.arange(size) # creating the dummy sample (using beta distribution) y = scipy.int_(scipy.round_(scipy.stats.beta.rvs(6,2,size=size)*47)) # creating the histogram h = plt.hist(y, bins=range(48))  dist_names = ['alpha', 'beta', 'arcsine',               'weibull_min', 'weibull_max', 'rayleigh']  for dist_name in dist_names:     dist = getattr(scipy.stats, dist_name)     params = dist.fit(y)     arg = params[:-2]     loc = params[-2]     scale = params[-1]     if arg:         pdf_fitted = dist.pdf(x, *arg, loc=loc, scale=scale) * size     else:         pdf_fitted = dist.pdf(x, loc=loc, scale=loc) * size     plt.plot(pdf_fitted, label=dist_name)     plt.xlim(0,47) plt.legend(loc='upper left') plt.show() 

References:

- Distribution fitting with Scipy

- Fitting empirical distribution to theoretical ones with Scipy (Python)?

like image 110
Saullo G. P. Castro Avatar answered Sep 24 '22 07:09

Saullo G. P. Castro