INTRODUCTION: I'm a bioinformatician. In my analysis which I perform on all human genes (about 20 000) I search for a particular short sequence motif to check how many times this motif occurs in each gene.
Genes are 'written' in a linear sequence in four letters (A,T,G,C). For example: CGTAGGGGGTTTAC... This is the four-letter alphabet of genetic code which is like the secret language of each cell, it’s how DNA actually stores information.
I suspect that frequent repetitions of a particular short motif sequence (AGTGGAC) in some genes are crucial in a specific biochemical process in the cell. Since the motif itself is very short it is difficult with computational tools to distinguish between true functional examples in genes and those that look similar by chance. To avoid this problem I get sequences of all genes and concatenated into a single string and shuffled. The length of each of the original genes was stored. Then for each of the original sequence lengths, a random sequence was constructed by repeatedly picking A or T or G or C at random from the concatenated sequence and transferring it to the random sequence. In this way, the resulting set of randomized sequences has the same length distribution, as well as the overall A,T,G,C composition. Then I search for the motif in these randomized sequences. I performed this procedure 1000 times and averaged the results.
So even after 1000 times randomization of true genetic code, there aren't any genes which have more than 6 motifs. But in the true genetic code, there are a few genes which contain more then 20 occurrences of the motif, which suggest that these repetition might be functional and it's unlikely to find them in such an abundance by pure chance.
PROBLEM:
I would like to know the probability of finding a gene with let's say 20 occurrences of the motif in my distribution. So I want to know the probability to find such a gene by chance. I would like to implement this in Python, but I don't know how.
Can I do such an analysis in Python?
Any help would be appreciated.
Return estimates of shape (if applicable), location, and scale parameters from data.
To fit a symmetrical distribution to data obeying a negatively skewed distribution (i.e. skewed to the left, with mean < mode, and with a right hand tail this is shorter than the left hand tail) one could use the squared values of the data to accomplish the fit.
In SciPy documentation you will find a list of all implemented continuous distribution functions. Each one has a fit()
method, which returns the corresponding shape parameters.
Even if you don't know which distribution to use you can try many distrubutions simultaneously and choose the one that fits better to your data, like in the code below. Note that if you have no idea about the distribution it may be difficult to fit the sample.
import matplotlib.pyplot as plt import scipy import scipy.stats size = 20000 x = scipy.arange(size) # creating the dummy sample (using beta distribution) y = scipy.int_(scipy.round_(scipy.stats.beta.rvs(6,2,size=size)*47)) # creating the histogram h = plt.hist(y, bins=range(48)) dist_names = ['alpha', 'beta', 'arcsine', 'weibull_min', 'weibull_max', 'rayleigh'] for dist_name in dist_names: dist = getattr(scipy.stats, dist_name) params = dist.fit(y) arg = params[:-2] loc = params[-2] scale = params[-1] if arg: pdf_fitted = dist.pdf(x, *arg, loc=loc, scale=scale) * size else: pdf_fitted = dist.pdf(x, loc=loc, scale=loc) * size plt.plot(pdf_fitted, label=dist_name) plt.xlim(0,47) plt.legend(loc='upper left') plt.show()
References:
- Distribution fitting with Scipy
- Fitting empirical distribution to theoretical ones with Scipy (Python)?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With