
A way of testing a set of genomic locations for exon/intron/utr?

I would like to test a bunch of genomic locations of the form:

chr4:154723876-154724615
chr6:139580853-139581090
chr18:30440532-30441569

I want to see whether each one is located in a UTR, an intron, an exon or an intergenic sequence. I don't need to know which gene's intron (etc.) a given coordinate falls in.

I assume that each known genetic element (like an exon) has a defined genomic location (start-end position on a given chromosome). I know this is true for exons and introns, as for example Ensembl has IDs for each exon in the genome: see example of exons and introns of Amy1 gene in Mus musculus. I want to query a database of such locations with the above list of my locations, and if there is an overlap between the two (ideally I should be able to specify the minimum overlap, say, at least 10 bp, but if not I am OK), I should get a hit (yes, this region is in an exon/intron/UTR).

The catch is that I have a few thousand of these locations and would ideally like to query them all in one go, and as output get a table where each location is assigned "intron/exon/utr/intergenic". The organism is Mus musculus and the locations are from across the genome.

I cannot for now provide a code sample of what I am trying to do because I don't know where to start - if I had a package or anything to build upon it would help me find the solution.

It would be perfect if I could do it in R, but AFAIK I can't do it in biomaRt and I couldn't find a package to do it. I thought of Galaxy, but given its non-trivial workflow and the strange output it produces I would rather stick to R. The devil you know etc.

Help would be much appreciated.

asked Nov 11 '22 18:11 by yotiao


1 Answer

OK, sorry it took me so long, but the paper is submitted and the way I did it finally was to:

1) Download the list of genomic coordinates for whole genes, exons, introns and so-called 3'-UTR exons and 5'-UTR exons from the UCSC Table Browser using the Ensembl gene annotation. The only finicky bit is that you have to download the file for whole genes separately from the rest, and the manual does not explicitly state what "whole gene" means. But if you paste the coordinates it produces into the Genome Browser you can see that it covers the 5' UTR, all introns and exons, and the 3' UTR.

2) Use the BEDtools package (Quinlan and Hall 2010, https://www.ncbi.nlm.nih.gov/pubmed/20110278); a very nice manual with simple examples is here: http://bedtools.readthedocs.org/en/latest/. I used the intersect command with the -f flag, which sets the minimum overlap between my coordinates and the UCSC ones as a fraction of each query interval's length. (For a minimum in bp, the -wo option reports the overlap size in bp for each hit, which can then be filtered.)
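BEDtools expects BED input rather than `chr:start-end` strings, so the locations from the question need converting first. A minimal sketch of that conversion follows; it assumes the input coordinates are 1-based and inclusive (as displayed in genome browsers), and the function name `region_to_bed` is hypothetical, not part of any package.

```python
import re

# Matches strings such as "chr4:154723876-154724615".
REGION_RE = re.compile(r"^(chr[\w.]+):(\d+)-(\d+)$")


def region_to_bed(region: str) -> str:
    """Convert 'chrom:start-end' to a tab-separated BED line.

    Assumption: the input is 1-based and inclusive. BED starts are
    0-based and BED ends are exclusive, so only the start is shifted.
    """
    cleaned = region.strip().replace(",", "")
    match = REGION_RE.match(cleaned)
    if match is None:
        raise ValueError(f"unrecognised region: {region!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start is after end in {region!r}")
    # The 4th BED column (name) keeps the original string so that each
    # hit can be traced back to the location it came from.
    return f"{chrom}\t{start - 1}\t{end}\t{cleaned}"


regions = [
    "chr4:154723876-154724615",
    "chr6:139580853-139581090",
    "chr18:30440532-30441569",
]
bed_lines = [region_to_bed(r) for r in regions]
print("\n".join(bed_lines))
```

Written to a file, these lines can serve as the `-a` input of `bedtools intersect`, with a UCSC feature file as `-b`.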

It worked like a charm - I got a tabulated file with overlaps of each feature. Hope this helps.
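To get from the per-feature overlap files to the single "intron/exon/utr/intergenic" table asked for in the question, something like the sketch below could work. It assumes `bedtools intersect` was run once per feature file with `-wo`, so the last column of each row is the overlap in bp, and that the query name sits in the 4th column. The priority order (utr, then exon, then intron), the 10 bp threshold, the function names and the toy rows are all assumptions made for illustration; the rows are invented and are not real annotation results.

```python
from collections import defaultdict

# Assumed priority when a region overlaps several feature types.
PRIORITY = ["utr", "exon", "intron"]
MIN_OVERLAP_BP = 10  # assumed threshold, taken from the question's example


def parse_wo_lines(lines, feature_type, name_col=3):
    """Yield (region_name, feature_type, overlap_bp) from `-wo` style rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= name_col:
            continue  # skip blank or truncated rows
        yield fields[name_col], feature_type, int(fields[-1])


def classify(region_names, hits, min_overlap=MIN_OVERLAP_BP):
    """Assign one label per region; regions without a hit are intergenic."""
    found = defaultdict(set)
    for name, feature_type, overlap_bp in hits:
        if overlap_bp >= min_overlap:
            found[name].add(feature_type)
    return {
        name: next((f for f in PRIORITY if f in found[name]), "intergenic")
        for name in region_names
    }


# Invented toy rows (query columns, feature columns, overlap in bp).
exon_rows = ["chrT\t99\t200\tregionA\tchrT\t150\t400\texon1\t50"]
intron_rows = [
    "chrT\t99\t200\tregionA\tchrT\t50\t150\tintron1\t51",
    "chrT\t999\t1100\tregionD\tchrT\t900\t2000\tintron2\t101",
]
utr_rows = ["chrT\t499\t600\tregionB\tchrT\t595\t700\tutr1\t5"]

hits = (
    list(parse_wo_lines(exon_rows, "exon"))
    + list(parse_wo_lines(intron_rows, "intron"))
    + list(parse_wo_lines(utr_rows, "utr"))
)
table = classify(["regionA", "regionB", "regionC", "regionD"], hits)
for name, label in table.items():
    print(f"{name}\t{label}")
```

Here regionB ends up intergenic because its only overlap (5 bp) is below the threshold, and regionA is labelled exon rather than intron because of the assumed priority order.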

answered Nov 15 '22 07:11 by yotiao