I would like to generate a plot depicting 14 linear chromosomes for the organism I work on, to scale, with coloured bars at specified locations along each chromosome. Ideally I'd like to use R as this is the only programming language I have experience with.
I have explored various ways of doing this e.g. with GenomeGraphs but I have found this is all more complicated than what I want/ displays a lot more data than what I have (e.g. displaying cytogenic bands) and is often specific for human chromosomes.
All I essentially want is 14 grey bars of the following sizes:
chromosome size
1 640851
2 947102
3 1067971
4 1200490
5 1343557
6 1418242
7 1445207
8 1472805
9 1541735
10 1687656
11 2038340
12 2271494
13 2925236
14 3291936
And then to have coloured marks depicting about 150 locations scattered along the chromosome lengths. e.g. marks at these loci:
Chromosome Position
3 817702
12 1556936
13 1131566
Ideally I would also like to be able to specify a few different colours depending on the loci, e.g.
Chromosome Position Type
3 817702 A
12 1556936 A
13 1131566 A
5 1041685 B
11 488717 B
14 1776463 B
Where 'A' was marked in blue and 'B' was marked in green, for example.
A very similar plot to what I would like to produce is pasted in this image (from Bopp et al. PlOS Genetics 2013;9(2):e1003293):
Can anyone recommend a way of doing this? It doesn't necessarily have to be a bioinformatics package, if there is another way I can use R to generate 14 bars of certain proportional sizes with markings at specified locations along the bars. e.g. I've been thinking about modifying a simple bar chart from ggplot2 but I don't know how to put the markings along the bars at specific locations.
The most common tool scientists use to visualize chromosomes or parts of chromosomes is called fluorescence in situ hybridization, or FISH. FISH uses a piece of DNA, called a probe, that is labeled with a fluorescent molecule to “paint” each chromosome pair a different color.
The number indicating the gene position increases with distance from the centromere. For example: 14q21 represents position 21 on the long arm of chromosome 14. 14q21 is closer to the centromere than 14q22.
Finally, chromosome maps, called idiograms, provide a pictorial reference point that is useful for locating the positions of individual genes on chromosomes, as well as for identifying various abnormalities associated with a range of chromosomal disorders.
Each chromosome has a distinct banding pattern, and each band is numbered to help identify a particular region of a chromosome. This method of mapping a gene to a particular band of the chromosome is called cytogenetic mapping.
Just save your barplot
call and then call segments
to make the marks at an appropriate location. E.g.:
bp <- barplot(dat$size, border=NA, col="grey80")
with(marks,
segments(
bp[Chromosome,]-0.5,
Position,
bp[Chromosome,]+0.5,
Position,
col=Type,
lwd=2,
lend=1
)
)
Data used:
dat <- structure(list(chromosome = 1:14, size = c(640851L, 947102L,
1067971L, 1200490L, 1343557L, 1418242L, 1445207L, 1472805L, 1541735L,
1687656L, 2038340L, 2271494L, 2925236L, 3291936L)), .Names = c("chromosome",
"size"), class = "data.frame", row.names = c(NA, -14L))
marks <- structure(list(Chromosome = c(3L, 12L, 13L, 5L, 11L, 14L), Position = c(817702L,
1556936L, 1131566L, 1041685L, 488717L, 1776463L), Type = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("Chromosome",
"Position", "Type"), class = "data.frame", row.names = c(NA,
-6L))
Here is a general solution for drawing these kinds of plots, adapted from this post.
I chose to use geom_rect
for this, because it allowed for more fine-tuned adjustment of the shape sizes, and allows the shapes to scale with resolution; I think that geom_segment
widths do not scale.
Also note that using this method, the marks for gene alteration locations are drawn to scale, which means they might come out so thin as to be not easily visible on the plot; you can use your discretion to adjust it to some minimum size if you want.
library("ggplot2") # for the plot
library("ggrepel") # for spreading text labels on the plot, you can replace with `geom_text` if you want
library("scales") # for axis labels notation
# insert your steps to load data from tabular files or other sources here;
# dummy datasets taken directly from files shown in this example
# data with the copy number alterations for the sample
sample_cns <- structure(list(gene = c("AC116366.7", "ANKRD24", "APC", "SNAPC3",
"ARID1A", "ATM", "BOD1L1", "BRCA1", "C11orf65", "CHD5"), chromosome = c("chr5",
"chr19", "chr5", "chr9", "chr1", "chr11", "chr4", "chr17", "chr11",
"chr1"), start = c(131893016L, 4183350L, 112043414L, 15465517L,
27022894L, 108098351L, 13571634L, 41197694L, 108180886L, 6166339L
), end = c(131978056L, 4224502L, 112179823L, 15465578L, 27107247L,
108236235L, 13629211L, 41276113L, 108236235L, 6240083L), cn = c(1L,
1L, 1L, 7L, 1L, 1L, 3L, 3L, 1L, 1L), CNA = c("loss", "loss",
"loss", "gain", "loss", "loss", "gain", "gain", "loss", "loss"
)), .Names = c("gene", "chromosome", "start", "end", "cn", "CNA"
), row.names = c(NA, 10L), class = "data.frame")
# > head(sample_cns)
# gene chromosome start end cn CNA
# 1 AC116366.7 chr5 131893016 131978056 1 loss
# 2 ANKRD24 chr19 4183350 4224502 1 loss
# 3 APC chr5 112043414 112179823 1 loss
# 4 SNAPC3 chr9 15465517 15465578 7 gain
# 5 ARID1A chr1 27022894 27107247 1 loss
# 6 ATM chr11 108098351 108236235 1 loss
# hg19 chromosome sizes
chrom_sizes <- structure(list(chromosome = c("chrM", "chr1", "chr2", "chr3", "chr4",
"chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12",
"chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19",
"chr20", "chr21", "chr22", "chrX", "chrY"), size = c(16571L, 249250621L,
243199373L, 198022430L, 191154276L, 180915260L, 171115067L, 159138663L,
146364022L, 141213431L, 135534747L, 135006516L, 133851895L, 115169878L,
107349540L, 102531392L, 90354753L, 81195210L, 78077248L, 59128983L,
63025520L, 48129895L, 51304566L, 155270560L, 59373566L)), .Names = c("chromosome",
"size"), class = "data.frame", row.names = c(NA, -25L))
# > head(chrom_sizes)
# chromosome size
# 1 chrM 16571
# 2 chr1 249250621
# 3 chr2 243199373
# 4 chr3 198022430
# 5 chr4 191154276
# 6 chr5 180915260
# hg19 centromere locations
centromeres <- structure(list(chromosome = c("chr1", "chr2", "chr3", "chr4",
"chr5", "chr6", "chr7", "chr8", "chr9", "chrX", "chrY", "chr10",
"chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17",
"chr18", "chr19", "chr20", "chr21", "chr22"), start = c(121535434L,
92326171L, 90504854L, 49660117L, 46405641L, 58830166L, 58054331L,
43838887L, 47367679L, 58632012L, 10104553L, 39254935L, 51644205L,
34856694L, 16000000L, 16000000L, 17000000L, 35335801L, 22263006L,
15460898L, 24681782L, 26369569L, 11288129L, 13000000L), end = c(124535434L,
95326171L, 93504854L, 52660117L, 49405641L, 61830166L, 61054331L,
46838887L, 50367679L, 61632012L, 13104553L, 42254935L, 54644205L,
37856694L, 19000000L, 19000000L, 20000000L, 38335801L, 25263006L,
18460898L, 27681782L, 29369569L, 14288129L, 16000000L)), .Names = c("chromosome",
"start", "end"), class = "data.frame", row.names = c(NA, -24L
))
# > head(centromeres)
# chromosome start end
# 1 chr1 121535434 124535434
# 2 chr2 92326171 95326171
# 3 chr3 90504854 93504854
# 4 chr4 49660117 52660117
# 5 chr5 46405641 49405641
# 6 chr6 58830166 61830166
# create an ordered factor level to use for the chromosomes in all the datasets
chrom_order <- c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7",
"chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14",
"chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21",
"chr22", "chrX", "chrY", "chrM")
chrom_key <- setNames(object = as.character(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25)),
nm = chrom_order)
chrom_order <- factor(x = chrom_order, levels = rev(chrom_order))
# convert the chromosome column in each dataset to the ordered factor
chrom_sizes[["chromosome"]] <- factor(x = chrom_sizes[["chromosome"]],
levels = chrom_order)
sample_cns[["chromosome"]] <- factor(x = sample_cns[["chromosome"]],
levels = chrom_order)
centromeres[["chromosome"]] <- factor(x = centromeres[["chromosome"]],
levels = chrom_order)
# create a color key for the plot
group.colors <- c(gain = "red", loss = "blue")
ggplot(data = chrom_sizes) +
# base rectangles for the chroms, with numeric value for each chrom on the x-axis
geom_rect(aes(xmin = as.numeric(chromosome) - 0.2,
xmax = as.numeric(chromosome) + 0.2,
ymax = size, ymin = 0),
colour="black", fill = "white") +
# rotate the plot 90 degrees
coord_flip() +
# black & white color theme
theme(axis.text.x = element_text(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank()) +
# give the appearance of a discrete axis with chrom labels
scale_x_discrete(name = "chromosome", limits = names(chrom_key)) +
# add bands for centromeres
geom_rect(data = centromeres, aes(xmin = as.numeric(chromosome) - 0.2,
xmax = as.numeric(chromosome) + 0.2,
ymax = end, ymin = start)) +
# add bands for CNA value
geom_rect(data = sample_cns, aes(xmin = as.numeric(chromosome) - 0.2,
xmax = as.numeric(chromosome) + 0.2,
ymax = end, ymin = start, fill = CNA)) +
scale_fill_manual(values = group.colors) +
# add 'gain' gene markers
geom_text_repel(data = subset(sample_cns, sample_cns$CNA == "gain"),
aes(x = chromosome, y = start, label = gene),
color = "red", show.legend = FALSE) +
# add 'loss' gene markers
geom_text_repel(data = subset(sample_cns, sample_cns$CNA == "loss"),
aes(x = chromosome, y = start, label = gene ),
color = "blue", show.legend = FALSE) +
ggtitle("Copy Number Alterations") +
# supress scientific notation on the y-axis
scale_y_continuous(labels = comma) +
ylab("region (bp)")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With