I am very new to R, so I'm hoping I can get some pointers on how to achieve the desired manipulation of my data.
I have an array of data with three variables.
  gene_id       fpkm  meth_val
1 100629094     0.000 0.0063
2 100628995     0.000 0.0000
3 102655614   111.406 0.0021
I'd like to plot the average meth_val after stratifying my gene_ids based on fpkm into quartiles or deciles.
Once I load my data into a dataframe...
data <- read.delim("myfile.tsv", sep='\t')
I can determine the fpkm deciles using:
quantile(data$fpkm, prob = seq(0, 1, length = 11), type = 5)
which yields
          0%          10%          20%          30%          40%          50%
0.000000e+00 9.783032e-01 7.566164e+00 3.667630e+01 1.379986e+02 3.076280e+02
         60%          70%          80%          90%         100%
5.470552e+02 8.875592e+02 1.486200e+03 2.974264e+03 1.958740e+05
From there, I'd like to split the dataframe into 10 groups according to which decile each fpkm value falls into. Then I'd like to plot the meth_val of each decile in ggplot as a box plot and perform a statistical test across deciles.
The main thing I'm really stuck on is how to split my dataset in the proper way. Any assistance would be hugely appreciated!
Thanks a bunch!
To place each data value into a decile, we can use the ntile(x, ngroups) function from the dplyr package in R. The output is read as follows: a value that falls between the 0% and 10% percentiles is assigned to the first decile, a value between 10% and 20% to the second, and so on.
To organise data points into quartiles, list them in ascending order (as on a number line) and divide them into four groups of equal size. Quartiles are similar to the median, dividing the data into four equal parts rather than two.
There are several formulae in use for calculating deciles. One of the simplest adds one to the number of data points in the population, divides the sum by ten, and multiplies the result by the rank of the decile (1 for D1, 2 for D2, ... 9 for D9). The result is the position of that decile in the sorted data.
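The rank rule above can be sketched in base R. This is only an illustration: the function name decile_by_rank is made up for this example, and linear interpolation between neighbouring values for fractional positions is an assumption, since the text does not say how to handle them.

```r
# Deciles via the (n + 1) / 10 * i rank rule.
# decile_by_rank is a hypothetical helper, not part of any package.
decile_by_rank <- function(x) {
  x <- sort(x)
  n <- length(x)
  pos <- (n + 1) * (1:9) / 10          # position of D1..D9 in the sorted data
  pos <- pmin(pmax(pos, 1), n)         # keep positions inside 1..n
  lo <- floor(pos)                     # rank just below the position
  hi <- pmin(lo + 1, n)                # rank just above the position
  frac <- pos - lo                     # fractional part (assumed: interpolate)
  d <- x[lo] + frac * (x[hi] - x[lo])
  names(d) <- paste0("D", 1:9)
  d
}

x <- 1:19
decile_by_rank(x)   # positions are 2, 4, ..., 18, so D1 = 2, D2 = 4, ..., D9 = 18
```

As far as I can tell this matches quantile(x, probs = (1:9) / 10, type = 6). The question uses type = 5, which places the cut points slightly differently.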
Another way would be ntile() in dplyr.
library(tidyverse)
foo <- data.frame(a = 1:100,
                  b = runif(100, 50, 200),
                  stringsAsFactors = FALSE)
foo %>%
    mutate(quantile = ntile(b, 10))
#  a         b quantile
#1 1  93.94754        2
#2 2 172.51323        8
#3 3  99.79261        3
#4 4  81.55288        2
#5 5 116.59942        5
#6 6 128.75947        6
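Applied to the question, the same idea might look like the sketch below. The data are simulated stand-ins (the real file is not available here), the column name decile is made up for this example, and the Kruskal-Wallis test is just one reasonable choice of test across groups, not necessarily the right one for your data.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
# Simulated stand-in for the question's table; the values are made up.
sim <- data.frame(
  gene_id  = seq_len(200),
  fpkm     = rexp(200, rate = 0.01),
  meth_val = runif(200, 0, 0.01)
)

# Assign each gene to an fpkm decile (1 = lowest, 10 = highest).
sim <- sim %>%
  mutate(decile = factor(ntile(fpkm, 10)))

# Mean meth_val and group size per decile.
decile_means <- sim %>%
  group_by(decile) %>%
  summarise(mean_meth = mean(meth_val), n = n())

# Box plot of meth_val within each decile; print(p) draws it.
p <- ggplot(sim, aes(x = decile, y = meth_val)) +
  geom_boxplot() +
  labs(x = "fpkm decile", y = "meth_val")

# Kruskal-Wallis test across the ten groups (an assumed choice of test).
kw <- kruskal.test(meth_val ~ decile, data = sim)
```

One thing to keep in mind: ntile() makes the groups as equal in size as possible, so tied values (for example many genes with fpkm of 0) can end up in different deciles. If ties should stay together, a cut()-based split may suit better.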
Perhaps easier like this:
data$quantile <- cut( data$fpkm, quantile(data$fpkm, prob = seq(0, 1, length = 11), type = 5), include.lowest = TRUE )
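One caveat with cut() and quantile breaks: the intervals are left-open by default, so the smallest value gets NA unless include.lowest = TRUE is set. A small sketch with made-up fpkm values (the D1..D10 labels are also just an illustration):

```r
# Made-up fpkm values, all distinct so the quantile breaks are unique.
fpkm <- c(0, 0.4, 1.5, 3, 7, 12, 40, 150, 600, 2500, 9000)
breaks <- quantile(fpkm, probs = seq(0, 1, length.out = 11), type = 5)

# Default: intervals are (a, b], so the minimum (0) falls outside and becomes NA.
default_bins <- cut(fpkm, breaks)
sum(is.na(default_bins))

# include.lowest = TRUE closes the first interval on the left.
fixed_bins <- cut(fpkm, breaks, include.lowest = TRUE,
                  labels = paste0("D", 1:10))
table(fixed_bins, useNA = "ifany")
```

If many fpkm values are tied (say, more than 10% are exactly 0), two or more quantiles coincide and cut() stops with an error about non-unique breaks. In that case the breaks need deduplicating, or a rank-based split can be used instead.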
You can try the cut2 function from the Hmisc library. It cuts a vector into groups at the cutpoints you supply. Here is an example:
library(Hmisc)
data <- data.frame(gene_id=sample(c("A","B","D", 100), 100, replace=TRUE),
               fpkm=abs(rnorm(100, 100, 10)),
               meth_val=abs(rnorm(100, 10, 1)))
quantiles <- quantile(data$fpkm, prob = seq(0, 1, length = 11), type = 5)
data$cutted <- cut2(data$fpkm, cuts = as.numeric(quantiles))
And you will get the same data frame with an additional column holding the interval each row falls into:
    gene_id      fpkm  meth_val        cutted
1         B 102.16511  8.477469 [100.4,103.2)
2         A 110.59269  9.256172 [106.4,110.9)
3         B  93.15691 10.560936 [ 92.9, 95.3)
4         B 105.74879 10.301358 [103.2,106.4)
5         A  96.12755 11.336484 [ 95.3, 96.8)
6         B 106.29204  8.286120 [103.2,106.4)
...
cut2 can also split directly into quantile groups through its g argument, without computing the cutpoints yourself. Read more in ?cut2.
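A minimal sketch of the quantile-group form, using simulated data like the example above (the decile column name is made up here):

```r
library(Hmisc)

set.seed(42)
# Simulated data; the values are made up.
sim <- data.frame(
  fpkm     = abs(rnorm(100, 100, 10)),
  meth_val = abs(rnorm(100, 10, 1))
)

# g = 10 asks cut2 for ten quantile groups; no explicit cutpoints needed.
sim$decile <- cut2(sim$fpkm, g = 10)

table(sim$decile)                        # group sizes
tapply(sim$meth_val, sim$decile, mean)   # mean meth_val per group
```

The resulting factor can be passed straight to ggplot as the x aesthetic for a box plot, or used as the grouping variable in a test.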