R: splitting dataset into quartiles/deciles. What is the right method? [duplicate]

Tags: dataframe, plot, r

I am very new with R, so hoping I can get some pointers on how to achieve the desired manipulation of my data.

I have an array of data with three variables.

    gene_id    fpkm meth_val
1 100629094   0.000   0.0063
2 100628995   0.000   0.0000
3 102655614 111.406   0.0021

I'd like to plot the average meth_val after stratifying my gene_ids based on fpkm into quartiles or deciles.

Once I load my data into a dataframe...

data <- read.delim("myfile.tsv", sep='\t')

I can determine the fpkm deciles using:

quantile(data$fpkm, prob = seq(0, 1, length = 11), type = 5)

which yields

          0%          10%          20%          30%          40%          50%
0.000000e+00 9.783032e-01 7.566164e+00 3.667630e+01 1.379986e+02 3.076280e+02
         60%          70%          80%          90%         100%
5.470552e+02 8.875592e+02 1.486200e+03 2.974264e+03 1.958740e+05

From there, I'd like to split the dataframe into 10 groups according to which of these deciles each fpkm value falls into. Then I'd like to plot the meth_val of each decile in ggplot as a box plot and perform a statistical test across deciles.

The main thing I'm really stuck on is how to split my dataset in the proper way. Any assistance would be hugely appreciated!

Thanks a bunch!

asked Oct 09 '14 08:10

user1995839

3 Answers

Another way would be ntile() in dplyr.

library(tidyverse)

foo <- data.frame(a = 1:100,
                  b = runif(100, 50, 200),
                  stringsAsFactors = FALSE)

foo %>%
    mutate(quantile = ntile(b, 10))

#  a         b quantile
#1 1  93.94754        2
#2 2 172.51323        8
#3 3  99.79261        3
#4 4  81.55288        2
#5 5 116.59942        5
#6 6 128.75947        6
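
As a rough sketch of how this could carry over to the question's columns, the snippet below assigns deciles with ntile() and then averages meth_val per decile. The data frame `expr` and its values are simulated stand-ins, not the asker's data. One caveat worth knowing: ntile() builds evenly sized buckets by rank, so rows with identical fpkm (for example many zeros) can end up in different buckets.

```r
library(dplyr)

set.seed(1)
# Simulated stand-in for the real table; the values are made up.
expr <- data.frame(
  gene_id  = 1:200,
  fpkm     = rexp(200, rate = 0.01),
  meth_val = runif(200, 0, 0.01)
)

decile_means <- expr %>%
  mutate(decile = ntile(fpkm, 10)) %>%   # rank-based buckets 1..10
  group_by(decile) %>%
  summarise(n = n(), mean_meth = mean(meth_val))

decile_means
```

With 200 rows and 10 buckets, every bucket holds the same number of rows, which is the main practical difference from cutting at the quantile values.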

answered Nov 07 '22 09:11

jazzurro

Perhaps easier like this:

data$quantile = cut( data$fpkm, quantile(data$fpkm, prob = seq(0, 1, length = 11), type = 5), include.lowest = TRUE )
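
A small sketch of two details that matter with cut(), using simulated data (the data frame `expr` and the labels D1..D10 are made up for illustration). By default cut() builds intervals that are open on the left, so the row holding the minimum fpkm gets NA unless include.lowest = TRUE is passed. Also note that cut() stops with an error if two break values coincide, which can happen when a large share of fpkm values are zero.

```r
set.seed(42)
# Simulated stand-in for the real table; the values are made up.
expr <- data.frame(
  gene_id  = 1:200,
  fpkm     = rexp(200, rate = 0.01),
  meth_val = runif(200, 0, 0.01)
)

breaks <- quantile(expr$fpkm, probs = seq(0, 1, length.out = 11), type = 5)

# Default intervals are (a, b], so the minimum falls outside the first one.
open_lower <- cut(expr$fpkm, breaks)
n_missing  <- sum(is.na(open_lower))

# include.lowest = TRUE closes the first interval on the left.
expr$decile <- cut(expr$fpkm, breaks,
                   include.lowest = TRUE,
                   labels = paste0("D", 1:10))

# One data frame per decile, if separate pieces are really needed.
by_decile <- split(expr, expr$decile)
sapply(by_decile, nrow)
```

In most cases the factor column alone is enough for plotting and testing; split() is only needed when each group has to be handled as its own data frame.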

answered Nov 07 '22 08:11

Adii_

You can try the cut2 function from the Hmisc package, which cuts a vector into groups at the cutpoints you specify. Here is an example:

library(Hmisc)
data <- data.frame(gene_id=sample(c("A","B","D", 100), 100, replace=TRUE),
               fpkm=abs(rnorm(100, 100, 10)),
               meth_val=abs(rnorm(100, 10, 1)))
quantiles <- quantile(data$fpkm, prob = seq(0, 1, length = 11), type = 5)
data$cutted <- cut2(data$fpkm, cuts = as.numeric(quantiles))

You get the same data frame with an additional column holding the group for each row:

    gene_id      fpkm  meth_val        cutted
1         B 102.16511  8.477469 [100.4,103.2)
2         A 110.59269  9.256172 [106.4,110.9)
3         B  93.15691 10.560936 [ 92.9, 95.3)
4         B 105.74879 10.301358 [103.2,106.4)
5         A  96.12755 11.336484 [ 95.3, 96.8)
6         B 106.29204  8.286120 [103.2,106.4)
...

cut2 can also form the quantile groups itself: pass the number of groups through its g argument (for example g = 10 for deciles) instead of explicit cuts. See ?cut2 for details.
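
For the remaining steps in the question (a box plot per decile and a test across deciles), here is a minimal sketch on simulated data. The data frame `expr` is made up, and the Kruskal-Wallis test is only one possible choice, picked here as an assumption because it does not require normally distributed meth_val. The plot is drawn only if ggplot2 is installed.

```r
set.seed(7)
# Simulated stand-in for the real table; the values are made up.
expr <- data.frame(
  fpkm     = rexp(300, rate = 0.01),
  meth_val = runif(300, 0, 0.01)
)

breaks <- quantile(expr$fpkm, probs = seq(0, 1, length.out = 11), type = 5)
expr$decile <- cut(expr$fpkm, breaks, include.lowest = TRUE, labels = 1:10)

# Assumed choice of test: Kruskal-Wallis rank-sum test across the 10 groups.
kw <- kruskal.test(meth_val ~ decile, data = expr)
kw$p.value

# Box plot of meth_val per fpkm decile, skipped if ggplot2 is unavailable.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(expr, aes(x = decile, y = meth_val)) +
    geom_boxplot() +
    labs(x = "fpkm decile", y = "meth_val")
  print(p)
}
```

Because decile is a factor, ggplot2 draws one box per level without any further splitting of the data frame.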

answered Nov 07 '22 08:11

adomasb
