My data is grouped by the IDs in V6 and ordered by position (V1:V3): <pre class="prettyprint"><code>dt V1 V2 V3 V4 V5 V6 1: chr1 3054233 3054733 . + ENSMUSG00000090025 2: chr1 3102016 3102125 . + ENSMUSG00000064842 3: chr1 3205901 3207317 . - ENSMUSG00000051951 4: chr1 3206523 3207317 . - ENSMUSG00000051951 5: chr1 3213439 3215632 . - ENSMUSG00000051951 6: chr1 3213609 3216344 . - ENSMUSG00000051951 7: chr1 3214482 3216968 . - ENSMUSG00000051951 8: chr1 3421702 3421901 . - ENSMUSG00000051951 9: chr1 3466587 3466687 . + ENSMUSG00000089699 10: chr1 3513405 3513553 . + ENSMUSG00000089699 </code></pre> What I would like to do is to add and extra column with an index by position, that is, per group in V6 the first element would be "1", the second "2", and so on. I can achieve that using ddply and a custom function: <pre class="prettyprint"><code>rankExons <- function(x){ if(unique(x$V5) == "+"){ x$index <- seq_len(nrow(x))} else{ x$index <- rev(seq_len(nrow(x)))} x } indexed <- ddply(dt, .(V6), rankExons) indexed V1 V2 V3 V4 V5 V6 index 1 chr1 3205901 3207317 . - ENSMUSG00000051951 6 2 chr1 3206523 3207317 . - ENSMUSG00000051951 5 3 chr1 3213439 3215632 . - ENSMUSG00000051951 4 4 chr1 3213609 3216344 . - ENSMUSG00000051951 3 5 chr1 3214482 3216968 . - ENSMUSG00000051951 2 6 chr1 3421702 3421901 . - ENSMUSG00000051951 1 7 chr1 3102016 3102125 . + ENSMUSG00000064842 1 8 chr1 3466587 3466687 . + ENSMUSG00000089699 1 9 chr1 3513405 3513553 . + ENSMUSG00000089699 2 10 chr1 3054233 3054733 . + ENSMUSG00000090025 1 </code></pre> Unfortunately, it is extremely slow on the full dataset (~620k rows) and when using parallel it crashes and burns: <pre class="prettyprint"><code>library(doMC) registerDoMC(cores=6) indexed <- ddply(dt, .(V6), rankExons, .parallel=TRUE) Error: serialization is too large to store in a raw vector Error: serialization is too large to store in a raw vector Error: serialization is too large to store in a raw vector Error: serialization is too large to store in a raw vector Error: serialization is too large to store in a raw vector Error: serialization is too large to store in a raw vector Warning message: In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, : all scheduled cores encountered errors in user code </code></pre> So , I went for data.table but couldn't get it working. Here is what I tried: <pre class="prettyprint"><code>setkey(dt, "V6") dt[,index:=rankExons(dt), by=V6] dt[,rankExons(.sd), by=V6, .SDcols=c("V5, V6")] </code></pre> And both failed. How can I recreate my ddply with data.table? <pre class="prettyprint"><code>dput(dt) structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"), V2 = c(3054233L, 3102016L, 3205901L, 3206523L, 3213439L, 3213609L, 3214482L, 3421702L, 3466587L, 3513405L), V3 = c(3054733L, 3102125L, 3207317L, 3207317L, 3215632L, 3216344L, 3216968L, 3421901L, 3466687L, 3513553L), V4 = c(".", ".", ".", ".", ".", ".", ".", ".", ".", "."), V5 = c("+", "+", "-", "-", "-", "-", "-", "-", "+", "+"), V6 = c("ENSMUSG00000090025", "ENSMUSG00000064842", "ENSMUSG00000051951", "ENSMUSG00000051951", "ENSMUSG00000051951", "ENSMUSG00000051951", "ENSMUSG00000051951", "ENSMUSG00000051951", "ENSMUSG00000089699", "ENSMUSG00000089699" )), .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), class = c("data.table", "data.frame"), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x1de6a88>) </code></pre>

As a fellow bioinformatician, I come across this operation quite frequently. And this is where I adore <code>data.table</code>'s modify subset of rows by reference feature! I'd do it like this: <pre class="prettyprint"><code>dt[V5 == "+", index := 1:.N, by=V6] dt[V5 == "-", index := .N:1, by=V6] </code></pre> No functions required. This is a little more advantageous because it avoids having to check for <code>==</code> <code>"+"</code> or <code>"-"</code> once for every group! Instead, you can first subset all groups with <code>+</code> once and then group by <code>V6</code> and modify just those rows in place! Similarly you do it once again for <code>"-"</code>. Hope that helps. <blockquote> Note: <code>.N</code> is a special variable that contains the number of observations per group. </blockquote>

First, I'll load your sample data into R (you can't currently use <code>dput()</code> with <code>data.table</code>): <pre class="prettyprint"><code>df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = " V1 V2 V3 V4 V5 V6 1 chr1 3205901 3207317 . - ENSMUSG00000051951 2 chr1 3206523 3207317 . - ENSMUSG00000051951 3 chr1 3213439 3215632 . - ENSMUSG00000051951 4 chr1 3213609 3216344 . - ENSMUSG00000051951 5 chr1 3214482 3216968 . - ENSMUSG00000051951 6 chr1 3421702 3421901 . - ENSMUSG00000051951 7 chr1 3102016 3102125 . + ENSMUSG00000064842 8 chr1 3466587 3466687 . + ENSMUSG00000089699 9 chr1 3513405 3513553 . + ENSMUSG00000089699 10 chr1 3054233 3054733 . + ENSMUSG00000090025") </code></pre> You can almost elegantly solve your problem with dplyr: <pre class="prettyprint"><code>library(dplyr) df %>% group_by(V6, V5) %>% mutate(index = row_number(V2)) </code></pre> (I've assume V2 is the variable you want to index by - I think it's better to be explicit rather than relying on the order row of the row) But you want a different summary for different subsets, which isn't currently easy in dplyr. One approach would be to split and then re-combine: <pre class="prettyprint"><code>rbind_list( df %>% filter(V5 == "+") %>% mutate(index = row_number(V2)), df %>% filter(V5 == "-") %>% mutate(index = row_number(desc(V2))) ) </code></pre> But this is going to be relatively slow since you have to make two copies of the data. Another approach would to be use an if inside the summary: <pre class="prettyprint"><code>df %>% group_by(V6, V5) %>% mutate(index = row_number(if (V5[1] == "+") V2 else desc(V2))) </code></pre>

Create an "index" for each element of a group with data.table

Tags:

indexing

r

data.table

plyr

bioinformatics

My data is grouped by the IDs in V6 and ordered by position (V1:V3):

dt
      V1      V2      V3 V4 V5                 V6
 1: chr1 3054233 3054733  .  + ENSMUSG00000090025
 2: chr1 3102016 3102125  .  + ENSMUSG00000064842
 3: chr1 3205901 3207317  .  - ENSMUSG00000051951
 4: chr1 3206523 3207317  .  - ENSMUSG00000051951
 5: chr1 3213439 3215632  .  - ENSMUSG00000051951
 6: chr1 3213609 3216344  .  - ENSMUSG00000051951
 7: chr1 3214482 3216968  .  - ENSMUSG00000051951
 8: chr1 3421702 3421901  .  - ENSMUSG00000051951
 9: chr1 3466587 3466687  .  + ENSMUSG00000089699
10: chr1 3513405 3513553  .  + ENSMUSG00000089699

What I would like to do is to add and extra column with an index by position, that is, per group in V6 the first element would be "1", the second "2", and so on. I can achieve that using ddply and a custom function:

rankExons <- function(x){
   if(unique(x$V5) == "+"){ 
         x$index <- seq_len(nrow(x))}
   else{
         x$index <- rev(seq_len(nrow(x)))}
   x
}

indexed <- ddply(dt, .(V6), rankExons)
indexed
     V1      V2      V3 V4 V5                 V6 index
1  chr1 3205901 3207317  .  - ENSMUSG00000051951     6
2  chr1 3206523 3207317  .  - ENSMUSG00000051951     5
3  chr1 3213439 3215632  .  - ENSMUSG00000051951     4
4  chr1 3213609 3216344  .  - ENSMUSG00000051951     3
5  chr1 3214482 3216968  .  - ENSMUSG00000051951     2
6  chr1 3421702 3421901  .  - ENSMUSG00000051951     1
7  chr1 3102016 3102125  .  + ENSMUSG00000064842     1
8  chr1 3466587 3466687  .  + ENSMUSG00000089699     1
9  chr1 3513405 3513553  .  + ENSMUSG00000089699     2
10 chr1 3054233 3054733  .  + ENSMUSG00000090025     1

Unfortunately, it is extremely slow on the full dataset (~620k rows) and when using parallel it crashes and burns:

library(doMC)
registerDoMC(cores=6)
indexed <- ddply(dt, .(V6), rankExons, .parallel=TRUE)
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Warning message:
In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed,  :
  all scheduled cores encountered errors in user code

So , I went for data.table but couldn't get it working. Here is what I tried:

setkey(dt, "V6")

dt[,index:=rankExons(dt), by=V6]
dt[,rankExons(.sd), by=V6, .SDcols=c("V5, V6")]

And both failed. How can I recreate my ddply with data.table?

dput(dt)
structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1"), V2 = c(3054233L, 3102016L, 
3205901L, 3206523L, 3213439L, 3213609L, 3214482L, 3421702L, 3466587L, 
3513405L), V3 = c(3054733L, 3102125L, 3207317L, 3207317L, 3215632L, 
3216344L, 3216968L, 3421901L, 3466687L, 3513553L), V4 = c(".", 
".", ".", ".", ".", ".", ".", ".", ".", "."), V5 = c("+", "+", 
"-", "-", "-", "-", "-", "-", "+", "+"), V6 = c("ENSMUSG00000090025", 
"ENSMUSG00000064842", "ENSMUSG00000051951", "ENSMUSG00000051951", 
"ENSMUSG00000051951", "ENSMUSG00000051951", "ENSMUSG00000051951", 
"ENSMUSG00000051951", "ENSMUSG00000089699", "ENSMUSG00000089699"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), class = c("data.table", 
"data.frame"), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x1de6a88>)

555

asked Feb 09 '14 11:02

fridaymeetssunday

2 Answers

As a fellow bioinformatician, I come across this operation quite frequently. And this is where I adore data.table's modify subset of rows by reference feature!

I'd do it like this:

dt[V5 == "+", index := 1:.N, by=V6]
dt[V5 == "-", index := .N:1, by=V6]

No functions required. This is a little more advantageous because it avoids having to check for == "+" or "-" once for every group! Instead, you can first subset all groups with + once and then group by V6 and modify just those rows in place!

Similarly you do it once again for "-". Hope that helps.

Note: .N is a special variable that contains the number of observations per group.

184

answered Oct 19 '22 01:10

Arun

First, I'll load your sample data into R (you can't currently use dput() with data.table):

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
V1      V2      V3 V4 V5                 V6
1  chr1 3205901 3207317  .  - ENSMUSG00000051951
2  chr1 3206523 3207317  .  - ENSMUSG00000051951
3  chr1 3213439 3215632  .  - ENSMUSG00000051951
4  chr1 3213609 3216344  .  - ENSMUSG00000051951
5  chr1 3214482 3216968  .  - ENSMUSG00000051951
6  chr1 3421702 3421901  .  - ENSMUSG00000051951
7  chr1 3102016 3102125  .  + ENSMUSG00000064842
8  chr1 3466587 3466687  .  + ENSMUSG00000089699
9  chr1 3513405 3513553  .  + ENSMUSG00000089699
10 chr1 3054233 3054733  .  + ENSMUSG00000090025")

You can almost elegantly solve your problem with dplyr:

library(dplyr)

df %>% 
  group_by(V6, V5) %>%
  mutate(index = row_number(V2))

(I've assume V2 is the variable you want to index by - I think it's better to be explicit rather than relying on the order row of the row)

But you want a different summary for different subsets, which isn't currently easy in dplyr. One approach would be to split and then re-combine:

rbind_list(
  df %>% filter(V5 == "+") %>% mutate(index = row_number(V2)),
  df %>% filter(V5 == "-") %>% mutate(index = row_number(desc(V2)))
)

But this is going to be relatively slow since you have to make two copies of the data.

Another approach would to be use an if inside the summary:

df %>% 
  group_by(V6, V5) %>%
  mutate(index = row_number(if (V5[1] == "+") V2 else desc(V2)))

answered Oct 19 '22 01:10

hadley

Related questions
                            
                                Error while reading csv file in R
                            
                                R re-arrange dataframe: some rows to columns
                            
                                Choropleth map in ggplot with polygons that have holes
                            
                                How to eliminate "NA/NaN/Inf in foreign function call (arg 7)" running predict with randomForest
                            
                                Partial animal string matching in R
                            
                                Daily time series with ts.. how to specify start and end [closed]
                            
                                How to represent a categorical predictor rstan?
                            
                                First-line paragraph indenting in PDFs using R Markdown
                            
                                How do I initialize an empty or example data table?
                            
                                Replace all values of a recursive list with values of a vector
                            
                                R: How to spread (jitter) points with respect to the x axis?
                            
                                Base R horizontal legend - with multiple rows
                            
                                Move R Shiny showNotification to center of screen
                            
                                R: Add title to Leaflet map
                            
                                Running and debugging a script and function in R
                            
                                Function Composition in R (and high level functions)
                            
                                How to easily visualize a matrix?
                            
                                Tail recursion on R Statistical Environment
                            
                                What does the "jitter" function do in R?
                            
                                R return the index of the minimum column for each row

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With