Repeat the rows in a data frame based on values in a specific column [duplicate]

Tags: r, repeat

I would like to repeat entire rows of a data frame based on the value in the samples column, and replace that column with a unique label for each repeated row.

My input:

df <- 'chr start end samples
        1   10   20    2
        2   4    10    3'
df <- read.table(text=df, header=TRUE)

My expected output:

df <- 'chr start end  samples
        1   10   20   1-10-20-s1
        1   10   20   1-10-20-s2
        2   4    10   2-4-10-s1
        2   4    10   2-4-10-s2
        2   4    10   2-4-10-s3'

Any ideas on how to do this cleanly?

user3091668 asked Jul 21 '16 08:07


2 Answers

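We can use expandRows from splitstackshape to repeat each row according to its value in the 'samples' column. We then convert the result to a data.table and, grouped by 'chr', update the 'samples' column with a string that sprintf builds from chr, start, end and the within-group row sequence (1:.N). Grouping by 'chr' is enough here because each chr value occurs in only one input row.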

library(data.table)        # provides setDT() and :=
library(splitstackshape)   # provides expandRows()
setDT(expandRows(df, "samples"))[,
     samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s",1:.N) , chr][]
#  chr start end    samples
#1:   1    10  20 1-10-20-s1
#2:   1    10  20 1-10-20-s2
#3:   2     4  10  2-4-10-s1
#4:   2     4  10  2-4-10-s2
#5:   2     4  10  2-4-10-s3

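NOTE: Depending on the installed splitstackshape version, data.table may already be attached when splitstackshape is loaded; loading it explicitly with library(data.table) is harmless.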

akrun answered Nov 14 '22 22:11


You can achieve this in base R (i.e. without data.table) with the following code:

df <- 'chr start end samples
        1   10   20    2
        2   4    10    3'

df <- read.table(text = df, header = TRUE)

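# Expand one input row into `samples` rows, labelling each copy s1..sN
duplicate_rows <- function(chr, start, end, samples) {
  expanded_samples <- paste0(chr, "-", start, "-", end, "-", "s", seq_len(samples))
  # Keep the original column names (start, end) so the output matches the input
  repeated_rows <- data.frame(chr = chr, start = start, end = end, samples = expanded_samples)

  repeated_rows
}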

expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)

new_df <- do.call(rbind, expanded_rows)

The basic idea is to define a function that will take a single row from your initial data.frame and duplicate rows based on the value in the samples column (as well as creating the distinct character strings you're after). This function is then applied to each row of your initial data.frame. The output is a list of data.frames that then need to be re-combined into a single data.frame using the do.call pattern.
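The same idea can also be written without a per-row function. The following is only a sketch, not the approach described above: it assumes the column names from the question, indexes the data.frame with repeated row numbers via rep(), and numbers the copies with sequence(). The names idx and out are illustrative.

```r
df <- read.table(text = "chr start end samples
1 10 20 2
2 4 10 3", header = TRUE)

# Repeat each row index as many times as its 'samples' value: 1 1 2 2 2
idx <- rep(seq_len(nrow(df)), times = df$samples)
out <- df[idx, c("chr", "start", "end")]

# sequence() counts 1..n for each n in df$samples: 1 2 1 2 3
out$samples <- paste0(out$chr, "-", out$start, "-", out$end, "-s",
                      sequence(df$samples))

# Drop the row names (1, 1.1, 2, ...) left over from the repeated indexing
rownames(out) <- NULL
out
```

Because rep() and sequence() both skip counts of zero, a row whose samples value is 0 simply drops out of the result.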

The code above can be made more concise with Hadley Wickham's purrr package (on CRAN) and its data.frame-specific version of map (see the documentation for the by_row function, which has since moved to the purrrlyr package), but this may be overkill for what you're after.
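As a rough sketch of what a purrr version might look like, assuming purrr is installed: this uses pmap() rather than by_row, since pmap() walks over the columns of a data.frame in parallel and matches them to function arguments by name. The anonymous function and the names expanded and new_df are illustrative.

```r
library(purrr)

df <- read.table(text = "chr start end samples
1 10 20 2
2 4 10 3", header = TRUE)

# pmap() calls the function once per row, passing each column by name
expanded <- pmap(df, function(chr, start, end, samples) {
  data.frame(
    chr = chr, start = start, end = end,
    samples = paste0(chr, "-", start, "-", end, "-s", seq_len(samples))
  )
})

# Combine the per-row data.frames into one
new_df <- do.call(rbind, expanded)
new_df
```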

Alex Ioannides answered Nov 14 '22 21:11