Repeat the rows in a data frame based on values in a specific column [duplicate]

Question

I would like to repeat entire rows in a data frame based on the values in the samples column.

My input:

df <- 'chr start end samples
        1   10   20    2
        2   4    10    3'
df <- read.table(text=df, header=TRUE)

My expected output:

df <- 'chr start end  samples
        1   10   20   1-10-20-s1
        1   10   20   1-10-20-s2
        2   4    10   2-4-10-s1
        2   4    10   2-4-10-s2
        2   4    10   2-4-10-s3'

Any ideas on how to do this efficiently?

akrun · Accepted Answer

We can use expandRows to expand the rows based on the values in the 'samples' column, then convert the result to a data.table. Grouped by 'chr', we use sprintf to paste the columns together with the sequence of rows in each group, and assign the result to the 'samples' column.

library(splitstackshape)
setDT(expandRows(df, "samples"))[,
     samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s",1:.N) , chr][]
#  chr start end    samples
#1:   1    10  20 1-10-20-s1
#2:   1    10  20 1-10-20-s2
#3:   2     4  10  2-4-10-s1
#4:   2     4  10  2-4-10-s2
#5:   2     4  10  2-4-10-s3

NOTE: data.table will be loaded when we load splitstackshape.
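
To make the row-expansion step more concrete, here is a minimal base R sketch of the same idea, assuming the df from the question. It is not the code from this answer: rep and sequence stand in for expandRows and 1:.N, and the name expanded is only illustrative.

```r
df <- read.table(text = "chr start end samples
1 10 20 2
2 4 10 3", header = TRUE)

# Repeat each row index as many times as its 'samples' value: 1 1 2 2 2
idx <- rep(seq_len(nrow(df)), times = df$samples)
expanded <- df[idx, ]

# sequence() restarts the counter for each original row: 1 2 1 2 3
expanded$samples <- paste(expanded$chr, expanded$start, expanded$end,
                          paste0("s", sequence(df$samples)), sep = "-")

# Drop the "1", "1.1", ... row names left over from the indexing
rownames(expanded) <- NULL
expanded
```

One difference to keep in mind: this sketch numbers the copies per original row, whereas the data.table version numbers them per 'chr' group. The two agree for the example data because each chr value appears in only one row.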

Alex Ioannides · Answer

You can achieve this in base R (i.e. without data.table) with the following code:

df <- 'chr start end samples
        1   10   20    2
        2   4    10    3'

df <- read.table(text = df, header = TRUE)

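# Build the repeated rows for a single row of the input
duplicate_rows <- function(chr, start, end, samples) {
  # Labels such as "1-10-20-s1", "1-10-20-s2", ...
  expanded_samples <- paste0(chr, "-", start, "-", end, "-", "s", seq_len(samples))
  # chr, start and end are recycled to the length of expanded_samples;
  # the column names match the input (start/end, not starts/ends)
  repeated_rows <- data.frame("chr" = chr, "start" = start, "end" = end, "samples" = expanded_samples)

  repeated_rows
}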

expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)

new_df <- do.call(rbind, expanded_rows)

The basic idea is to define a function that will take a single row from your initial data.frame and duplicate rows based on the value in the samples column (as well as creating the distinct character strings you're after). This function is then applied to each row of your initial data.frame. The output is a list of data.frames that then need to be re-combined into a single data.frame using the do.call pattern.
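
The Map plus do.call(rbind, ...) pattern described here can be seen in isolation on a toy input. This is only an illustrative sketch: make_block, pieces and combined are hypothetical names, not part of the code above.

```r
# Hypothetical helper: returns a small data frame with n rows for one id
make_block <- function(id, n) {
  data.frame(id = id, k = seq_len(n))
}

# Map calls make_block once per pair of elements and returns a list of data frames
pieces <- Map(make_block, c("a", "b"), c(2, 1))

# do.call passes the list elements as separate arguments,
# so this is equivalent to rbind(pieces[[1]], pieces[[2]])
combined <- do.call(rbind, pieces)
combined
```

The result has one row per repetition (two for "a", one for "b"), which is the same shape of computation used to build new_df.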

The above code can be made cleaner using Hadley Wickham's purrr package (on CRAN) and its data.frame-specific version of map (see the documentation for the by_row function), but this may be overkill for what you're after.

Tags:

r

repeat

user3091668
