Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Subset rows incrementaly from different files

Tags:

r

subset

I have a couple of files (60) containing tables with identical column names. I write down an exemple of three of them.

what I need is to read from each file (for(i in files) source(i) ?) the data and create a single table out of them where I extract for each ID one single row (with all variables) from each file. The first row starts at a certain date which gets incremented in the next files by one.

For instance, the "starting" row in the first file would be row no. 10 , which start on 10th of Jan. 10 ID1 2000-01-10 20

Now, I jump to the next file, I search for the same ID and subset the row no. 11 (incremented by one day) which represents the dates from the next day, 2000-01-11 Step to the next file, the same ID, subset the row no. 12 (incremented by one compared to the previous file), etc. The number of rows to be subsetted for each ID equals files number. The iteration goes till we reach all rows for ID1 or the last file. Then, we repeat the operations above for the next ID, namely ID2 where the starting row comes from the first file as well and has the same starting date as ID1. It is also row number 10 relative to its ID but 22 in the file: 22 = 10 (ID1 + 2 incremental).

 Day =  seq(as.Date("2000-01-01"), length = 12, by = "days")                
file1: 
set.seed(123)
  df1 <- data.frame( rowid =  c(1:24),
          ID = c("ID1", "ID1","ID1", "ID1","ID1", "ID1","ID1", "ID1",  "ID1", "ID1","ID1", 
              "ID1","ID2", "ID2", "ID2", "ID2","ID2", "ID2", "ID2", "ID2","ID2", "ID2", "ID2", "ID2"),
          Date =  rep(Day, 2),
          Var1 = sample(1:30, 24, TRUE) )
file2:
set.seed(123)
   df2 <- data.frame( rowid =  c(1:24),
              ID = c("ID1", "ID1","ID1", "ID1","ID1", "ID1","ID1", "ID1",  "ID1", "ID1","ID1", "ID1","ID2", "ID2", "ID2", "ID2","ID2", "ID2", "ID2", "ID2","ID2", "ID2", "ID2", "ID2"),
              Date =  rep(Day, 2),
              Var1 = sample(40:60, 24, TRUE))
file3:
set.seed(123)
  df3 <- data.frame(rowid =  c(1:24),
              ID = c("ID1", "ID1","ID1", "ID1","ID1", "ID1","ID1", "ID1",  "ID1", "ID1","ID1", "ID1","ID2", "ID2", "ID2", "ID2","ID2", "ID2", "ID2", "ID2","ID2", "ID2", "ID2", "ID2"),
              Date =  rep(Day, 2),
              Var1 = sample(70:100, 24, TRUE))

  
     

The output should look like:

              10 ID1 2000-01-10   20
              11 ID1 2000-01-11   44
              12 ID1 2000-01-12   83
              22 ID2 2000-01-10    9
              23 ID2 2000-01-11   50
              24 ID2 2000-01-12   98       

Any thought on how to get it? Thank you

like image 1000
gogo88 Avatar asked Dec 31 '22 14:12

gogo88


1 Answers

Preparing your input like so:

df_list <- list(df1, df2, df3)
start <- 10

You could do:

library(purrr)
library(dplyr)

imap_dfr(df_list, \(file, index) split(file, ~ ID) %>%
     map_dfr(~ slice(.x, start - 1 + index))) %>%
  arrange(ID)

Which returns the desired:

  rowid  ID       Date Var1
1    10 ID1 2000-01-10   20
2    11 ID1 2000-01-11   44
3    12 ID1 2000-01-12   83
4    22 ID2 2000-01-10    9
5    23 ID2 2000-01-11   50
6    24 ID2 2000-01-12   98
like image 86
ktiu Avatar answered Jan 16 '23 04:01

ktiu