Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using values from one data file to select rows from another data file

Tags:

r

csv

aggregate

I have data from multiple participants and I am trying to use information from one set of files to aggregate data in another set of files.

I have one data file (videodata.csv) with four columns: Event.Type (video name), Name (start or end), Row (row number), and Source (the participant id code). Set up like this:

videodata <- data.frame(
  Event.Type = c("Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3",
                 "Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3",
                 "Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3"),
  Name = c("start", "end", "start", "end","start", "end",
           "start", "end", "start", "end","start", "end",
           "start", "end", "start", "end","start", "end"),
  Row = c(1, 10, 11, 20, 21, 30,
           8, 16, 19, 28, 1, 7,
           2, 8, 21, 26, 9, 20),
  Source = c("012", "012", "012", "012", "012", "012",
             "013", "013", "013", "013", "013", "013",
             "014", "014", "014", "014", "014", "014")
)

I have like 80 participants, so each of them have another data file in csv format (id_data.csv), with two columns: time and score. Each row corresponds to one second, so the time column is simply numbered from 1 to somewhere around 1,800 or whatever it goes to. score is a binary code of 0 or 1. For participants 012, 013, and 014 data would look like this:

`012_data` <- data.frame(
  time = c(1:30),
  score = c(0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
)

`013_data` <- data.frame(
  time = c(1:30),
  score = c(0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
)

`014_data` <- data.frame(
  time = c(1:30),
  score = c(0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
)

I want to use the information from videodata.csv to slice up id_data.csv and save them into individual files corresponding to Event.Type.

Specifically, I would look in the videodata.csv file to find the id number. For a given id number, I would start with the first Event.Type (eg Movie1) and find the start row and end row. I would use this information to go to the data file for that id number and select those rows of score and save them to a file called movie1.csv. I would repeat for each participant until movie1.csv contained data from all participants.

For example, for participant 012, I would like to slice up the id_data.csv file into three chunks corresponding to Movie1, Movie2, and Movie3. To do this I would:

  1. I would select rows 1-10 (based on Row which indicates the Start and End) by rounding up to the nearest row number
  2. Save the score column to a file called Movie1 with the column header 012 from the Source column
  3. Repeat for participant 013 and append their score column next to the previous row.
  4. Then repeat for participants 014 and so on through 80.

The resulting file for Movie1 might look like this: enter image description here

It includes values of score from rows 1-10 (for 012) 8-16 (for 013) and 2-8 (for 014).

I would then have another file corresponding to Movie2 (by selecting rows 11-20, 19-28, and 21-26), and so on for each participant id. Each participant might have a slightly different number of rows.

This is just beyond my skill and I would appreciate any help as to how I can accomplish this. Thanks!

like image 795
socialresearcher Avatar asked Oct 21 '25 04:10

socialresearcher


1 Answers

As I assume that the numer of rows in each of the _data.csv files might differ I would suppose to combine the data in the following way:

data_list <- list()
data_list[["012_data"]] <- `012_data`
data_list[["013_data"]] <- `013_data`
data_list[["014_data"]] <- `014_data`

I attract your attention that here is a list not a dataframe because of possible variations of data structure and row count. The same effect can be achieved while reading files in a lapply function applied to list.files.

Then we create the "long" data.frame with data from such files.

rm(dl)

library(dplyr)
library(tidyr)

for (filename in names(data_list)) {
  if (exists("dl")) {
    dl <- dl %>% 
      union_all(
        data_list[[filename]] %>% 
          mutate(fileName = strsplit(filename, "_")[[1]][1]) # Here we use base R function
      )
  } else {
    dl <- data_list[[filename]] %>% 
      mutate(fileName = strsplit(filename, "_")[[1]][1])
  }
}

All these steps can be done at file upload.

dl is long and thus handy for further manipulations:

> dl
   time score fileName
1     1     0      012
2     2     0      012
3     3     0      012
4     4     0      012
5     5     1      012
...

All the rest is done in a single pipe:

videodata %>% 
  pivot_wider(names_from = Name, values_from = Row) %>% 
  left_join(dl, by = c("Source" = "fileName")) %>% 
  filter(time >= start& time <= end) %>% 
  select(-c(start, end, time)) %>% 
  group_by(Event.Type, Source) %>% 
  mutate(id = row_number()) %>% 
  pivot_wider(names_from = Source, values_from = score)

You can see the intermediate results executing the pipe stepwise. The pipe returns the following output:

# A tibble: 32 × 5
# Groups:   Event.Type [3]
   Event.Type    id `012` `013` `014`
   <chr>      <int> <dbl> <dbl> <dbl>
 1 Movie1         1     0     0     1
 2 Movie1         2     0     1     0
 3 Movie1         3     0     0     0
 4 Movie1         4     0     0     0
 5 Movie1         5     1     1     1
 6 Movie1         6     0     0     0
 7 Movie1         7     0     0     1
 8 Movie1         8     0     0    NA
 9 Movie1         9     1     1    NA
10 Movie1        10     0    NA    NA
11 Movie2         1     0     0     0
12 Movie2         2     0     0     0
13 Movie2         3     0     0     0
14 Movie2         4     0     0     0
15 Movie2         5     0     0     0
16 Movie2         6     0     0     0
17 Movie2         7     0     0    NA
18 Movie2         8     0     0    NA
19 Movie2         9     0     0    NA
20 Movie2        10     0     0    NA
21 Movie3         1     0     0     0
22 Movie3         2     0     0     0
23 Movie3         3     0     0     0
24 Movie3         4     0     0     0
25 Movie3         5     0     0     0
26 Movie3         6     0     0     0
27 Movie3         7     0     0     0
28 Movie3         8     0    NA     0
29 Movie3         9     0    NA     0
30 Movie3        10     0    NA     0
31 Movie3        11    NA    NA     0
32 Movie3        12    NA    NA     0

The other advantage of this code is that it does not contain any references to movie names and thus scalable without modifications as in other answer. It is robust to differences in number of items selected (see number of elements in Movie1 == 10 and Movie3 == 12). It will also work propwerly if the number of elements for 012 is less then for rightwards columns 013, 014 etc

like image 100
asd-tm Avatar answered Oct 22 '25 17:10

asd-tm