
R data frame organisation

Tags:

dataframe

r

I'd like to analyse a sequence of rowing races in R in which boats with 4 rowers each race pairwise against each other. I wonder about the best way to represent this in a data frame. I currently have 12 timed events; every 2 such events constitute a race between two boats.

     time race boat seat1 seat2 seat3 seat4
1  204.98    1    1     2     6     1     5
2  202.49    2    1     4     5     2     7
3  202.27    3    1     2     6     3     7
4  206.48    4    1     1     7     2     8
5  204.85    5    1     4     8     2     6
6  204.93    6    1     2     8     3     5
7  204.91    1    2     3     7     4     8
8  207.40    2    2     1     8     3     6
9  207.62    3    2     1     5     4     8
10 203.41    4    2     3     5     4     6
11 205.04    5    2     3     7     1     5
12 204.96    6    2     4     6     1     7

Here the numbers in the seat columns refer to rowers (so there are 8 of them) but it would be more natural to use names or letters. I need to extract a 12x8 matrix that captures which rower participated in which event.

The code below builds the data frame above:

df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(c(1, 2), each = 6),
  seat1 = c(2,4,2,1,4,2, 3,1,1,3,3,4),
  seat2 = c(6,5,6,7,8,8, 7,8,5,5,7,6),
  seat3 = c(1,2,3,2,2,3, 4,3,4,4,1,1),
  seat4 = c(5,7,7,8,6,5, 8,6,8,6,5,7))

  1. To extract the relation between rowers and events, would it be better to organise this differently?
  2. Would it be natural to capture additional facts about rowers (like their weight or age) in a separate data frame, or is it better to keep everything in one data frame (and if so, how)?

It seems there is a tradeoff between redundancy and convenience. Whereas in a relational database one would use several relations, it appears the R community prefers to share data in a single data frame. I am sure there is always a way to make it work, but lacking the experience, I'd be curious how experienced R users would organise the data.

Addendum: Lots of answers highlight the importance of the questions being asked. Here is one that would benefit from bringing the data into matrix form: the total time each rower spent in races. Given a vector of event times and the {0,1}-valued matrix connecting events and rowers mentioned before, the result could be obtained by multiplying them.

Christian Lindig asked May 18 '20 20:05



2 Answers

This is certainly a matter of opinion (totally agree with @MattB). Data frames are very convenient for many statistical analyses, but you often have to transform them to fit your purpose.

Your case shows a data frame in "wide form". In that form I see no convenient way to add more facts about rowers. I would transform it to "long form". In the long form, each rower in each event gets their own row. And since the rowers seem to be your "object of interest" (your cases), that could probably make things easier. The question "which races did rower 4 take part in?" could be answered easily with that form.
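As a rough illustration of the long form, here is a minimal base R sketch. The names seats, long and rower4 are made up for this example, and other reshaping tools would work just as well.

```r
# Data from the question
df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(1:2, each = 6),
  seat1 = c(2,4,2,1,4,2, 3,1,1,3,3,4),
  seat2 = c(6,5,6,7,8,8, 7,8,5,5,7,6),
  seat3 = c(1,2,3,2,2,3, 4,3,4,4,1,1),
  seat4 = c(5,7,7,8,6,5, 8,6,8,6,5,7))

seats <- paste0("seat", 1:4)

# Long form: one row per (event, rower) pair, 12 events x 4 seats = 48 rows.
# unlist() stacks the seat columns one after another, so the event columns
# are repeated once per seat column to stay aligned.
long <- data.frame(
  time  = rep(df$time, times = length(seats)),
  race  = rep(df$race, times = length(seats)),
  boat  = rep(df$boat, times = length(seats)),
  seat  = rep(seats, each = nrow(df)),
  rower = unlist(df[seats], use.names = FALSE))

# Which races did rower 4 take part in?
rower4 <- long[long$rower == 4, c("race", "boat", "seat", "time")]
rower4[order(rower4$race), ]
```

With the data in this shape, questions about a single rower become simple row filters.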

Jan answered Sep 30 '22 14:09


To create a table of events vs. rowers, melt the data into long form m and then cast it back into the appropriate wide form. There is no reason you can't have the data in multiple forms, so it is really not necessary to choose a single best form. You can always regenerate them if new data comes in. The form of interest really depends on what you want to do with it, but the code below gives you three forms:

  1. the original wide form df,
  2. the long form m which could be useful for regression, boxplots, etc. e.g.

    lm(time ~ factor(rower) + 0, m)
    boxplot(time ~ boat, m)
    
  3. the revised wide form df2.

If there are rower-specific attributes, those could be stored in a separate data frame with one row per rower and one column per attribute. Depending on what you want to do, that data frame could then be merged with m using merge, for example if you want to use those attributes in a regression.

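library(data.table)

# long form: one row per (event, rower); columns 1:3 (time, race, boat) are ids
m <- melt(as.data.table(df), id.vars = 1:3, value.name = "rower")
# revised wide form: one column per rower, NA where the rower did not take part
df2 <- dcast(data = m, time + race + boat ~ rower, value.var = "rower")
setkey(df2, boat, race) # sort
df2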

giving:

      time race boat  1  2  3  4  5  6  7  8
 1: 204.98    1    1  1  2 NA NA  5  6 NA NA
 2: 202.49    2    1 NA  2 NA  4  5 NA  7 NA
 3: 202.27    3    1 NA  2  3 NA NA  6  7 NA
 4: 206.48    4    1  1  2 NA NA NA NA  7  8
 5: 204.85    5    1 NA  2 NA  4 NA  6 NA  8
 6: 204.93    6    1 NA  2  3 NA  5 NA NA  8
 7: 204.91    1    2 NA NA  3  4 NA NA  7  8
 8: 207.40    2    2  1 NA  3 NA NA  6 NA  8
 9: 207.62    3    2  1 NA NA  4  5 NA NA  8
10: 203.41    4    2 NA NA  3  4  5  6 NA NA
11: 205.04    5    2  1 NA  3 NA  5 NA  7 NA
12: 204.96    6    2  1 NA NA  4 NA  6  7 NA
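
For the {0,1} matrix and the total time per rower mentioned in the question's addendum, a minimal base R sketch follows. It works directly from df rather than from df2, and the names seats, inc and total are made up for this illustration.

```r
# Data from the question
df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(1:2, each = 6),
  seat1 = c(2,4,2,1,4,2, 3,1,1,3,3,4),
  seat2 = c(6,5,6,7,8,8, 7,8,5,5,7,6),
  seat3 = c(1,2,3,2,2,3, 4,3,4,4,1,1),
  seat4 = c(5,7,7,8,6,5, 8,6,8,6,5,7))

seats  <- as.matrix(df[paste0("seat", 1:4)])
rowers <- sort(unique(as.vector(seats)))

# 12 x 8 incidence matrix: inc[i, j] is 1 if rower j rowed in event i, else 0.
# apply() over rows returns one column per event, hence the transpose.
inc <- t(apply(seats, 1, function(s) as.integer(rowers %in% s)))
dimnames(inc) <- list(event = seq_len(nrow(df)), rower = rowers)

# Total time per rower: t(inc) %*% time, as suggested in the addendum
total <- drop(crossprod(inc, df$time))
total
```

Each row of inc sums to 4 (four seats per boat), which is a quick sanity check on the construction.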

Alternatively, with dplyr/tidyr:

library(dplyr)
library(tidyr)

m <- df %>%
  pivot_longer(-(1:3), names_to = "seat", values_to = "rower")
df2 <- m %>% 
  pivot_wider(id_cols = 1:3, names_from = rower, values_from = rower, names_sort = TRUE)
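
To illustrate the separate data frame of rower attributes, here is a minimal base R sketch. The rowers data frame is hypothetical: the names and weights are invented placeholders, not data from the question, and m is rebuilt in base R so that the example stands alone.

```r
# Data from the question
df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(1:2, each = 6),
  seat1 = c(2,4,2,1,4,2, 3,1,1,3,3,4),
  seat2 = c(6,5,6,7,8,8, 7,8,5,5,7,6),
  seat3 = c(1,2,3,2,2,3, 4,3,4,4,1,1),
  seat4 = c(5,7,7,8,6,5, 8,6,8,6,5,7))

# Base R stand-in for the long form m: one row per (event, rower)
seats <- paste0("seat", 1:4)
m <- data.frame(
  time  = rep(df$time, times = length(seats)),
  race  = rep(df$race, times = length(seats)),
  boat  = rep(df$boat, times = length(seats)),
  seat  = rep(seats, each = nrow(df)),
  rower = unlist(df[seats], use.names = FALSE))

# HYPOTHETICAL attributes, one row per rower; names and weights are invented
rowers <- data.frame(
  rower  = 1:8,
  name   = LETTERS[1:8],
  weight = c(72, 75, 78, 80, 74, 77, 79, 76))

# Attach the attributes to every (event, rower) row
m2 <- merge(m, rowers, by = "rower")

# Example use: mean event time per named rower
avg <- aggregate(time ~ name, data = m2, FUN = mean)
avg
```

Keeping the attributes in their own data frame avoids repeating them in every event row; the merge is only done when an analysis needs them.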
G. Grothendieck answered Sep 30 '22 12:09