R data frame organisation

Tags: dataframe, r

I'd like to analyse a sequence of rowing races in R where boats with 4 rowers each race pairwise against each other. I wonder about the best way to represent this in a data frame. I currently have 12 timed events; each pair of events with the same race number constitutes a race between two boats.

     time race boat seat1 seat2 seat3 seat4
1  204.98    1    1     2     6     1     5
2  202.49    2    1     4     5     2     7
3  202.27    3    1     2     6     3     7
4  206.48    4    1     1     7     2     8
5  204.85    5    1     4     8     2     6
6  204.93    6    1     2     8     3     5
7  204.91    1    2     3     7     4     8
8  207.40    2    2     1     8     3     6
9  207.62    3    2     1     5     4     8
10 203.41    4    2     3     5     4     6
11 205.04    5    2     3     7     1     5
12 204.96    6    2     4     6     1     7

Here the numbers in the seat columns refer to rowers (so there are 8 of them) but it would be more natural to use names or letters. I need to extract a 12x8 matrix that captures which rower participated in which event.

The code below builds the data frame above:

df <- data.frame ( 
                  time = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
                           204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
                  race = append(1:6, 1:6),
                  boat = append(rep(1,6),rep(2,6)),
                  seat1 = c(2,4,2,1,4,2, 3,1,1,3,3,4),
                  seat2 = c(6,5,6,7,8,8, 7,8,5,5,7,6),
                  seat3 = c(1,2,3,2,2,3, 4,3,4,4,1,1),
                  seat4 = c(5,7,7,8,6,5, 8,6,8,6,5,7))

To extract the relation between rowers and events, would it be better to organise this differently?
Would it be natural to capture additional facts about rowers (like their weight or age) in a separate data frame, or is it better (and if so, how?) to keep everything in one data frame?

It seems there is a tradeoff between redundancy and convenience. Whereas in a relational database one would use several relations, the R community appears to prefer sharing data in a single data frame. I am sure there is always a way to make it work, but lacking the experience, I'd be curious how experienced R users would organise the data.

Addendum: Several answers highlight that the right organisation depends on the questions being asked. Here is one question that would benefit from bringing the data into matrix form: the total time each rower spent in races. Given a vector of event times and the {0,1}-valued matrix mentioned before that connects events and rowers, the result could be obtained by multiplying them.
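A minimal sketch of that product in base R, rebuilding the data frame so it runs on its own. The names `seats`, `inc` and `total` and the `rower1` ... `rower8` column labels are chosen here for illustration only.

```r
# Data frame from the question, rebuilt so the sketch runs on its own.
df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(1:2, each = 6),
  seat1 = c(2, 4, 2, 1, 4, 2, 3, 1, 1, 3, 3, 4),
  seat2 = c(6, 5, 6, 7, 8, 8, 7, 8, 5, 5, 7, 6),
  seat3 = c(1, 2, 3, 2, 2, 3, 4, 3, 4, 4, 1, 1),
  seat4 = c(5, 7, 7, 8, 6, 5, 8, 6, 8, 6, 5, 7)
)

seats <- as.matrix(df[, paste0("seat", 1:4)])

# 12 x 8 indicator matrix: inc[i, j] is 1 if rower j rowed in event i.
# apply() returns one column per event, hence the transpose.
inc <- t(apply(seats, 1, function(crew) as.integer(1:8 %in% crew)))
colnames(inc) <- paste0("rower", 1:8)

# A (1 x 12) vector of times multiplied by the (12 x 8) matrix
# gives the total time per rower.
total <- drop(df$time %*% inc)
total
```

Each row of `inc` sums to 4, since every event has exactly four rowers in the boat.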


asked May 18 '20 20:05

Christian Lindig

2 Answers

This is certainly a matter of opinion (totally agree with @MattB). Data frames are a very convenient way for many statistical analyses but many times you have to transform them to fit your purpose.

Your case shows a data frame in "wide form". I see no convenient way to add more facts about rowers to it, so I would transform it to "long form". In the long form each rower gets their own row for every event they took part in. Since the rowers seem to be your "object of interest" (your cases), that would probably make things easier. A question such as "which races did rower 4 take part in?" can be answered easily in that form.
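As a rough sketch of what that could look like in base R (the names `long`, `event`, `seat_cols` and `rower4` are made up for this illustration), the four seat columns can be stacked into a single `rower` column:

```r
# Data frame from the question, rebuilt so the sketch runs on its own.
df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(1:2, each = 6),
  seat1 = c(2, 4, 2, 1, 4, 2, 3, 1, 1, 3, 3, 4),
  seat2 = c(6, 5, 6, 7, 8, 8, 7, 8, 5, 5, 7, 6),
  seat3 = c(1, 2, 3, 2, 2, 3, 4, 3, 4, 4, 1, 1),
  seat4 = c(5, 7, 7, 8, 6, 5, 8, 6, 8, 6, 5, 7)
)

seat_cols <- paste0("seat", 1:4)

# Stack the four seat columns: one row per (event, seat) pair, 48 rows in all.
long <- data.frame(
  event = rep(seq_len(nrow(df)), times = length(seat_cols)),
  time  = rep(df$time, times = length(seat_cols)),
  race  = rep(df$race, times = length(seat_cols)),
  boat  = rep(df$boat, times = length(seat_cols)),
  seat  = rep(seq_along(seat_cols), each = nrow(df)),
  rower = unlist(df[seat_cols], use.names = FALSE)
)

# Which races did rower 4 take part in, and in which boat and seat?
rower4 <- subset(long, rower == 4, select = c(race, boat, seat))
rower4[order(rower4$race), ]
```

The `event` column is just the row number of the original data frame, kept so that each timed event can still be identified after stacking.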


answered Sep 30 '22 14:09

Jan

To create a table of events vs. rowers, melt the data into long form m and then cast it back into the appropriate wide form. There is no reason you can't have the data in multiple forms, so it is not really necessary to choose a single best form. You can always regenerate them if new data comes in. The form of interest depends on what you want to do with it, but the code below gives you three forms:

the original wide form df,
the long form m which could be useful for regression, boxplots, etc. e.g.
```
lm(time ~ factor(rower) + 0, m)
boxplot(time ~ boat, m)
```
the revised wide form df2.

If there are rower-specific attributes, they could be stored in a separate data frame with one row per rower and one column per attribute. Depending on what you want to do, that data frame could then be joined to m using merge, for example if you want to use the attributes in a regression.

library(data.table)

m <- melt(as.data.table(df), id = 1:3, value.name = "rower")
df2 <- dcast(data = m, time + race + boat ~ rower, value.var = "rower")
setkey(df2, boat, race) # sort
df2

giving:

      time race boat  1  2  3  4  5  6  7  8
 1: 204.98    1    1  1  2 NA NA  5  6 NA NA
 2: 202.49    2    1 NA  2 NA  4  5 NA  7 NA
 3: 202.27    3    1 NA  2  3 NA NA  6  7 NA
 4: 206.48    4    1  1  2 NA NA NA NA  7  8
 5: 204.85    5    1 NA  2 NA  4 NA  6 NA  8
 6: 204.93    6    1 NA  2  3 NA  5 NA NA  8
 7: 204.91    1    2 NA NA  3  4 NA NA  7  8
 8: 207.40    2    2  1 NA  3 NA NA  6 NA  8
 9: 207.62    3    2  1 NA NA  4  5 NA NA  8
10: 203.41    4    2 NA NA  3  4  5  6 NA NA
11: 205.04    5    2  1 NA  3 NA  5 NA  7 NA
12: 204.96    6    2  1 NA NA  4 NA  6  7 NA

Alternately, with dplyr/tidyr:

library(dplyr)
library(tidyr)

m <- df %>%
  pivot_longer(-(1:3), names_to = "seat", values_to = "rower")
df2 <- m %>%
  pivot_wider(id_cols = 1:3, names_from = rower, values_from = rower, names_sort = TRUE)
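The separate rower table mentioned above could be sketched as follows. This is only an illustration: the `rowers` data frame, its `name`, `weight` and `age` columns and all of their values are invented placeholders, and the long form is rebuilt in base R so the snippet does not depend on either package.

```r
# Data frame from the question, rebuilt so the sketch runs on its own.
df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(1:2, each = 6),
  seat1 = c(2, 4, 2, 1, 4, 2, 3, 1, 1, 3, 3, 4),
  seat2 = c(6, 5, 6, 7, 8, 8, 7, 8, 5, 5, 7, 6),
  seat3 = c(1, 2, 3, 2, 2, 3, 4, 3, 4, 4, 1, 1),
  seat4 = c(5, 7, 7, 8, 6, 5, 8, 6, 8, 6, 5, 7)
)

# Long form with one row per (event, rower), built in base R.
m <- data.frame(
  time  = rep(df$time, times = 4),
  race  = rep(df$race, times = 4),
  boat  = rep(df$boat, times = 4),
  rower = unlist(df[paste0("seat", 1:4)], use.names = FALSE)
)

# HYPOTHETICAL rower attributes: letters as names, made-up weights and ages.
rowers <- data.frame(
  rower  = 1:8,
  name   = LETTERS[1:8],
  weight = c(82, 85, 79, 88, 84, 81, 86, 83),  # placeholder values
  age    = c(21, 23, 22, 24, 20, 25, 22, 23)   # placeholder values
)

# One row per (event, rower), now carrying the rower's attributes.
md <- merge(m, rowers, by = "rower")

# Example use: mean event time per rower, labelled by name.
aggregate(time ~ name, data = md, FUN = mean)
```

Keeping the attributes in their own table avoids repeating them in every event row, and the merged form can be regenerated whenever either table changes.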

answered Sep 30 '22 12:09

G. Grothendieck
