I have the following data structure in R:
df <- structure(
list(
ID = c(1L, 2L, 3L, 4L, 5L),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = structure(
list(
var2a = c('v', 'w', 'x', 'y', 'z'),
var2b = c('vv', 'ww', 'xx', 'yy', 'zz')),
.Names = c('var2a', 'var2b'),
row.names = c(NA, 5L),
class = 'data.frame'),
var3 = c('aa', 'bb', 'cc', 'dd', 'ee')),
.Names = c('ID', 'var1', 'var2', 'var3'),
row.names = c(NA, 5L),
class = 'data.frame')
# Looks like this:
# ID var1 var2.var2a var2.var2b var3
# 1 1 a v vv aa
# 2 2 b w ww bb
# 3 3 c x xx cc
# 4 4 d y yy dd
# 5 5 e z zz ee
This looks like a normal data frame, and it behaves like one for the most part; but see the length and class properties of the columns below:
class(df)
# [1] "data.frame"
df[1,]
# ID var1 var2.var2a var2.var2b var3
# 1 1 a v vv aa
dim(df)
# [1] 5 4
# One less than expected due to embedded data frame
lapply(df, class)
# $ID
# [1] "integer"
#
# $var1
# [1] "character"
#
# $var2
# [1] "data.frame"
#
# $var3
# [1] "character"
lapply(df, length)
# $ID
# [1] 5
#
# $var1
# [1] 5
#
# $var2
# [1] 2
#
# $var3
# [1] 5
str(df)
# 'data.frame': 5 obs. of 4 variables:
# $ ID : int 1 2 3 4 5
# $ var1: chr "a" "b" "c" "d" ...
# $ var2:'data.frame': 5 obs. of 2 variables:
# ..$ var2a: chr "v" "w" "x" "y" ...
# ..$ var2b: chr "vv" "ww" "xx" "yy" ...
# $ var3: chr "aa" "bb" "cc" "dd" ...
My questions:
I've never come across this before. Is it a common format for some of you out there? What are potential use cases?
I called this "embedded" for lack of a better word. Somebody suggested "nested", but I don't think that's right; see the separate section on tidyverse tibbles below.
I would have expected the structure() command above to fail, because I thought that data.frames are essentially lists in which each element (column) has the same number of elements (rows). This rule seems to be violated in this example, as var2 has length = 2 (the number of columns!). Yet subsetting df surprisingly succeeds in the usual way:
df[3,]
# ID var1 var2.var2a var2.var2b var3
# 3 3 c x xx cc
What's going on?
I don't think I could call it a "nested" structure; that terminology is used for nested data.frames, which would look and behave like this:
library(tidyverse)
df <- data_frame(
x = c(1L, 2L, 3L),
nested = list(data_frame(x = c('a', 'b', 'c')),
data_frame(x = c('a', 'b', 'c')),
data_frame(x = c('d', 'e', 'f'))))
unnest(df)
# # A tibble: 9 × 2
# x x
# <int> <chr>
# 1 1 a
# 2 1 b
# 3 1 c
# 4 2 a
# 5 2 b
# 6 2 c
# 7 3 d
# 8 3 e
# 9 3 f
I think the structure makes it pretty clear:
str(df)
# 'data.frame': 5 obs. of 4 variables:
# $ ID : int 1 2 3 4 5
# $ var1: chr "a" "b" "c" "d" ...
# $ var2:'data.frame': 5 obs. of 2 variables:
# ..$ var2a: chr "v" "w" "x" "y" ...
# ..$ var2b: chr "vv" "ww" "xx" "yy" ...
# $ var3: chr "aa" "bb" "cc" "dd" ...
It's a data.frame with a column (var2) that itself contains a data.frame. This isn't super easy to create, so I'm not quite sure how you did it, but it isn't technically "illegal" in R.
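For what it's worth, here is a minimal sketch of one way such a column can come about in base R. The object names (outer, inner, flat, embedded) are made up for illustration and this is not necessarily how your data was built. As far as I can tell, data.frame() splices the inner columns into the result, whereas assigning with $<- stores the inner data.frame as a single column:

```r
# Hypothetical example data mirroring the columns in the question
outer <- data.frame(ID = 1:5,
                    var1 = c("a", "b", "c", "d", "e"),
                    stringsAsFactors = FALSE)
inner <- data.frame(var2a = c("v", "w", "x", "y", "z"),
                    var2b = c("vv", "ww", "xx", "yy", "zz"),
                    stringsAsFactors = FALSE)

# data.frame() flattens: the inner columns become ordinary columns
flat <- data.frame(outer, var2 = inner)
names(flat)

# Assigning with $<- keeps the inner data.frame as one column
embedded <- outer
embedded$var2 <- inner
ncol(embedded)
class(embedded$var2)
```

So the embedded form tends to appear when a data.frame is assigned into a column rather than passed to data.frame().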
data.frames can contain matrices and other data.frames. So R doesn't just look at the length() of the elements; it looks at the dim() of the elements to see whether they have the right number of "rows".
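A small sketch of the length-versus-rows distinction, using a made-up data.frame called demo (the names demo, m, and bad are assumptions for illustration only). NROW() is used here because it reports the row count for data.frames and matrices alike:

```r
# Hypothetical data.frame with a data.frame column and a matrix column
demo <- data.frame(ID = 1:5)
demo$var2 <- data.frame(var2a = letters[1:5],
                        var2b = LETTERS[1:5],
                        stringsAsFactors = FALSE)
demo$m <- matrix(1:10, nrow = 5)

# length() counts columns of the data.frame and cells of the matrix
length(demo$var2)
length(demo$m)

# NROW() is what lines up with the parent: both have 5 rows
NROW(demo$var2)
NROW(demo$m)

# The parent still counts each of them as a single column
dim(demo)

# A column with the wrong number of rows is rejected
bad <- tryCatch({
  demo$bad <- data.frame(a = 1:3)
  FALSE
}, error = function(e) TRUE)
bad
```

The last step is there only to show that the row count, not the length, is what gets checked on assignment.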
I often "fix" or expand these data.frames using
fixed <- do.call("data.frame", df)
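To illustrate what that call does, here is a sketch on a rebuilt copy of the question's data. The name nested_df is made up, and the data is reconstructed with $<- rather than structure(), so treat it as an approximation of the original object:

```r
# Rebuild something shaped like the question's df (hypothetical name)
nested_df <- data.frame(ID = 1:5,
                        var1 = c("a", "b", "c", "d", "e"),
                        stringsAsFactors = FALSE)
nested_df$var2 <- data.frame(var2a = c("v", "w", "x", "y", "z"),
                             var2b = c("vv", "ww", "xx", "yy", "zz"),
                             stringsAsFactors = FALSE)
nested_df$var3 <- c("aa", "bb", "cc", "dd", "ee")

# Each column is passed to data.frame() as a named argument,
# so the embedded data.frame is spliced into top-level columns
fixed <- do.call("data.frame", nested_df)

names(fixed)
dim(fixed)

# No column of the result is itself a data.frame any more
any(vapply(fixed, is.data.frame, logical(1)))
```

One caveat: in R versions before 4.0.0, data.frame() converts character columns to factors by default, so there you may want to pass stringsAsFactors = FALSE as well.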