Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"Embedded" data.frame in R. What is it, what is it called, why does it behave the way it does?

I have the following data structure in R:

df <- structure(
  list(
    ID = c(1L, 2L, 3L, 4L, 5L),
    var1 = c('a', 'b', 'c', 'd', 'e'),
    var2 = structure(
      list(
        var2a = c('v', 'w', 'x', 'y', 'z'),
        var2b = c('vv', 'ww', 'xx', 'yy', 'zz')),
      .Names = c('var2a', 'var2b'),
      row.names = c(NA, 5L),
      class = 'data.frame'),
    var3 = c('aa', 'bb', 'cc', 'dd', 'ee')),
  .Names = c('ID', 'var1', 'var2', 'var3'),
  row.names = c(NA, 5L),
  class = 'data.frame')

# Looks like this:
#   ID var1 var2.var2a var2.var2b var3
# 1  1    a          v         vv   aa
# 2  2    b          w         ww   bb
# 3  3    c          x         xx   cc
# 4  4    d          y         yy   dd
# 5  5    e          z         zz   ee

This looks like a normal data frame, and it behaves like that for the most part; but see length and class properties of the columns below:

class(df)
# [1] "data.frame"

df[1,]
# ID var1 var2.var2a var2.var2b var3
# 1     a          v         vv   aa

dim(df)
# [1] 5 4
# One less than expected due to embedded data frame

lapply(df, class)
# $ID
# [1] "integer"
# 
# $var1
# [1] "character"
# 
# $var2
# [1] "data.frame"
# 
# $var3
# [1] "character"

lapply(df, length)
# $ID
# [1] 5
#
# $var1
# [1] 5
#
# $var2
# [1] 2
#
# $var3
# [1] 5
# str(df)

# 'data.frame': 5 obs. of  4 variables:
#   $ ID  : int  1 2 3 4 5
# $ var1: chr  "a" "b" "c" "d" ...
# $ var2:'data.frame':  5 obs. of  2 variables:
#   ..$ var2a: chr  "v" "w" "x" "y" ...
# ..$ var2b: chr  "vv" "ww" "xx" "yy" ...
# $ var3: chr  "aa" "bb" "cc" "dd" ...

My questions:

1) What is this?

I've never come across this before. Is it a common format for some of you out there? What are potential use cases?

2) What is this called?

I called this "embedded" for lack of a better word. Somebody suggested "nested", but I don't think that's right, see separate section with tidyverse tibbles below.

3) Why is it allowed?

I would have expected the structure command above to fail, because I though that data.frames are essentially lists, where each element (column) has the same number of elements (rows). This rule seems violated in this example, as var2 has length = 2 (number of columns!). Yet, subsetting df surprisingly succeeds in the usual way:

df[3,]
#   ID var1 var2.var2a var2.var2b var3
# 3  3    c          x         xx   cc

What's going on?


I don't think I could call it a "nested" structure, that terminology is used for nested data.frames which would look and behave like this:

library(tidyverse)
df <- data_frame(
  x = c(1L, 2L, 3L),
  nested = list(data_frame(x = c('a', 'b', 'c')), 
                data_frame(x = c('a', 'b', 'c')), 
                data_frame(x = c('d', 'e', 'f'))))
unnest(df)
# # A tibble: 9 × 2
#       x     x
#   <int> <chr>
# 1     1     a
# 2     1     b
# 3     1     c
# 4     2     a
# 5     2     b
# 6     2     c
# 7     3     d
# 8     3     e
# 9     3     f
like image 807
JanLauGe Avatar asked Apr 07 '17 15:04

JanLauGe


1 Answers

I think the strucutre makes it pretty clear

str(df)
# 'data.frame':   5 obs. of  4 variables:
#  $ ID  : int  1 2 3 4 5
#  $ var1: chr  "a" "b" "c" "d" ...
#  $ var2:'data.frame':   5 obs. of  2 variables:
#   ..$ var2a: chr  "v" "w" "x" "y" ...
#   ..$ var2b: chr  "vv" "ww" "xx" "yy" ...
#  $ var3: chr  "aa" "bb" "cc" "dd" ...

It's a data.frame with a column (var2) that contains a data.frame. This isn't super easy to create so i'm not quite sure how you did it but it isn't technically "illegal" in R.

data.frames can contain matrices and other data.frames. So it doesn't just look at the length() of the elements, it looks at the dim() of the elements to see if it has the right number of "rows".

I often "fix" or expand these data.frames using

fixed <- do.call("data.frame", df)
like image 186
MrFlick Avatar answered Oct 23 '22 10:10

MrFlick