Best way to store variable-length data in an R data.frame?

Tags: dataframe, r

I have some mixed-type data that I would like to store in an R data structure of some sort. Each data point has a set of fixed attributes which may be 1-d numeric, factors, or characters, and also a set of variable length data. For example:

id  phrase                    num_tokens  token_lengths
1   "hello world"             2           5 5
2   "greetings"               1           9
3   "take me to your leader"  5           4 2 2 4 6

The actual values are not all computable from one another, but that's the flavor of the data. The operations I want to do include subsetting the data based on boolean functions (e.g. something like nchar(data$phrase) > 10 or lapply(data$token_lengths, length) > 2). I'd also like to pick out values in the variable-length portion by position and average them. This doesn't work, but something like: mean(data$token_lengths[1], na.rm=TRUE)

I've found I can shoehorn "token_lengths" into a data.frame by making it an array:

d <- data.frame(id=c(1,2,3), ..., token_lengths=as.array(list(c(5,5), 9, c(4,2,2,4,6))))

But is this the best way?

asked Feb 23 '10 21:02

Nick

3 Answers

Trying to shoehorn the data into a data frame seems hackish to me. Far better to consider each row as an individual object, then think of the dataset as an array of these objects.

This function converts your data strings to an appropriate format. (This is S3 style code; you may prefer to use one of the 'proper' object oriented systems.)

as.mydata <- function(x)
{
   UseMethod("as.mydata")
}

as.mydata.character <- function(x)
{
   convert <- function(x)
   {
      md <- list()
      md$phrase <- x
      spl <- strsplit(x, " ")[[1]]
      md$num_words <- length(spl)
      md$token_lengths <- nchar(spl)
      class(md) <- "mydata"
      md
   }
   lapply(x, convert)
}

Now your whole dataset looks like

mydataset <- as.mydata(c("hello world", "greetings", "take me to your leader"))

mydataset
[[1]]
$phrase
[1] "hello world"

$num_words
[1] 2

$token_lengths
[1] 5 5

attr(,"class")
[1] "mydata"

[[2]]
$phrase
[1] "greetings"

$num_words
[1] 1

$token_lengths
[1] 9

attr(,"class")
[1] "mydata"

[[3]]
$phrase
[1] "take me to your leader"

$num_words
[1] 5

$token_lengths
[1] 4 2 2 4 6

attr(,"class")
[1] "mydata"

You can define a print method to make this look prettier.

print.mydata <- function(x, ...)
{
   cat(x$phrase, "consists of", x$num_words, "words, with", paste(x$token_lengths, collapse=", "), "letters.\n")
   invisible(x)
}
mydataset
[[1]]
hello world consists of 2 words, with 5, 5 letters.

[[2]]
greetings consists of 1 words, with 9 letters.

[[3]]
take me to your leader consists of 5 words, with 4, 2, 2, 4, 6 letters.

The sample operations you wanted to do are fairly straightforward with data in this format.

sapply(mydataset, function(x) nchar(x$phrase) > 10)
[1]  TRUE FALSE  TRUE
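
The other two operations from the question (subsetting on the number of tokens, and averaging the token lengths at a given position) can be handled in the same spirit. This is only a sketch: it rebuilds the dataset inline so it runs on its own, and the names many_tokens and nth_token_length are made up for illustration.

```r
# Rebuild the example dataset as a list of records (same fields as above).
phrases <- c("hello world", "greetings", "take me to your leader")
mydataset <- lapply(phrases, function(p) {
  spl <- strsplit(p, " ")[[1]]
  structure(
    list(phrase = p, num_words = length(spl), token_lengths = nchar(spl)),
    class = "mydata"
  )
})

# Subset: keep only the records with more than two tokens.
many_tokens <- Filter(function(x) length(x$token_lengths) > 2, mydataset)

# Hypothetical helper: pull the i-th token length out of every record.
# Records with fewer than i tokens yield NA.
nth_token_length <- function(dataset, i) {
  vapply(dataset, function(x) as.numeric(x$token_lengths[i]), numeric(1))
}

# Average by position; na.rm = TRUE drops the records that were too short.
mean_first  <- mean(nth_token_length(mydataset, 1), na.rm = TRUE)
mean_second <- mean(nth_token_length(mydataset, 2), na.rm = TRUE)
```

Filter() returns a plain list, so the result can be passed back through the same functions.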

answered Oct 08 '22 00:10

Richie Cotton

I would just use the data in the "long" format.

E.g.

> d1 <- data.frame(id=1:3, num_words=c(2,1,5), phrase=c("hello world", "greetings", "take me to your leader"), stringsAsFactors=FALSE)
> d2 <- data.frame(id=c(rep(1,2), rep(2,1), rep(3,5)), token_length=c(5,5,9,4,2,2,4,6))
> d2$tokenid <- with(d2, ave(token_length, id, FUN=seq_along))
> d <- merge(d1,d2)
> subset(d, nchar(phrase) > 10)
  id num_words                 phrase token_length tokenid
1  1         2            hello world            5       1
2  1         2            hello world            5       2
4  3         5 take me to your leader            4       1
5  3         5 take me to your leader            2       2
6  3         5 take me to your leader            2       3
7  3         5 take me to your leader            4       4
8  3         5 take me to your leader            6       5
> with(d, tapply(token_length, id, mean))
  1   2   3 
5.0 9.0 3.6

Once the data is in the long format, you can use sqldf or plyr to extract what you want from it.
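
If you would rather avoid an extra package, base R can do the same kind of extraction. A rough sketch, using aggregate() and table() in place of sqldf or plyr; the table is retyped by hand here so the snippet is self-contained, and per_phrase, mean_first and long_ids are just illustrative names.

```r
# Long-format table: one row per token (values from the example above).
d2 <- data.frame(
  id           = c(1, 1, 2, 3, 3, 3, 3, 3),
  tokenid      = c(1, 2, 1, 1, 2, 3, 4, 5),
  token_length = c(5, 5, 9, 4, 2, 2, 4, 6)
)

# Mean token length per phrase (one row per id).
per_phrase <- aggregate(token_length ~ id, data = d2, FUN = mean)

# Mean length of the first token across all phrases.
mean_first <- mean(d2$token_length[d2$tokenid == 1])

# Ids of the phrases that have more than two tokens.
counts   <- table(d2$id)
long_ids <- as.numeric(names(counts)[counts > 2])
```

Because every token is its own row, "average by position" becomes an ordinary filter on tokenid rather than a special case.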

answered Oct 08 '22 00:10

Eduardo Leoni

Another option would be to convert your data frame into a matrix of mode list, where each element of the matrix is a list. Standard array operations (slicing with [, apply(), etc.) would then be applicable.

> d <- data.frame(id=c(1,2,3), num_tokens=c(2,1,5), token_lengths=as.array(list(c(5,5), 9, c(4,2,2,4,6))))
> m <- as.matrix(d)
> mode(m)
[1] "list"
> m[,"token_lengths"]
[[1]]
[1] 5 5

[[2]]
[1] 9

[[3]]
[1] 4 2 2 4 6

> m[3,]
$id
[1] 3

$num_tokens
[1] 5

$token_lengths
[1] 4 2 2 4 6
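
The operations from the question carry over to a list matrix as well. A small sketch, assuming it is acceptable to build the matrix directly with cbind() on lists rather than going through a data frame; m_long, second and mean_second are illustrative names only.

```r
# Build a small matrix of mode list directly; cbind() on lists gives one.
m <- cbind(
  id            = as.list(1:3),
  token_lengths = list(c(5, 5), 9, c(4, 2, 2, 4, 6))
)

# Rows whose token_lengths cell holds more than two values.
keep   <- sapply(m[, "token_lengths"], length) > 2
m_long <- m[keep, , drop = FALSE]

# Second token length of every row; rows without one give NA.
second      <- sapply(m[, "token_lengths"], function(v) v[2])
mean_second <- mean(second, na.rm = TRUE)
```

Note that drop = FALSE keeps the result a matrix even when only one row survives the filter.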

answered Oct 08 '22 00:10

hatmatrix
