Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to store variable-length data in an R data.frame?

Tags:

dataframe

r

I have some mixed-type data that I would like to store in an R data structure of some sort. Each data point has a set of fixed attributes which may be 1-d numeric, factors, or characters, and also a set of variable length data. For example:

id  phrase                    num_tokens  token_lengths
1   "hello world"             2           5 5
2   "greetings"               1           9
3   "take me to your leader"  4           4 2 2 4 6

The actual values are not all computable from one another, but that's the flavor of the data. The operations I'm going to want to do include subsetting the data based on boolean functions (e.g. something like nchar(data$phrase) > 10 or lapply(data$token_lengths, length) > 2). I'd also like to index and average values in the variable length portion by index. This doesn't work, but something like: mean(data$token_lengths[1], na.rm=TRUE))

I've found I can shoehorn "token_lengths" into a data.frame by making it an array:

d <- data.frame(id=c(1,2,3), ..., token_lengths=as.array(list(c(5,5), 9, c(4,2,2,4,6)))

But is this the best way?

like image 621
Nick Avatar asked Feb 23 '10 21:02

Nick


People also ask

How do I store a variable in a Dataframe in R?

Dataframe: use the data. frame() function to combine variables together. Here you must use the cbind() function to “column bind” the variables. Notice how I can mix numeric columns with character columns, which is also not possible in matrices.

Why the length () function is used in data frame in R?

length() function in R Programming Language is used to get or set the length of a vector (list) or other objects.

How do I organize data in R studio?

There is a function in R that you can use (called the sort function) to sort your data in either ascending or descending order. The variable by which sort you can be a numeric, string or factor variable. You also have some options on how missing values will be handled: they can be listed first, last or removed.


3 Answers

Trying to shoehorn the data into a data frame seems hackish to me. Far better to consider each row as an individual object, then think of the dataset as an array of these objects.

This function converts your data strings to an appropriate format. (This is S3 style code; you may prefer to use one of the 'proper' object oriented systems.)

as.mydata <- function(x)
{
   UseMethod("as.mydata")
}

as.mydata.character <- function(x)
{
   convert <- function(x)
   {
      md <- list()
      md$phrase = x
      spl <- strsplit(x, " ")[[1]]
      md$num_words <- length(spl)
      md$token_lengths <- nchar(spl)
      class(md) <- "mydata"
      md
   }
   lapply(x, convert)
}

Now your whole dataset looks like

mydataset <- as.mydata(c("hello world", "greetings", "take me to your leader"))

mydataset
[[1]]
$phrase
[1] "hello world"

$num_words
[1] 2

$token_lengths
[1] 5 5

attr(,"class")
[1] "mydata"

[[2]]
$phrase
[1] "greetings"

$num_words
[1] 1

$token_lengths
[1] 9

attr(,"class")
[1] "mydata"

[[3]]
$phrase
[1] "take me to your leader"

$num_words
[1] 5

$token_lengths
[1] 4 2 2 4 6

attr(,"class")
[1] "mydata"

You can define a print method to make this look prettier.

print.mydata <- function(x)
{
   cat(x$phrase, "consists of", x$num_words, "words, with", paste(x$token_lengths, collapse=", "), "letters.")
}
mydataset
[[1]]
hello world consists of 2 words, with 5, 5 letters.
[[2]]
greetings consists of 1 words, with 9 letters.
[[3]]
take me to your leader consists of 5 words, with 4, 2, 2, 4, 6 letters.

The sample operations you wanted to do are fairly straightforward with data in this format.

sapply(mydataset, function(x) nchar(x$phrase) > 10)
[1]  TRUE FALSE  TRUE
like image 195
Richie Cotton Avatar answered Oct 08 '22 00:10

Richie Cotton


I would just use the data in the "long" format.

E.g.

> d1 <- data.frame(id=1:3, num_words=c(2,1,4), phrase=c("hello world", "greetings", "take me to your leader"))
> d2 <- data.frame(id=c(rep(1,2), rep(2,1), rep(3,5)), token_length=c(5,5,9,4,2,2,4,6))
> d2$tokenid <- with(d2, ave(token_length, id, FUN=seq_along))
> d <- merge(d1,d2)
> subset(d, nchar(phrase) > 10)
  id num_words                 phrase token_length tokenid
1  1         2            hello world            5       1
2  1         2            hello world            5       2
4  3         4 take me to your leader            4       1
5  3         4 take me to your leader            2       2
6  3         4 take me to your leader            2       3
7  3         4 take me to your leader            4       4
8  3         4 take me to your leader            6       5
> with(d, tapply(token_length, id, mean))
  1   2   3 
5.0 9.0 3.6 

Once the data is in the long format, you can use sqldf or plyr to extract what you want from it.

like image 22
Eduardo Leoni Avatar answered Oct 08 '22 00:10

Eduardo Leoni


Another option would be to convert your data frame into a matrix of mode list - each element of the matrix would be a list. standard array operations (slicing with [, apply(), etc. would be applicable).

> d <- data.frame(id=c(1,2,3), num_tokens=c(2,1,4), token_lengths=as.array(list(c(5,5), 9, c(4,2,2,4,6))))
> m <- as.matrix(d)
> mode(m)
[1] "list"
> m[,"token_lengths"]
[[1]]
[1] 5 5

[[2]]
[1] 9

[[3]]
[1] 4 2 2 4 6

> m[3,]
$id
[1] 3

$num_tokens
[1] 4

$token_lengths
[1] 4 2 2 4 6
like image 29
hatmatrix Avatar answered Oct 08 '22 00:10

hatmatrix