I have some mixed-type data that I would like to store in an R data structure of some sort. Each data point has a set of fixed attributes which may be 1-d numeric, factors, or characters, and also a set of variable length data. For example:
id  phrase                    num_tokens  token_lengths
1   "hello world"             2           5 5
2   "greetings"               1           9
3   "take me to your leader"  4           4 2 2 4 6
The actual values are not all computable from one another, but that's the flavor of the data. The operations I'm going to want to do include subsetting the data based on boolean functions (e.g. something like nchar(data$phrase) > 10 or lapply(data$token_lengths, length) > 2). I'd also like to index and average values in the variable length portion by index. This doesn't work, but something like: mean(data$token_lengths[1], na.rm=TRUE)
I've found I can shoehorn "token_lengths" into a data.frame by making it an array:
d <- data.frame(id=c(1,2,3), ..., token_lengths=as.array(list(c(5,5), 9, c(4,2,2,4,6))))
But is this the best way?
Trying to shoehorn the data into a data frame seems hackish to me. Far better to consider each row as an individual object, then think of the dataset as an array of these objects.
This function converts your data strings to an appropriate format. (This is S3 style code; you may prefer to use one of the 'proper' object oriented systems.)
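# S3 generic: dispatches on the class of x
as.mydata <- function(x)
{
  UseMethod("as.mydata")
}

# Method for character vectors: returns a list with one "mydata" object per phrase
as.mydata.character <- function(x)
{
  convert <- function(x)
  {
    md <- list()
    md$phrase <- x
    spl <- strsplit(x, " ")[[1]]    # split the phrase into tokens
    md$num_words <- length(spl)
    md$token_lengths <- nchar(spl)  # variable-length part
    class(md) <- "mydata"
    md
  }
  lapply(x, convert)
}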
Now your whole dataset looks like
mydataset <- as.mydata(c("hello world", "greetings", "take me to your leader"))
mydataset
[[1]]
$phrase
[1] "hello world"
$num_words
[1] 2
$token_lengths
[1] 5 5
attr(,"class")
[1] "mydata"
[[2]]
$phrase
[1] "greetings"
$num_words
[1] 1
$token_lengths
[1] 9
attr(,"class")
[1] "mydata"
[[3]]
$phrase
[1] "take me to your leader"
$num_words
[1] 5
$token_lengths
[1] 4 2 2 4 6
attr(,"class")
[1] "mydata"
You can define a print method to make this look prettier.
print.mydata <- function(x, ...)
{
  cat(x$phrase, "consists of", x$num_words, "words, with", paste(x$token_lengths, collapse=", "), "letters.\n")
  invisible(x)  # print methods conventionally return their argument invisibly
}
mydataset
[[1]]
hello world consists of 2 words, with 5, 5 letters.
[[2]]
greetings consists of 1 words, with 9 letters.
[[3]]
take me to your leader consists of 5 words, with 4, 2, 2, 4, 6 letters.
The sample operations you wanted to do are fairly straightforward with data in this format.
sapply(mydataset, function(x) nchar(x$phrase) > 10)
[1] TRUE FALSE TRUE
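The other two operations from the question (subsetting on the number of tokens, and averaging the variable-length part by position) can be done in much the same way. The following is only a sketch: it rebuilds the records as plain lists so that it runs on its own, and kth_token_length is a helper name I made up for illustration, not part of the code above.

```r
# Rebuild the records as plain lists so this block is self-contained;
# with the code above you would use `mydataset` instead of `records`.
phrases <- c("hello world", "greetings", "take me to your leader")
records <- lapply(phrases, function(p) {
  spl <- strsplit(p, " ")[[1]]
  list(phrase = p, num_words = length(spl), token_lengths = nchar(spl))
})

# Subset: keep only the records with more than two tokens.
many_tokens <- Filter(function(r) length(r$token_lengths) > 2, records)

# Hypothetical helper: length of the k-th token, NA if the record is shorter.
kth_token_length <- function(r, k) r$token_lengths[k]

# Average the first and the second token length across all records,
# ignoring records that have no token at that position.
mean_first  <- mean(sapply(records, kth_token_length, k = 1), na.rm = TRUE)
mean_second <- mean(sapply(records, kth_token_length, k = 2), na.rm = TRUE)
```

Indexing past the end of a vector gives NA rather than an error, which is why na.rm = TRUE is enough to handle the short records.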
I would just use the data in the "long" format.
E.g.
> d1 <- data.frame(id=1:3, num_words=c(2,1,4), phrase=c("hello world", "greetings", "take me to your leader"), stringsAsFactors=FALSE)
> d2 <- data.frame(id=c(rep(1,2), rep(2,1), rep(3,5)), token_length=c(5,5,9,4,2,2,4,6))
> d2$tokenid <- with(d2, ave(token_length, id, FUN=seq_along))
> d <- merge(d1,d2)
> subset(d, nchar(phrase) > 10)
  id num_words                 phrase token_length tokenid
1  1         2            hello world            5       1
2  1         2            hello world            5       2
4  3         4 take me to your leader            4       1
5  3         4 take me to your leader            2       2
6  3         4 take me to your leader            2       3
7  3         4 take me to your leader            4       4
8  3         4 take me to your leader            6       5
> with(d, tapply(token_length, id, mean))
  1   2   3
5.0 9.0 3.6
Once the data is in the long format, you can use sqldf or plyr to extract what you want from it.
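If you would rather not pull in an extra package, base R covers the operations in the question as well. This is a rough sketch that rebuilds the same long data frame; the names by_position and long_ids are just ones I picked for the example.

```r
# Same long-format data as above, rebuilt so the block runs on its own.
d1 <- data.frame(id = 1:3, num_words = c(2, 1, 4),
                 phrase = c("hello world", "greetings", "take me to your leader"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(id = c(rep(1, 2), rep(2, 1), rep(3, 5)),
                 token_length = c(5, 5, 9, 4, 2, 2, 4, 6))
d2$tokenid <- with(d2, ave(token_length, id, FUN = seq_along))
d <- merge(d1, d2)

# Average token length at each token position, across all phrases.
by_position <- aggregate(token_length ~ tokenid, data = d, FUN = mean)

# ids of the phrases that have more than two tokens (one row per token).
counts <- table(d$id)
long_ids <- names(counts)[counts > 2]
```

Because every token has its own row, "average by index" becomes an ordinary grouped mean on tokenid, and phrases that are too short simply contribute no row for that position.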
Another option would be to convert your data frame into a matrix of mode list, where each element of the matrix is a list. Standard array operations (slicing with [, apply(), etc.) would then be applicable.
> d <- data.frame(id=c(1,2,3), num_tokens=c(2,1,4), token_lengths=as.array(list(c(5,5), 9, c(4,2,2,4,6))))
> m <- as.matrix(d)
> mode(m)
[1] "list"
> m[,"token_lengths"]
[[1]]
[1] 5 5
[[2]]
[1] 9
[[3]]
[1] 4 2 2 4 6
> m[3,]
$id
[1] 3
$num_tokens
[1] 4
$token_lengths
[1] 4 2 2 4 6
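To give an idea of how the operations from the question might look on such a matrix, here is a small sketch. It builds the list matrix directly with matrix() instead of going through a data frame, purely so that the block is self-contained; the variable names are illustrative.

```r
# A 3 x 3 matrix of mode list, filled column by column.
m <- matrix(
  list(1, 2, 3,                        # id
       2, 1, 4,                        # num_tokens
       c(5, 5), 9, c(4, 2, 2, 4, 6)),  # token_lengths
  nrow = 3,
  dimnames = list(NULL, c("id", "num_tokens", "token_lengths"))
)

# Number of tokens per row, taken from the variable-length column.
n_tokens <- lengths(m[, "token_lengths"])

# Subset: rows with more than two tokens (drop = FALSE keeps the matrix shape).
long_rows <- m[n_tokens > 2, , drop = FALSE]

# Per-row mean of the token lengths.
row_means <- sapply(m[, "token_lengths"], mean)

# First token length of every row, e.g. to average by position.
first_token <- sapply(m[, "token_lengths"], `[`, 1)
```

Note that extracting a single cell needs double brackets, e.g. m[[3, "token_lengths"]], since single brackets return a list of length one.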