Combining vectors of unequal length into a data frame

Tags:

r

I have a list of vectors which are time series of unequal length. My ultimate goal is to plot the time series in a ggplot2 graph. I guess I am better off first merging the vectors into a data frame (where the shorter vectors will be padded with NAs), also because I want to export the data in a tabular format such as .csv to be perused by other people.

I have a list that contains the names of all the vectors. It is fine that the column titles be set by the first vector, which is the longest. E.g.:

> mylist
[[1]]
[1] "vector1"

[[2]]
[1] "vector2"

[[3]]
[1] "vector3"

etc.

I know the way to go is to use Hadley's plyr package but I guess the problem is that my list contains the names of the vectors, not the vectors themselves, so if I type:

do.call(rbind, mylist)

I get a one-column matrix containing the names of the vectors I wanted to merge.

> do.call(rbind, actives)
      [,1]           
 [1,] "vector1" 
 [2,] "vector2" 
 [3,] "vector3" 
 [4,] "vector4" 
 [5,] "vector5" 
 [6,] "vector6" 
 [7,] "vector7" 
 [8,] "vector8" 
 [9,] "vector9" 
[10,] "vector10"

etc.

Even if I create a list with the objects themselves, I get an empty data frame:

mylist <- list(vector1, vector2)
mylist
[[1]]
        1         2         3         4         5         6         7         8         9        10        11        12 
0.1875000 0.2954545 0.3295455 0.2840909 0.3011364 0.3863636 0.3863636 0.3295455 0.2954545 0.3295455 0.3238636 0.2443182 
       13        14        15        16        17        18        19        20        21        22        23        24 
0.2386364 0.2386364 0.3238636 0.2784091 0.3181818 0.3238636 0.3693182 0.3579545 0.2954545 0.3125000 0.3068182 0.3125000 
       25        26        27        28        29        30        31        32        33        34        35        36 
0.2727273 0.2897727 0.2897727 0.2727273 0.2840909 0.3352273 0.3181818 0.3181818 0.3409091 0.3465909 0.3238636 0.3125000 
       37        38        39        40        41        42        43        44        45        46        47        48 
0.3125000 0.3068182 0.2897727 0.2727273 0.2840909 0.3011364 0.3181818 0.2329545 0.3068182 0.2386364 0.2556818 0.2215909 
       49        50        51        52        53        54        55        56        57        58        59        60 
0.2784091 0.2784091 0.2613636 0.2329545 0.2443182 0.2727273 0.2784091 0.2727273 0.2556818 0.2500000 0.2159091 0.2329545 
       61 
0.2556818 

[[2]]
        1         2         3         4         5         6         7         8         9        10        11        12 
0.2824427 0.3664122 0.3053435 0.3091603 0.3435115 0.3244275 0.3320611 0.3129771 0.3091603 0.3129771 0.2519084 0.2557252 
       13        14        15        16        17        18        19        20        21        22        23        24 
0.2595420 0.2671756 0.2748092 0.2633588 0.2862595 0.3549618 0.2786260 0.2633588 0.2938931 0.2900763 0.2480916 0.2748092 
       25        26        27        28        29        30        31        32        33        34        35        36 
0.2786260 0.2862595 0.2862595 0.2709924 0.2748092 0.3396947 0.2977099 0.2977099 0.2824427 0.3053435 0.3129771 0.2977099 
       37        38        39        40        41        42        43        44        45        46        47        48 
0.3320611 0.3053435 0.2709924 0.2671756 0.2786260 0.3015267 0.2824427 0.2786260 0.2595420 0.2595420 0.2442748 0.2099237 
       49        50        51        52        53        54        55        56        57        58        59        60 
0.2022901 0.2251908 0.2099237 0.2213740 0.2213740 0.2480916 0.2366412 0.2251908 0.2442748 0.2022901 0.1793893 0.2022901 

but

do.call(rbind.fill, mylist)
data frame with 0 columns and 0 rows

I have tried converting the vectors to data frames, but there is no cbind.fill function, so plyr complains that the data frames are of different lengths.

So my questions are:

  • Is this the best approach? Keep in mind that the goals are a) a ggplot2 graph and b) a table with the time series, to be viewed outside of R

  • What is the best way to get a list of objects starting with a list of the names of those objects?

  • What is the best type of graph to highlight the patterns of 60 time series? The scale is the same, but I predict there'll be a lot of overplotting. Since this is a cohort analysis, it might be useful to use color to highlight the different cohorts in terms of recency (as a continuous variable). But how to avoid overplotting? The differences will be minimal, so faceting might leave the viewer unable to grasp the difference.

Roberto asked Jul 29 '10 18:07


2 Answers

I think that you may be approaching this the wrong way:

If you have time series of unequal length then the absolute best thing to do is to keep them as time series and merge them. Most time series packages allow this. So you will end up with a multi-variate time series and each value will be properly associated with the same date.

So put your time series into zoo objects, merge them, then use my qplot.zoo function to plot them. That will deal with switching from zoo into a long data frame.

Here's an example:

> library(zoo)
> z1 <- zoo(1:8, 1:8)
> z2 <- zoo(2:8, 2:8)
> z3 <- zoo(4:8, 4:8)
> nm <- list("z1", "z2", "z3")
> z <- zoo()
> for(i in 1:length(nm)) z <- merge(z, get(nm[[i]]))
> names(z) <- unlist(nm)
> z
  z1 z2 z3
1  1 NA NA
2  2  2 NA
3  3  3 NA
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
> 
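> library(reshape2)  # assumed here for melt(); the original answer does not name the package
> library(ggplot2)
> x.df <- data.frame(dates=index(z), coredata(z))
> x.df <- melt(x.df, id.vars="dates", variable.name="val")
> ggplot(na.omit(x.df), aes(x=dates, y=value, group=val, colour=val)) + geom_line() + theme(legend.position = "none")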
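
The loop above fetches each series by name with get(). For plain numeric vectors (no zoo index), a rough base-R sketch of the same idea is to fetch them all at once with mget(), pad the shorter ones with NA, and write the wide table to .csv. The vector values, the period column, and the file name below are made up for illustration; only the names vector1, vector2, vector3 and mylist come from the question.

```r
# Hypothetical example vectors standing in for vector1, vector2, ...
vector1 <- c(0.19, 0.30, 0.33, 0.28)
vector2 <- c(0.28, 0.37, 0.31)
vector3 <- c(0.25, 0.26)

# A list of object names, as in the question
mylist <- list("vector1", "vector2", "vector3")

# mget() looks up several objects by name and returns a named list
series <- mget(unlist(mylist))

# Pad every vector with NA up to the longest length;
# `length<-` extends a vector and fills the new slots with NA
n <- max(lengths(series))
padded <- lapply(series, `length<-`, n)

# Equal-length columns can now be combined into a wide data frame
wide <- as.data.frame(padded)
wide$period <- seq_len(n)

# Write the table for use outside R (illustrative file name in a temp dir)
out_file <- file.path(tempdir(), "timeseries.csv")
write.csv(wide, out_file, row.names = FALSE, na = "")
```

This assumes the series all start at the same period, so padding at the end is the right alignment. If they start at different dates, the zoo merge shown above is the safer route.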
Shane answered Sep 19 '22 13:09


If you're doing this only because ggplot2 (like many other tools) prefers data frames, then what you're missing is that the data need to be in a long-format data frame. That is, you put all of your response values in one column, concatenated together, and add one or more other columns that identify which series each value belongs to. That's the best way to set the data up for tools like ggplot2.
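
As a minimal sketch of that long layout, assuming a named list of numeric vectors of unequal length (the cohort names and values here are invented for illustration), it could be built in base R like this:

```r
# Hypothetical cohort series of unequal length
series <- list(
  cohort1 = c(0.19, 0.30, 0.33, 0.28),
  cohort2 = c(0.28, 0.37, 0.31),
  cohort3 = c(0.25, 0.26)
)

# One row per observation: which series, which period, which value
long <- data.frame(
  series = rep(names(series), times = lengths(series)),
  period = unlist(lapply(series, seq_along), use.names = FALSE),
  value  = unlist(series, use.names = FALSE)
)

# Build the plot only if ggplot2 is installed; one line per series.
# No NA padding is needed because each row is a real observation.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = period, y = value, group = series, colour = series)) +
    geom_line()
  # print(p) draws the plot
}
```

The same data frame can be written out with write.csv() for people who want the numbers outside R, although a wide table may be easier for them to read.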

John answered Sep 18 '22 13:09