Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert R dataframe from long to wide format, but with unequal group sizes, for use with qcc

Tags:

r

reshape

qcc

I would like to convert a dataframe from long format to a wide format, but with unequal group sizes.

The eventual use will be in 'qcc', which requires a data frame or a matrix with each row consisting of one group, using NA's in groups which have fewer samples.

The following code will create an example dataset, as well as show manual conversion to the desired format.

# This is an example of the initial data that I have
# * 10 sample measurements, over 3 groups with 3, 2, and 5 elements respectively
x <- rnorm(10)
x_df <- data.frame( time = c( rep('2001 Q1',3), rep('2001 Q2',2), rep('2001 Q3',5) ), measure = x )
x_df

# This is a manual conversion into the desired format
x_pad <- c( x[1:3], NA, NA, x[4:5], NA, NA, NA, x[6:10] )
x_matrix <- matrix( x_pad, nrow = 3, ncol = 5, byrow = TRUE, dimnames = list(c('2001 Q1','2001 Q2','2001 Q3')) )
x_matrix # desired format

# An example of how it will be used
library(qcc)
plot(qcc(x_matrix, type = 'xbar', plot = FALSE))

So, I'd like to convert this:

      time     measure
1  2001 Q1  0.14680685
2  2001 Q1  0.53593193
3  2001 Q1  0.56097974
4  2001 Q2 -1.48102689
5  2001 Q2  0.18150972
6  2001 Q3  1.72018147
7  2001 Q3 -0.08480855
8  2001 Q3 -2.23208877
9  2001 Q3 -1.15269107
10 2001 Q3  0.57975023

... to this ...

              [,1]        [,2]       [,3]      [,4]      [,5]
2001 Q1  0.1468068  0.53593193  0.5609797        NA        NA
2001 Q2 -1.4810269  0.18150972         NA        NA        NA
2001 Q3  1.7201815 -0.08480855 -2.2320888 -1.152691 0.5797502

There is probably an easy way (perhaps some usage of reshape or reshape2 casting that I'm not familiar with?), but a bunch of searching hasn't helped me so far.

Thanks for any help!

==========

From one of the solutions below, the following will generate the final qcc xbar plot, including group labels:

library(splitstackshape)
out_df <- dcast( getanID( x_df, 'time' ), time~.id, value.var='measure' )
qcc( out_df[,-1], type = 'xbar', labels = out_df[,1] )
like image 334
jhchou Avatar asked Jul 29 '15 12:07

jhchou


3 Answers

You can create a sequence column ('.id') using getanID from splitstackshape and use dcast from data.table to convert the long format to wide format. The output of splitstackshape is a data.table. When we load splitstackshape, data.table will also be loaded. So, if you already have the devel version of data.table, then the dcast from data.table can be used as well.

library(splitstackshape)
dcast(getanID(df1, 'time'), time~.id, value.var='measure')
#     time          1           2          3         4         5
#1: 2001 Q1  0.1468068  0.53593193  0.5609797        NA        NA
#2: 2001 Q2 -1.4810269  0.18150972         NA        NA        NA
#3: 2001 Q3  1.7201815 -0.08480855 -2.2320888 -1.152691 0.5797502

Update

As @snoram mentioned in the comments, function rowid from data.table makes it easier to use just data.table alone

library(data.table)
dcast(setDT(df1), time ~ rowid(time), value.var = "measure")
like image 191
akrun Avatar answered Sep 28 '22 02:09

akrun


You'll need an intermediate variable that gives a "within-time" id. You can create it and reshape like this

library(tidyr)
library(dplyr)

group_by(X, time) %>%
  mutate(seq = 1:n()) %>%
  ungroup() %>%
  spread(seq, measure)
like image 25
Benjamin Avatar answered Sep 28 '22 00:09

Benjamin


Another splitstackshape approach

cSplit(setDT(df)[, toString(measure), by='time'], 'V1', ',')

#      time       V1_1        V1_2       V1_3      V1_4      V1_5
#1: 2001 Q1  0.1468068  0.53593193  0.5609797        NA        NA
#2: 2001 Q2 -1.4810269  0.18150972         NA        NA        NA
#3: 2001 Q3  1.7201815 -0.08480855 -2.2320888 -1.152691 0.5797502

Or using the devel version of data.table a similar approach after pasting together the 'measure' by the grouping column 'time' would be using tstrsplit to split the 'V1' column generated from toString(measure).

 setDT(df)[, toString(measure), by ='time'][, c(list(time), tstrsplit(V1, ', '))]

Also, we can add type.convert=TRUE in tstrsplit to convert the class of the split columns. By default it is FALSE.

like image 34
Veerendra Gadekar Avatar answered Sep 28 '22 01:09

Veerendra Gadekar