Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to turn multiple columns into observations [duplicate]

I have a data frame like this:

structure(list(one = structure(1:4, .Label = c("a", "b", "c", 
"d"), class = "factor"), two = c(2, 4, 7, 3), x.1 = c("x1a", 
"x1b", "x1c", "x1d"), x.2 = c("x2a", "x2b", "x2c", "x2d"), x.3 = c("x3a", 
"x3b", "x3c", "x3d"), y.1 = c(NA, "y1b", "y1c", NA), y.2 = c(NA, 
"y2b", "y2c", NA), y.3 = c(NA, "y3b", "y3c", NA)), .Names = c("one", 
"two", "x.1", "x.2", "x.3", "y.1", "y.2", "y.3"), row.names = c(NA, 
-4L), class = "data.frame")

As you can see, the observations per event a, b, c, and d (variable "one") are stored as columns, where x and y define separate observations and 1, 2 and 3 define the variables. Variable "two" does not have a meaning here.

I like to reshape this data frame to have it tidy in the form that each observation has it's own row and each variable it's own column.

The final data frame should look like this:

structure(list(one = structure(c(1L, 2L, 2L, 3L, 3L, 4L), .Label = c("a", 
"b", "c", "d"), class = "factor"), two = c(2, 4, 2, 7, 5, 3), 
var1 = c("x1a", "x1b", "y1b", "x1c", "y1c", "x1d"), var2 = c("x2a", 
"x2b", "y2b", "x2c", "y2c", "x2d"), var3 = c("x3a", "x3b", 
"y3b", "x3c", "y3c", "x3d")), .Names = c("one", "two", "var1", 
"var2", "var3"), row.names = c(1L, 2L, 5L, 3L, 6L, 4L), class = "data.frame")

I am slightly familiar with what the cast and melt function from the reshape packages do, but was not able yet to figure out a way to reshape the DF in a smart way. For now the following provides the sate that I have gotten to:

df.between <- melt(df.in, id.vars=c("one", "two"))
df.between$variable <- gsub("x.|y.", "", df.between$variable)

Now the "variable" column does correctly identify the variable (1, 2 or 3). However, I was not able to cast this into the required form and this solution does not seem to be useful for larger sets of data due to the use of grepl.

Happy to get a nudge into the right direction here.

like image 924
fr3d-5 Avatar asked Dec 24 '22 14:12

fr3d-5


2 Answers

We can use melt from the devel version of data.table i.e. v1.9.5, which can handle multiple patterns for the measure variables.

library(data.table)
melt(setDT(df1), measure=patterns('.1', '.2', '.3'),
      na.rm=TRUE, value.name=paste0('var', 1:3))[, variable:=NULL][order(one)]
#   one two var1 var2 var3
#1:   a   2  x1a  x2a  x3a
#2:   b   4  x1b  x2b  x3b
#3:   b   4  y1b  y2b  y3b
#4:   c   7  x1c  x2c  x3c
#5:   c   7  y1c  y2c  y3c
#6:   d   3  x1d  x2d  x3d

EDIT: We don't need c inside the patterns and it will also give exact matches (from @Jaap's comments).

like image 130
akrun Avatar answered Jan 15 '23 02:01

akrun


melt from "data.table" will be much faster than the following, but you could also consider merged.stack from my "splitstackshape" package:

library(splitstackshape)
na.omit(merged.stack(mydf, var.stubs = c(".1", ".2", ".3"),
                     sep = "var.stubs", atStart = FALSE))

#    one two .time_1  .1  .2  .3
# 1:   a   2       x x1a x2a x3a
# 2:   b   4       x x1b x2b x3b
# 3:   b   4       y y1b y2b y3b
# 4:   c   7       x x1c x2c x3c
# 5:   c   7       y y1c y2c y3c
# 6:   d   3       x x1d x2d x3d
like image 37
A5C1D2H2I1M1N2O1R2T1 Avatar answered Jan 15 '23 02:01

A5C1D2H2I1M1N2O1R2T1