Reshape R data with user entries in rows, collapsing for each user

Tags: r, reshape

Pardon my newness to the R world, and thank you kindly in advance for your help.

I would like to analyze the data from an experiment.

The data comes in long format and needs to be reshaped into wide format, but I cannot figure out exactly how to do it. Most of the examples for melt/cast and reshape deal with much simpler data frames.

Each time a subject answers a question in the experiment, their userid, location, age, and gender are recorded in a single row, and the item answered and the response are entered next to those variables. Here's the thing: subjects may answer any number of questions, and they may answer different items (it is quite complicated, but it must be this way).

The raw data looks something like this:

User_id, location, age, gender, Item, Resp
1, CA, 22, M, A, 1
1, CA, 22, M, B, -1
1, CA, 22, M, C, -1
1, CA, 22, M, D, 1
1, CA, 22, M, E, -1
2, MD, 27, F, A, -1
2, MD, 27, F, B, 1
2, MD, 27, F, C, 1
2, MD, 27, F, E, 1
2, MD, 27, F, G, -1
2, MD, 27, F, H, -1

I would like to reshape this data to have each user be on a single row, to look like this:

User_id, location, age, gender, A, B, C, D, E, F, G, H
1, CA, 22, M, 1, -1, -1, 1, -1, 0, 0, 0
2, MD, 27, F, -1, 1, 1, 0, 1, 0, -1, -1

I think this is just a matter of finding the right reshape call, but I've been at it for a couple of hours and I can't quite get the result to look the way I want. Most of the examples do not have the repeated demographic data, and thus can be rotated more simply. Very sorry if I have overlooked something simple.

GFoMoFo asked Aug 17 '15 22:08


3 Answers

Using data.table, and assuming dt is your data as a data.table (e.g. dt <- as.data.table(df)), you can do:

> library(data.table)
> dcast(dt, User_id + location + age ~ Item, value.var = "Resp", fill = 0L)
   User_id location age  A  B  C  D  E  G  H
1:       1       CA  22  1 -1 -1  1 -1  0  0
2:       2       MD  27 -1  1  1  0  1 -1 -1
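
The formula above leaves gender off the left-hand side, so that column does not appear in the result. Here is a minimal sketch that keeps it, assuming the data starts out as a plain data.frame. The df below is a hand-typed copy of the question's sample, and wide is just an illustrative name:

```r
library(data.table)

# Hand-typed copy of the sample data from the question (assumption:
# no stray whitespace in the character columns).
df <- data.frame(
  User_id  = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  location = rep(c("CA", "MD"), times = c(5, 6)),
  age      = rep(c(22, 27), times = c(5, 6)),
  gender   = rep(c("M", "F"), times = c(5, 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1),
  stringsAsFactors = FALSE
)

dt <- as.data.table(df)

# Every id column goes on the left of ~; the column whose values become
# the new column names goes on the right. Unanswered items are filled with 0.
wide <- dcast(dt, User_id + location + age + gender ~ Item,
              value.var = "Resp", fill = 0)
print(wide)
```

Items that nobody answered (F in the sample) still get no column, since dcast only creates columns for values that occur in Item.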
MichaelChirico answered Sep 29 '22 14:09


There’s a package called tidyr that makes melting and reshaping data frames much easier. In your case, you can use tidyr::spread straightforwardly:

library(tidyr)
result = spread(df, Item, Resp)

This will however fill missing entries with NA:

  User_id location age gender  A  B  C  D  E  G  H
1       1       CA  22      M  1 -1 -1  1 -1 NA NA
2       2       MD  27      F -1  1  1 NA  1 -1 -1

You can fix this by replacing them:

result[is.na(result)] = 0
result
#   User_id location age gender  A  B  C  D  E  G  H
# 1       1       CA  22      M  1 -1 -1  1 -1  0  0
# 2       2       MD  27      F -1  1  1  0  1 -1 -1

… or by using the fill argument:

result = spread(df, Item, Resp, fill = 0)

For completeness’ sake, the other way round (i.e. going back to long format, where the filled-in zeros reappear as rows) works via gather (this is usually known as “melting”):

gather(result, Item, Resp, A : H)

The last argument here tells gather which columns to gather (and it supports the concise range syntax).
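
In tidyr 1.0.0 and later, spread and gather are superseded by pivot_wider and pivot_longer. Here is a rough equivalent of the calls above, as a sketch rather than a drop-in replacement. The df is a hand-typed copy of the question's sample, and wide and long are illustrative names:

```r
library(tidyr)

# Hand-typed copy of the sample data from the question (assumption:
# no stray whitespace in the character columns).
df <- data.frame(
  User_id  = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  location = rep(c("CA", "MD"), times = c(5, 6)),
  age      = rep(c(22, 27), times = c(5, 6)),
  gender   = rep(c("M", "F"), times = c(5, 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1),
  stringsAsFactors = FALSE
)

# Long -> wide: one column per Item, unanswered items filled with 0.
# The named-list form of values_fill is used because it is accepted
# across tidyr 1.x releases.
wide <- pivot_wider(df, names_from = Item, values_from = Resp,
                    values_fill = list(Resp = 0))
print(wide)

# Wide -> long again. The filled-in zeros come back as rows, so this has
# one row per user and item column, not the original 11 rows.
long <- pivot_longer(wide, cols = A:H, names_to = "Item", values_to = "Resp")
print(long)
```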

Konrad Rudolph answered Sep 29 '22 15:09


Here's the always elegant stats::reshape version:

(newdf <- reshape(df, direction = "wide", timevar = "Item", idvar = names(df)[1:4]))
#   User_id location age gender Resp. A Resp. B Resp. C Resp. D Resp. E Resp. G Resp. H
# 1       1       CA  22      M       1      -1      -1       1      -1      NA      NA
# 6       2       MD  27      F      -1       1       1      NA       1      -1      -1

Missing values get filled with NA in reshape(), and the names are not what we want. So we'll need to do a bit more work. Here we can change the names and replace the NAs with zero in the same line to arrive at your desired result.

replace(setNames(newdf, sub(".* ", "", names(newdf))), is.na(newdf), 0)
#   User_id location age gender  A  B  C D  E  G  H
# 1       1       CA  22      M  1 -1 -1 1 -1  0  0
# 6       2       MD  27      F -1  1  1 0  1 -1 -1

Of course, the code would definitely be more legible if we broke this up into two separate lines. Also, note that there is no F in Item in your original data, hence the difference in output from yours.
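
As a sketch of that more legible form, here is the same idea split into separate steps, plus an optional step that adds the all-zero F column the question asked for. The df is a hand-typed copy of the sample without the leading spaces, and id_cols and all_items are illustrative names, not anything reshape() requires:

```r
# Hand-typed copy of the sample data from the question (assumption:
# no stray whitespace in the character columns).
df <- data.frame(
  User_id  = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  location = rep(c("CA", "MD"), times = c(5, 6)),
  age      = rep(c(22, 27), times = c(5, 6)),
  gender   = rep(c("M", "F"), times = c(5, 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1),
  stringsAsFactors = FALSE
)
id_cols <- c("User_id", "location", "age", "gender")

newdf <- reshape(df, direction = "wide", timevar = "Item", idvar = id_cols)

# Step 1: strip the "Resp." prefix (and any whitespace after it).
names(newdf) <- sub("^Resp\\.[[:space:]]*", "", names(newdf))

# Step 2: replace the NAs left by reshape() with zero.
newdf[is.na(newdf)] <- 0

# Optional: add an all-zero column for each item nobody answered (here F),
# then put the item columns in alphabetical order.
all_items <- LETTERS[1:8]
for (item in setdiff(all_items, names(newdf))) newdf[[item]] <- 0
newdf <- newdf[c(id_cols, all_items)]
print(newdf)
```

The list of expected items (A through H) has to come from you, since nothing in the data says that F exists.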

Data:

df <- structure(list(User_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L), location = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L), .Label = c(" CA", " MD"), class = "factor"), age = c(22L, 
22L, 22L, 22L, 22L, 27L, 27L, 27L, 27L, 27L, 27L), gender = structure(c(2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c(" F", " M"
), class = "factor"), Item = structure(c(1L, 2L, 3L, 4L, 5L, 
1L, 2L, 3L, 5L, 6L, 7L), .Label = c(" A", " B", " C", " D", " E", 
" G", " H"), class = "factor"), Resp = c(1, -1, -1, 1, -1, -1, 
1, 1, 1, -1, -1)), .Names = c("User_id", "location", "age", "gender", 
"Item", "Resp"), class = "data.frame", row.names = c(NA, -11L
))
Rich Scriven answered Sep 29 '22 14:09