Pardon my newness to the R world, and thank you in advance for your help.
I would like to analyze the data from an experiment.
The data arrives in long format and needs to be reshaped into wide format, but I cannot figure out exactly how to do it. Most of the examples for melt/cast and reshape deal with much simpler data frames.
Each time a subject answers a question in the experiment, his userid, location, age, and gender are recorded in a single row, and his response to that question is entered next to those variables. The complication is that subjects may answer any number of questions, and they may answer different items (it is quite complicated, but it must be this way).
The raw data looks something like this:
User_id, location, age, gender, Item, Resp
1, CA, 22, M, A, 1
1, CA, 22, M, B, -1
1, CA, 22, M, C, -1
1, CA, 22, M, D, 1
1, CA, 22, M, E,-1
2, MD, 27, F, A, -1
2, MD, 27, F, B, 1
2, MD, 27, F, C, 1
2, MD, 27, F, E, 1
2, MD, 27, F, G, -1
2, MD, 27, F, H, -1
I would like to reshape this data to have each user be on a single row, to look like this:
User_id, location, age, gender, A, B, C, D, E, F, G, H
1, CA, 22, M, 1, -1, -1, 1, -1, 0, 0, 0
2, MD, 27, F, -1, 1, 1, 0, 1, 0, -1, -1
I think this is just a matter of finding the right reshape call, but I've been at it for a couple of hours and can't quite get the result to look the way I want, since most of the examples do not have the repeated demographic data and can therefore be rotated more simply. Very sorry if I have overlooked something simple.
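Using data.table you can do:
library(data.table)
dt <- as.data.table(df)  # df is the data.frame from the question
# keep every demographic column on the left-hand side of the formula
dcast(dt, User_id + location + age + gender ~ Item, value.var = "Resp", fill = 0)
#    User_id location age gender  A  B  C D  E  G  H
# 1:       1       CA  22      M  1 -1 -1 1 -1  0  0
# 2:       2       MD  27      F -1  1  1 0  1 -1 -1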
There’s a package called tidyr that makes melting and reshaping data frames much easier. In your case, you can use tidyr::spread directly:
library(tidyr)
result = spread(df, Item, Resp)
This will, however, fill missing entries with NA:
User_id location age gender A B C D E G H
1 1 CA 22 M 1 -1 -1 1 -1 NA NA
2 2 MD 27 F -1 1 1 NA 1 -1 -1
You can fix this by replacing them:
result[is.na(result)] = 0
result
# User_id location age gender A B C D E G H
# 1 1 CA 22 M 1 -1 -1 1 -1 0 0
# 2 2 MD 27 F -1 1 1 0 1 -1 -1
… or by using the fill argument:
result = spread(df, Item, Resp, fill = 0)
For completeness’ sake, the other way round (i.e. reproducing the original data.frame) works via gather (this is usually known as “melting”):
gather(result, Item, Resp, A : H)
The last argument here tells gather which columns to gather (and it supports the concise range syntax).
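In tidyr 1.0.0 and later, spread and gather are marked as superseded in favour of pivot_wider and pivot_longer. Here is a minimal sketch of the same reshape with the newer functions, assuming a recent tidyr is installed. The data frame is rebuilt inline (without the stray leading spaces) only so that the snippet runs on its own:

```r
library(tidyr)

# Cleaned copy of the sample data from the question (an assumption for
# illustration; in practice use your own data frame).
df <- data.frame(
  User_id  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  location = c(rep("CA", 5), rep("MD", 6)),
  age      = c(rep(22L, 5), rep(27L, 6)),
  gender   = c(rep("M", 5), rep("F", 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1)
)

# Long to wide: one column per Item, unanswered items filled with 0.
wide <- pivot_wider(df, names_from = Item, values_from = Resp, values_fill = 0)

# Wide back to long: every user/item pair becomes a row again.
long <- pivot_longer(wide, A:H, names_to = "Item", values_to = "Resp")
```

Note that the round trip does not recover the original rows exactly: because the gaps were filled with 0, the long result has one row per user and item, including the items a user never answered.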
Here's the always elegant stats::reshape version:
(newdf <- reshape(df, direction = "wide", timevar = "Item", idvar = names(df)[1:4]))
# User_id location age gender Resp. A Resp. B Resp. C Resp. D Resp. E Resp. G Resp. H
# 1 1 CA 22 M 1 -1 -1 1 -1 NA NA
# 6 2 MD 27 F -1 1 1 NA 1 -1 -1
Missing values get filled with NA in reshape(), and the names are not what we want, so we'll need to do a bit more work. Here we can change the names and replace the NAs with zero in the same line to arrive at your desired result.
replace(setNames(newdf, sub(".* ", "", names(newdf))), is.na(newdf), 0)
# User_id location age gender A B C D E G H
# 1 1 CA 22 M 1 -1 -1 1 -1 0 0
# 6 2 MD 27 F -1 1 1 0 1 -1 -1
Of course, the code would definitely be more legible if we broke this up into two separate lines. Also, note that there is no F in Item in your original data, hence the difference in output from yours.
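A sketch of that more legible, step-by-step version in base R follows. It is not the exact code above: the inline data frame is a cleaned copy of the sample (no leading spaces), and the full item set A to H is an assumption taken from the header of the desired output in the question. The last step adds a zero column for any item nobody answered, which is how F can appear in the result:

```r
# Cleaned copy of the sample data (assumption for illustration).
df <- data.frame(
  User_id  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  location = c(rep("CA", 5), rep("MD", 6)),
  age      = c(rep(22L, 5), rep(27L, 6)),
  gender   = c(rep("M", 5), rep("F", 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1)
)

id_cols   <- c("User_id", "location", "age", "gender")
all_items <- LETTERS[1:8]  # assumed full item set: A through H

# Step 1: long to wide; columns come out as Resp.A, Resp.B, ...
wide <- reshape(df, direction = "wide", timevar = "Item", idvar = id_cols)

# Step 2: drop the "Resp." prefix from the new column names.
names(wide) <- sub("^Resp\\.", "", names(wide))

# Step 3: unanswered items are NA after reshape(); recode them as 0.
wide[is.na(wide)] <- 0

# Step 4: add a zero column for items that never occur (here F),
# then put the columns in a fixed order.
missing_items <- setdiff(all_items, names(wide))
wide[missing_items] <- 0
wide <- wide[c(id_cols, all_items)]
```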
Data:
df <- structure(list(User_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), location = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c(" CA", " MD"), class = "factor"), age = c(22L,
22L, 22L, 22L, 22L, 27L, 27L, 27L, 27L, 27L, 27L), gender = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c(" F", " M"
), class = "factor"), Item = structure(c(1L, 2L, 3L, 4L, 5L,
1L, 2L, 3L, 5L, 6L, 7L), .Label = c(" A", " B", " C", " D", " E",
" G", " H"), class = "factor"), Resp = c(1, -1, -1, 1, -1, -1,
1, 1, 1, -1, -1)), .Names = c("User_id", "location", "age", "gender",
"Item", "Resp"), class = "data.frame", row.names = c(NA, -11L
))
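The leading spaces in the factor labels above (" CA", " A", and so on) come from reading the comma-plus-space data as is, and they are why reshape() produced names such as "Resp. A". Here is a minimal sketch of reading the sample with strip.white = TRUE so the values come in clean. The inline text stands in for the real file, which is not named in the question:

```r
# Sample rows copied from the question; in practice pass a file path
# to read.csv() instead of the text argument.
raw <- "User_id, location, age, gender, Item, Resp
1, CA, 22, M, A, 1
1, CA, 22, M, B, -1
1, CA, 22, M, C, -1
1, CA, 22, M, D, 1
1, CA, 22, M, E,-1
2, MD, 27, F, A, -1
2, MD, 27, F, B, 1
2, MD, 27, F, C, 1
2, MD, 27, F, E, 1
2, MD, 27, F, G, -1
2, MD, 27, F, H, -1"

# strip.white = TRUE removes the blanks around each comma-separated field.
df <- read.csv(text = raw, strip.white = TRUE, stringsAsFactors = FALSE)
str(df)
```

With the data read this way, the renaming step after reshape() only has to remove a "Resp." prefix rather than everything up to a space.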