I have several dataframes, each with more than 250 variables. A partial dput
from the first dataframe:
df <- structure(list(id = structure(1:6, .Label = c("00", "01", "02", "03", "04", "05", "06", "08", "09", "10", "11", "12", "13", "14", "15", "All Recordings"), class = "factor"), Geslacht = structure(c(2L, 2L, 3L, 2L, 2L, 3L), .Label = c("-", "Man", "Vrouw"), class = "factor"), Leeftijd = structure(c(2L, 3L, 3L, 4L, 4L, 4L), .Label = c("-", "13", "14", "15", "17"), class = "factor"), FD.1.01 = c(13.96, 3.46, 2.45, 4.65, 1.18, 0.76), FD.1.02 = c(2.79, 4.32, 5.28, 0.78, 4.03, 0.74), FD.1.03 = c(2.09, 2.96, 5.78, 0.52, 1.12, 0), FD.1.04 = c(0, 2.79, 1.65, 0, 2.11, 2.11), FD.1.05 = c(1.26, 0.96, 8.67, 0.34, 1.77, 2.25), FD.1.06 = c(7.27, 0.58, 12.04, 0, 0.84, 3.39), FD.1.07 = c(3.97, 0.16, 8.37, 0.92, 0, 4.05), FD.1.08 = c(4.45, 0, 4.23, 0, 0, 1.63), FD.1.09 = c(0, 0, 2.07, 0, 0, 0.46), FD.1.10 = c(0, 0, 1.87, 0, 0, 0.42), FD.1.11 = c(0, 0, 9.05, 0, 0, 0), FD.1.12 = c(0, 0, 0, 0, 0, 0), FD.1.13 = c(0, 0, 0, 0, 0, 0), FD.1.14 = c(0, 0, 0, 0, 0, 0), FD.1.15 = c(0, 0, 0, 0, 0, 0), FD.1.16 = c(0, 0, 0, 0, 0, 0), FD.1.17 = c(0, 0, 0, 0, 0, 0), FD.1.18 = c(0, 0, 0, 0, 0, 0), FD.1.19 = c(0, 0, 0, 0, 0, 0), FD.1.20 = c(0, 0, 0, 0, 0, 0), FD.1.21 = c(0, 0, 0, 0, 0, 0), FD.1.22 = c(0, 0, 0, 0, 0, 0)), .Names = c("id", "Geslacht", "Leeftijd", "FD.1.01", "FD.1.02", "FD.1.03", "FD.1.04", "FD.1.05", "FD.1.06", "FD.1.07", "FD.1.08", "FD.1.09", "FD.1.10", "FD.1.11", "FD.1.12", "FD.1.13", "FD.1.14", "FD.1.15", "FD.1.16", "FD.1.17", "FD.1.18", "FD.1.19", "FD.1.20", "FD.1.21", "FD.1.22"), row.names = c(1L, 2L, 3L, 4L, 5L, 7L), class = "data.frame")
I want to reshape my dataframe from wide to long in the following way:
library(reshape2)
melted.df <- melt(df, id=c("id","Geslacht","Leeftijd"), measure.vars=c("all variables starting with FD"))
However, I do not know exactly which column names start with FD
. Moreover this number varies for the several dataframes and I have to do that for variables starting with other letter combinations as well.
Of course I can do this manually, but that takes more time and is prone to errors. A programmatic solution is therefore highly preferable.
Any suggestions how to approach this?
In order to be able to check where the problem arises with @akrun's dplyr
solution, a dput
of the first 6 rows and all columns:
df <- structure(list(id = structure(1:6, .Label = c("00", "01", "02", "03", "04", "05", "06", "08", "09", "10", "11", "12", "13", "14", "15", "All Recordings"), class = "factor"), Geslacht = structure(c(2L, 2L, 3L, 2L, 2L, 3L), .Label = c("-", "Man", "Vrouw"), class = "factor"), Leeftijd = structure(c(2L, 3L, 3L, 4L, 4L, 4L), .Label = c("-", "13", "14", "15", "17"), class = "factor"), FD.1.01 = c(13.96, 3.46, 2.45, 4.65, 1.18, 0.76), FD.1.02 = c(2.79, 4.32, 5.28, 0.78, 4.03, 0.74), FD.1.03 = c(2.09, 2.96, 5.78, 0.52, 1.12, 0), FD.1.04 = c(0, 2.79, 1.65, 0, 2.11, 2.11), FD.1.05 = c(1.26, 0.96, 8.67, 0.34, 1.77, 2.25), FD.1.06 = c(7.27, 0.58, 12.04, 0, 0.84, 3.39), FD.1.07 = c(3.97, 0.16, 8.37, 0.92, 0, 4.05), FD.1.08 = c(4.45, 0, 4.23, 0, 0, 1.63), FD.1.09 = c(0, 0, 2.07, 0, 0, 0.46), FD.1.10 = c(0, 0, 1.87, 0, 0, 0.42), FD.1.11 = c(0, 0, 9.05, 0, 0, 0), FD.1.12 = c(0, 0, 0, 0, 0, 0), FD.1.13 = c(0, 0, 0, 0, 0, 0), FD.1.14 = c(0, 0, 0, 0, 0, 0), FD.1.15 = c(0, 0, 0, 0, 0, 0), FD.1.16 = c(0, 0, 0, 0, 0, 0), FD.1.17 = c(0, 0, 0, 0, 0, 0), FD.1.18 = c(0, 0, 0, 0, 0, 0), FD.1.19 = c(0, 0, 0, 0, 0, 0), FD.1.20 = c(0, 0, 0, 0, 0, 0), FD.1.21 = c(0, 0, 0, 0, 0, 0), FD.1.22 = c(0, 0, 0, 0, 0, 0), Click.5.18 = c(0, 0, 0, 0, 0, 0), Click.5.19 = c(0, 0, 0, 0, 0, 0), Click.5.20 = c(0, 0, 0, 0, 0, 0), Click.5.21 = c(0, 0, 0, 0, 0, 0), Click.6.01 = c(0, 0, 0, 0, 0, 0), Click.6.02 = c(0, 0, 0, 0, 0, 0), Click.6.03 = c(0, 0, 0, 0, 0, 0), Click.6.04 = c(0, 0, 0, 0, 0, 0), Click.6.05 = c(0, 0, 0, 0, 0, 0), Click.6.06 = c(0, 0, 0, 0, 0, 0), Click.6.07 = c(0, 0, 0, 0, 0, 0), Click.6.08 = c(0, 0, 0, 0, 0, 0), Click.6.12 = c(0, 0, 0, 0, 0, 0), Click.6.13 = c(0, 0, 0, 0, 0, 0), Click.6.14 = c(0, 0, 0, 0, 0, 0), Click.6.15 = c(0, 0, 0, 0, 0, 0), Click.6.16 = c(0, 0, 0, 0, 0, 0), Click.6.17 = c(0, 0, 0, 0, 0, 0), Click.6.18 = c(0, 0, 0, 0, 0, 0), Click.6.19 = c(0, 0, 0, 0, 0, 0), Click.6.20 = c(0, 0, 0, 0, 0, 0), Click.6.21 = c(0, 0, 0, 0, 0, 0), Click.7.01 = c(0, 0, 0, 0, 0, 0), Click.7.02 = c(0, 0, 0, 0, 0, 0), Click.7.03 = c(0, 0, 0, 0, 1, 0), Click.7.04 = c(0, 0, 0, 0, 0, 0)), .Names = c("id", "Geslacht", "Leeftijd", "FD.1.01", "FD.1.02", "FD.1.03", "FD.1.04", "FD.1.05", "FD.1.06", "FD.1.07", "FD.1.08", "FD.1.09", "FD.1.10", "FD.1.11", "FD.1.12", "FD.1.13", "FD.1.14", "FD.1.15", "FD.1.16", "FD.1.17", "FD.1.18", "FD.1.19", "FD.1.20", "FD.1.21", "FD.1.22", "Click.5.18", "Click.5.19", "Click.5.20", "Click.5.21", "Click.6.01", "Click.6.02", "Click.6.03", "Click.6.04", "Click.6.05", "Click.6.06", "Click.6.07", "Click.6.08", "Click.6.12", "Click.6.13", "Click.6.14", "Click.6.15", "Click.6.16", "Click.6.17", "Click.6.18", "Click.6.19", "Click.6.20", "Click.6.21", "Click.7.01", "Click.7.02", "Click.7.03", "Click.7.04"), row.names = c(1L, 2L, 3L, 4L, 5L, 7L), class = "data.frame")
Try:
melt(df, id=c("id","Geslacht", "Leeftijd"),
measure.vars=grep("^FD", colnames(df)))
or
library(dplyr)
library(tidyr)
df %>% gather(FD, Score, FD.1.01:FD.1.22)
This also worked in the example you provided:
df %>%
gather(FD, Score, grep("^FD", colnames(df))) %>%
head()
id Geslacht Leeftijd FD Score
1 00 Man 13 FD.1.01 13.96
2 01 Man 14 FD.1.01 3.46
3 02 Vrouw 14 FD.1.01 2.45
4 03 Man 15 FD.1.01 4.65
5 04 Man 15 FD.1.01 1.18
6 05 Vrouw 15 FD.1.01 0.76
On a larger dataset,
newCols <- simplify2array(replicate(100,df[,-(1:3)]))
colnames(newCols) <- paste0("FD.1.", 23:2222)
df1 <- cbind(df, newCols)
df2 <- df1 %>%
gather(FD, Score, grep("^FD", colnames(df1)))
dim(df2)
#[1] 13332 5
Using the new example dataset with different colnames.
res1 <- df %>%
select(id, Geslacht, Leeftijd, grep("^FD",names(df))) %>%
gather(FD, Score, grep("^FD",names(df)))
Comparing the results with ?melt()
res2 <- melt(df, id=c("id","Geslacht", "Leeftijd"),
measure.vars=grep("^FD", colnames(df)))
colnames(res2) <- colnames(res1)
identical(res1,res2)
#[1] TRUE
grep
FD in colnames:
melted.df <- melt(df, id=c("id","Geslacht","Leeftijd"),
measure.vars=colnames(df)[grepl("^FD",colnames(df))])
Another and quite straightforward solution for this problem with dplyr
and tidyr
is:
melted.df <- df %>%
select(id, Geslacht, Leeftijd, starts_with("FD")) %>%
gather(FD, Score, starts_with("FD"))
or with data.table
:
melted.df <- melt(df, id = c("id","Geslacht","Leeftijd"),
measure.vars = patterns("^FD"))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With