Data frame looks like this:
id pom.1 pom.2 pom.3 pom.4 pom.5 pom.6 pom.7 pom.8
20764422 1 3 <NA> <NA> <NA> <NA> <NA> <NA>
08049335 4 2 1 5 8 7 9 3
07668511 5 2 7 <NA> <NA> <NA> <NA> <NA>
20058102 7 4 2 <NA> <NA> <NA> <NA> <NA>
17318802 6 3 5 1 9 8 2 <NA>
where there is the list of 10 possible values that can be found in this data frame.
I need to create another dataframe which will have 10 columns, one for each value from the list and match it with original data frame.
New data frame should look like this:
id c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
20764422 y n y n n n n n n n
08049335 y y y y y n y y y n
07668511 n y n n y n y n n n
20058102 n y n y n n y n n n
17318802 y y y n y y n y y n
where each row (c1-c10) should match with one value from the list of values. Values "y" and "n" for each id mean that some value is/isn't present in original data frame.
Hopefully this exlanation is good enough to understand what needs to be done.
I tried to find an answer before posting, but either there was no answer or my search wasn't good enough. Anyway, sorry if I posted with answer already available here.
Thanks in advance!
If you can live with a binary 1 and 0 instead of a "y" and "n", you can try something like the following.
It helps if you provide a reproducible (dput
) or your data so that we know whether you're dealing with numeric, character, or factor variables.
library(data.table)
dcast(melt(as.data.table(mydf), "id"), id ~ value)
# Aggregate function missing, defaulting to 'length'
# id 1 2 3 4 5 6 7 8 9 NA
# 1: 7668511 0 1 0 0 1 0 1 0 0 5
# 2: 8049335 1 1 1 1 1 0 1 1 1 0
# 3: 17318802 1 1 1 0 1 1 0 1 1 1
# 4: 20058102 0 1 0 1 0 0 1 0 0 5
# 5: 20764422 1 0 1 0 0 0 0 0 0 6
If you really want to, you can do something like this:
dcast(melt(as.data.table(mydf), "id", na.rm = TRUE)[ ## melt and remove NA
, value := factor(value, 1:10)], ## factor value column
id ~ value, ## pivot value by id
fun.aggregate = function(x) ifelse(is.na(x), "n", "y"), ## get your "y" and "n"
fill = "n", drop = FALSE) ## don't drop missing factors
Which yields:
## id 1 2 3 4 5 6 7 8 9 10
## 1: 07668511 n y n n y n y n n n
## 2: 08049335 y y y y y n y y y n
## 3: 17318802 y y y n y y n y y n
## 4: 20058102 n y n y n n y n n n
## 5: 20764422 y n y n n n n n n n
Here's a "for fun" answer using tabulate
and chartr
:
temp <- `rownames<-`(t(apply(mydf[-1], 1, function(x) tabulate(x, nbins = 10))), mydf[[1]])
temp[] <- chartr("01", "ny", temp)
temp
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# 20764422 "y" "n" "y" "n" "n" "n" "n" "n" "n" "n"
# 08049335 "y" "y" "y" "y" "y" "n" "y" "y" "y" "n"
# 07668511 "n" "y" "n" "n" "y" "n" "y" "n" "n" "n"
# 20058102 "n" "y" "n" "y" "n" "n" "y" "n" "n" "n"
# 17318802 "y" "y" "y" "n" "y" "y" "n" "y" "y" "n"
Sample data used in this answer (not necessarily what you have):
mydf <- structure(list(id = c("20764422", "08049335", "07668511", "20058102",
"17318802"), pom.1 = c(1L, 4L, 5L, 7L, 6L), pom.2 = c(3L, 2L,
2L, 4L, 3L), pom.3 = c(NA, 1L, 7L, 2L, 5L), pom.4 = c(NA, 5L,
NA, NA, 1L), pom.5 = c(NA, 8L, NA, NA, 9L), pom.6 = c(NA, 7L,
NA, NA, 8L), pom.7 = c(NA, 9L, NA, NA, 2L), pom.8 = c(NA, 3L,
NA, NA, NA)), .Names = c("id", "pom.1", "pom.2", "pom.3", "pom.4",
"pom.5", "pom.6", "pom.7", "pom.8"), row.names = c(NA, 5L), class = "data.frame")
If for some reason you don't want to use any package and want to stick to basic data.frame, you could do this pretty basically :
# Creating some data that looks like yours
> df <- data.frame(matrix(c(101:105,sample(c(1:10,NA),40,replace=T)),5,9,dimnames=list(x=NULL,y=c("id",sapply(1:8,function(x) paste("pom",x))))))
> print(df)
id pom.1 pom.2 pom.3 pom.4 pom.5 pom.6 pom.7 pom.8
1 101 2 NA 7 NA 5 1 NA 2
2 102 7 4 8 2 1 5 NA 4
3 103 6 8 5 2 9 8 2 7
4 104 9 NA 4 5 3 9 7 9
5 105 1 7 6 2 3 4 5 5
# Creating the output
> ndf <- t(apply(df,1,function(l) sapply(1:10,function(x) ifelse(x %in% l, 'y', 'n'))))
> dimnames(ndf) <- list(as.character(101:105),as.character(1:10))
> print(ndf)
1 2 3 4 5 6 7 8 9 10
101 "y" "y" "n" "n" "y" "n" "y" "n" "n" "n"
102 "y" "y" "n" "y" "y" "n" "y" "y" "n" "n"
103 "n" "y" "n" "n" "y" "y" "y" "y" "y" "n"
104 "n" "n" "y" "y" "y" "n" "y" "n" "y" "n"
105 "y" "y" "y" "y" "y" "y" "y" "n" "n" "n"
Note that I wouldn't recommend this solution for large datasets. @Ananda Mahto's solutions are probably the most optimal.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With