
Transform data frame

Tags: dataframe, r

I have a questionnaire with an open-ended question like "Please name up to ten animals", which gives me the following data frame (where each letter stands for an animal):

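n_rows <- 1000
rows <- vector("list", n_rows)

for (i in seq_len(n_rows)) {
  # Pad the pool of letters with 1 to 10 NAs so that some answers are missing
  na <- rep(NA, sample(1:10, 1))
  rows[[i]] <- sample(c(letters, na), 10, replace = FALSE)
}

# Stack the per-respondent vectors into a data frame with columns V1..V10
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)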

head(df)
# V1   V2 V3 V4   V5 V6   V7 V8 V9  V10
# 1  r <NA>  a  j    w  e    i  h  u    z
# 2  t    o  e  x    d  v <NA>  z  n    c
# 3  f    y  e  s    n  c    z  i  u    k
# 4  y <NA>  v  j    h  z    p  i  c    q
# 5  w    s  v  f <NA>  c    g  b  x    e
# 6  p <NA>  a  h    v  x    k  z  o <NA>

How can I transform this data frame into one like the following, with one 0/1 column per animal? Note that I don't know the column names (the animals people will name) in advance.

 r <- 1000
 c <- length(letters)
 t1 <- matrix(rbinom(r*c,1,0.5),r,c)
 colnames(t1) <- letters
 head(t1)
 #      a b c d e f g h i j k l m n o p q r s t u v w x y z
 # [1,] 0 1 0 1 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0
 # [2,] 1 1 1 1 0 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 0 1 0 1 0 1
 # [3,] 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
 # [4,] 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 0 1 1 0 0
 # [5,] 1 0 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 0
 # [6,] 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 1

Asked by not_a_number on Dec 25 '22 at 19:12


2 Answers

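# Distinct animal names (NA dropped), computed once rather than once per row
animals <- sort(unique(na.omit(as.character(unlist(df)))))
td <- data.frame(t(apply(df, 1, function(x) as.numeric(animals %in% x))))
colnames(td) <- animals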

If the full set of animal names is known in advance (letters in the simulated data, or colnames(t1)), that vector can be used in place of the one computed from df.
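
To make the %in% step concrete, here is a minimal sketch on a toy data frame. The names toy, vals, animals and ind, and the animal values themselves, are made up for illustration and are not part of the question's data. Each row is compared against the vector of distinct names, which yields one 0/1 indicator per animal:

```r
# Toy data: three respondents, up to three animals each (hypothetical values)
toy <- data.frame(V1 = c("cat", "dog", "cat"),
                  V2 = c("dog", NA, "emu"),
                  V3 = c(NA, NA, "dog"),
                  stringsAsFactors = FALSE)

# Distinct, non-missing animal names across all columns
vals <- unlist(toy, use.names = FALSE)
animals <- sort(unique(vals[!is.na(vals)]))

# For each row, flag which of the distinct animals appear in it
ind <- t(apply(toy, 1, function(x) as.integer(animals %in% x)))
colnames(ind) <- animals
ind
#      cat dog emu
# [1,]   1   1   0
# [2,]   0   1   0
# [3,]   1   1   1
```

Computing the vector of distinct names once, outside apply, avoids repeating that work for every row.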

Answered by germcd on Feb 13 '23 at 03:02


You can also do this with tidyr (plus dplyr's filter), which may be much faster than other approaches, though I like @germcd's approach very much. You may need to adjust the filter step, which drops NAs as well as blank strings (" "); the blanks may be an artifact of the simulated data you provided:

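library(tidyr)
library(dplyr)  # filter() comes from dplyr, not tidyr

##  Add an ID for each record:
df$id <- seq_len(nrow(df))

out <- df %>%
  gather(column, animal, -id) %>%             # long format: one row per id and column
  filter(!is.na(animal), animal != " ") %>%   # drop missing and blank answers
  spread(animal, column)                      # one column per distinct animal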

head(out)

This code gathers the unnamed columns into long format, drops rows where the answer is missing or blank, and then spreads the result so that each distinct animal becomes its own column. Each cell holds the name of the original column (V1 to V10) in which that respondent named the animal, or NA if they did not name it, so the order in which the animals were named is preserved. If you don't need that, you can convert the animal columns to 0/1 indicators:

out_num <- out
out_num[,-1] <- as.numeric((!is.na(out[,-1])))
head(out_num)
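
In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The following is a minimal sketch of the same reshaping with those functions, assuming tidyr 1.0 or later and dplyr are installed. The toy data and the names toy, present and wide are hypothetical and not part of the answer above. It produces the 0/1 columns directly instead of the V1..V10 position labels:

```r
library(dplyr)
library(tidyr)

# Toy data: three respondents, up to three animals each (hypothetical values)
toy <- data.frame(V1 = c("cat", "dog", "cat"),
                  V2 = c("dog", NA, "emu"),
                  V3 = c(NA, NA, "dog"),
                  stringsAsFactors = FALSE)
toy$id <- seq_len(nrow(toy))

wide <- toy %>%
  # Long format: one row per respondent and named animal, NAs dropped
  pivot_longer(-id, names_to = "column", values_to = "animal",
               values_drop_na = TRUE) %>%
  # Mark each named animal as present and discard the position label
  mutate(present = 1L) %>%
  select(-column) %>%
  # Wide format: one column per animal, 0 where the animal was not named
  pivot_wider(names_from = animal, values_from = present,
              values_fill = list(present = 0L))

wide
```

This sketch assumes each respondent names a given animal at most once, as in the simulated data; duplicates within a row would need to be removed (for example with distinct) before widening.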

Answered by Forrest R. Stevens on Feb 13 '23 at 02:02