Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reshape data frame with different column lengths into two columns replicating column ID

I have the following data frame, with different row lengths:

myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
                             c("Walter","NA","NA","NA","NA"),
                             c("Walter","Jesse","NA","NA","NA"),
                             c("Gus","Tuco","Mike","NA","NA"), 
                             c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))   
data.frame(ID,myvar)

ID     V1    V2   V3   V4    V5
 1 Walter    NA   NA   NA    NA
 2 Walter    NA   NA   NA    NA
 3 Walter Jesse   NA   NA    NA
 4    Gus  Tuco Mike   NA    NA
 5    Gus  Mike Hank Saul Flynn

My goal is to switch this data frame into a two column data frame. The first column would be the ID and the other one would be the character name. Note that the ID must be correspondent to the row the character were originally placed. I'm expecting the following result:

ID      V
1  Walter    
2  Walter
3  Walter
3  Jesse
4  Gus
4  Tuco
4  Mike
5  Gus
5  Mike
5  Hank
5  Saul
5  Flynn

I've tried dcast {reshape2} but it doesn't returned what I need. It is noteworthy that my original data frame is quite big. Any tips? Cheers.

like image 698
ALS.Meyer Avatar asked Apr 07 '15 03:04

ALS.Meyer


4 Answers

You could use unlist

 res <- subset(data.frame(ID,value=unlist(myvar[-1], 
                              use.names=FALSE)), value!='NA')
 res
 #   ID  value
 #1   1 Walter
 #2   2 Walter
 #3   3 Walter
 #4   4    Gus
 #5   5    Gus
 #6   3  Jesse
 #7   4   Tuco
 #8   5   Mike
 #9   4   Mike
 #10  5   Hank
 #11  5   Saul
 #12  5  Flynn

NOTE: The NAs are 'character' elements in the dataset, it is better to create it without quotes so that it will be real NAs and we can remove it by na.omit, is.na, complete.cases etc.

data

myvar <- data.frame(ID,myvar)
like image 54
akrun Avatar answered Sep 29 '22 15:09

akrun


myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
                             c("Walter","NA","NA","NA","NA"),
                             c("Walter","Jesse","NA","NA","NA"),
                             c("Gus","Tuco","Mike","NA","NA"), 
                             c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))   
df <- data.frame(ID, myvar)

Using base reshape. (I'm converting your "NA" character strings to NA which you may not have to do, this is just due to how you created this example)

df[df == 'NA'] <- NA
na.omit(reshape(df, direction = 'long', varying = list(2:6))[, c('ID','V1')])

#     ID     V1
# 1.1  1 Walter
# 2.1  2 Walter
# 3.1  3 Walter
# 4.1  4    Gus
# 5.1  5    Gus
# 3.2  3  Jesse
# 4.2  4   Tuco
# 5.2  5   Mike
# 4.3  4   Mike
# 5.3  5   Hank
# 5.4  5   Saul
# 5.5  5  Flynn

or using reshape2

library('reshape2')
## na.omit(melt(df, id.vars = 'ID')[, c('ID','value')])

## or better yet as ananda suggests:
melt(df, id.vars = 'ID', na.rm = TRUE)[, c('ID','value')]

#    ID  value
# 1   1 Walter
# 2   2 Walter
# 3   3 Walter
# 4   4    Gus
# 5   5    Gus
# 8   3  Jesse
# 9   4   Tuco
# 10  5   Mike
# 14  4   Mike
# 15  5   Hank
# 20  5   Saul
# 25  5  Flynn

you get warnings that the factor levels over the columns are not the same but that's fine.

like image 35
rawr Avatar answered Sep 29 '22 16:09

rawr


Fix your "NA" so that they are actually NA first:

mydf[mydf == "NA"] <- NA

Using some subsetting to do it all in one fell swoop:

data.frame(ID=mydf$ID[row(mydf[-1])[!is.na(mydf[-1])]], V=mydf[-1][!is.na(mydf[-1])])

#   ID      V
#1   1 Walter
#2   2 Walter
#3   3 Walter
#4   4    Gus
#5   5    Gus
#6   3  Jesse
#7   4   Tuco
#8   5   Mike
#9   4   Mike
#10  5   Hank
#11  5   Saul
#12  5  Flynn

Or much more readable in base R:

sel <- which(!is.na(mydf[-1]), arr.ind=TRUE)
data.frame(ID=mydf$ID[sel[,1]], V=mydf[-1][sel])
like image 32
thelatemail Avatar answered Sep 29 '22 16:09

thelatemail


Using tidyr

library("tidyr")

myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
                             c("Walter","NA","NA","NA","NA"),
                             c("Walter","Jesse","NA","NA","NA"),
                             c("Gus","Tuco","Mike","NA","NA"), 
                             c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))   

myvar <- data.frame(ID,myvar)

myvar %>% 
    gather(ID, Name, V1:V5 ) %>%
    select(ID, value) %>%
    filter(value != "NA")

If your NAs are coded as NA instead of "NA", then we can in fact use the na.rm = TRUE option in gather. E.g.:

myvar[myvar == "NA"] <- NA
myvar %>% 
    gather(ID, Name, V1:V5, na.rm = TRUE ) %>%
    select(ID, value)

gives

   ID  value
1   1 Walter
2   2 Walter
3   3 Walter
4   4    Gus
5   5    Gus
6   3  Jesse
7   4   Tuco
8   5   Mike
9   4   Mike
10  5   Hank
11  5   Saul
12  5  Flynn
like image 33
Alex Avatar answered Sep 29 '22 17:09

Alex