 

Get Value of last non-empty column for each row

Tags: string, r, dplyr

Take this sample data:

df <- data.frame(a_1 = c("Apple", "Grapes", "Melon", "Peach"), a_2 = c("Nuts", "Kiwi", "Lime", "Honey"), a_3 = c("Plum", "Apple", NA, NA), a_4 = c("Cucumber", NA, NA, NA))

     a_1   a_2   a_3      a_4
1  Apple  Nuts  Plum Cucumber
2 Grapes  Kiwi Apple     <NA>
3  Melon  Lime  <NA>     <NA>
4  Peach Honey  <NA>     <NA>

Basically, I want to run grep on the last non-NA value of each row. Thus my x in grep("pattern", x) should be:

Cucumber
Apple
Lime
Honey

I have a per-row count which tells me which a_N is the last one:

numcol <- rowSums(!is.na(df[,grep("(^a_)\\d", colnames(df))])) 

So far I have tried something like this in combination with ave(), apply() and dplyr:

grepl("pattern",df[,sprintf("a_%i",numcol)])

However, I can't quite make it work. Keep in mind that my dataset is very large, so I was hoping for a vectorized solution or maybe dplyr. Help would be greatly appreciated.

Edit: Thanks, that is a really good solution. My thinking was too complicated. (The regex is due to my more specific data.)

cover51 asked Sep 16 '14 19:09


2 Answers

There's no need for regex here. Just use apply + tail + na.omit (mydf below is the sample data frame from the question):

> apply(mydf, 1, function(x) tail(na.omit(x), 1))
[1] "Cucumber" "Apple"    "Lime"     "Honey" 
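
To get from this result to the grep in the question, the extracted vector can be passed straight to grepl(). The following is a minimal sketch, not part of the original answer: `last_non_na` is a hypothetical helper name, and "^A" is only a placeholder pattern. The helper also returns NA for a row with no values at all, a case where tail(na.omit(x), 1) would yield a zero-length result.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps plain strings).
df <- data.frame(
  a_1 = c("Apple", "Grapes", "Melon", "Peach"),
  a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
  a_3 = c("Plum", "Apple", NA, NA),
  a_4 = c("Cucumber", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last non-NA value of a vector, or NA if there is none.
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) NA_character_ else x[[length(x)]]
}

# apply() hands each row to the helper as a character vector,
# so the result holds one string per row.
last_vals <- apply(df, 1, last_non_na)

# "^A" is a placeholder pattern chosen for illustration only.
hits <- grepl("^A", last_vals)
print(hits)
# [1] FALSE  TRUE FALSE FALSE
```

Only row 2 matches here, because its last non-NA value is "Apple".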

I don't know how this compares in terms of speed, but you can also use a combination of "data.table" and "reshape2", like this:

library(data.table)
library(reshape2)
na.omit(melt(as.data.table(mydf, keep.rownames = TRUE), 
             id.vars = "rn"))[, value[.N], by = rn]
#    rn       V1
# 1:  1 Cucumber
# 2:  2    Apple
# 3:  3     Lime
# 4:  4    Honey

Or, even better:

melt(as.data.table(mydf, keep.rownames = TRUE), 
     id.vars = "rn", na.rm = TRUE)[, value[.N], by = rn]
#    rn       V1
# 1:  1 Cucumber
# 2:  2    Apple
# 3:  3     Lime
# 4:  4    Honey

This would be much faster. On an 800k-row dataset, apply took ~ 50 seconds while the data.table approach took about 2.5 seconds.
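
A comparison like this can be checked on your own machine with system.time(). The sketch below is an illustration under assumptions, not a reproduction of the figures above. It uses synthetic data (20,000 rows, trailing NAs only, at least one value per row) and compares apply() against a base-R variant that turns the question's numcol count into a matrix index. It deliberately avoids data.table so that it runs without extra packages.

```r
set.seed(1)
n <- 20000L  # illustrative size, far smaller than the 800k rows mentioned above

# Synthetic data (an assumption): four character columns named a_1..a_4.
vals <- matrix(sample(c("Apple", "Kiwi", "Lime", "Plum"), n * 4L, replace = TRUE),
               nrow = n, dimnames = list(NULL, paste0("a_", 1:4)))
keep <- sample(1:4, n, replace = TRUE)  # number of filled columns in each row
vals[col(vals) > keep] <- NA            # everything after that becomes NA
big <- as.data.frame(vals, stringsAsFactors = FALSE)

# Row-wise approach from the answer above.
t_apply <- system.time(
  res_apply <- apply(big, 1, function(x) tail(na.omit(x), 1))
)

# Vectorized variant: numcol (as in the question) is the column of the last
# value, which is only valid when NAs are strictly trailing.
t_index <- system.time({
  numcol <- rowSums(!is.na(big))
  res_index <- as.matrix(big)[cbind(seq_len(nrow(big)), numcol)]
})

# Both approaches should agree on this data.
stopifnot(identical(unname(res_apply), unname(res_index)))
print(c(apply = t_apply[["elapsed"]], index = t_index[["elapsed"]]))
```

Absolute timings depend on hardware, R version, and data shape, so treat the printed numbers as relative only.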

A5C1D2H2I1M1N2O1R2T1 answered Nov 07 '22 15:11


Another alternative that might be pretty fast:

DF[cbind(seq_len(nrow(DF)), max.col(!is.na(DF), "last"))]
#[1] "Cucumber" "Apple"    "Lime"     "Honey"

Where "DF":

DF = structure(list(a_1 = structure(1:4, .Label = c("Apple", "Grapes", 
"Melon", "Peach"), class = "factor"), a_2 = structure(c(4L, 2L, 
3L, 1L), .Label = c("Honey", "Kiwi", "Lime", "Nuts"), class = "factor"), 
    a_3 = structure(c(2L, 1L, NA, NA), .Label = c("Apple", "Plum"
    ), class = "factor"), a_4 = structure(c(1L, NA, NA, NA), .Label = "Cucumber", class = "factor")), .Names = c("a_1", 
"a_2", "a_3", "a_4"), row.names = c(NA, -4L), class = "data.frame")
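
To see why the one-liner works, it can help to split it into steps. The sketch below is an illustration under assumptions: `DF2` is a made-up variant of the sample data with a gap in the middle of row 3 and a row that is entirely NA. Neither case appears in the question.

```r
# Illustrative data (an assumption): row 3 has an NA between two values,
# and row 4 has no values at all.
DF2 <- data.frame(
  a_1 = c("Apple", "Grapes", "Melon", NA),
  a_2 = c("Nuts", "Kiwi", NA, NA),
  a_3 = c("Plum", "Apple", "Lime", NA),
  a_4 = c("Cucumber", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Step 1: logical matrix, TRUE wherever a value is present.
present <- !is.na(DF2)

# Step 2: TRUE counts as 1 and FALSE as 0, so each row's maximum is TRUE;
# ties.method = "last" selects the right-most TRUE column.
last_col <- max.col(present, ties.method = "last")

# Step 3: a two-column (row, column) matrix picks one cell per row.
idx <- cbind(seq_len(nrow(DF2)), last_col)
result <- DF2[idx]
print(result)
# [1] "Cucumber" "Apple"    "Lime"     NA

# For comparison, counting non-NA values per row (numcol in the question)
# gives 2 for row 3, which would point at the NA in column a_2.
numcol <- rowSums(present)
```

In the all-NA row every column ties at FALSE, so the last column is chosen and the extracted value is NA. That may be a convenient default, but it is worth checking against what your data needs.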

alexis_laz answered Nov 07 '22 15:11