Select names of columns which contain specific values in row

Tags: dataframe, r

I'm using a data.frame:

        data.frame("A"=c(NA,5,NA,NA,NA),
                   "B"=c(1,2,3,4,NA),
                   "C"=c(NA,NA,NA,2,3),
                   "D"=c(NA,NA,NA,7,NA))

This delivers a data.frame in this form:

   A  B  C  D
1 NA  1 NA NA
2  5  2 NA NA
3 NA  3 NA NA
4 NA  4  2  7
5 NA NA  3 NA

For each row of the data.frame, I want to check whether any value is greater than a given threshold (let's assume 2) and get the names of the columns where this is the case.

The desired output (values greater than 2) should be:

for row 1 of the data.frame
x[1,]: c()

for row 2
x[2,]: c("A")

for row 3
x[3,]: c("B")

for row 4
x[4,]: c("B","D")

and for row 5 of the data.frame
x[5,]: c("C")

Thanks for your help!

elJorge asked Jun 23 '13 14:06



2 Answers

You can use which:

lapply(apply(dat, 1, function(x)which(x>2)), names)

with dat being your data frame.

[[1]]
character(0)

[[2]]
[1] "A"

[[3]]
[1] "B"

[[4]]
[1] "B" "D"

[[5]]
[1] "C"
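
Why this copes with the NAs in the data: comparing NA with 2 gives NA, and which() only returns positions that are TRUE, so missing cells are dropped without any extra handling. A minimal sketch, assuming the question's data frame is stored under the name dat:

```r
# The question's data frame, stored as `dat`
dat <- data.frame(A = c(NA, 5, NA, NA, NA),
                  B = c(1, 2, 3, 4, NA),
                  C = c(NA, NA, NA, 2, 3),
                  D = c(NA, NA, NA, 7, NA))

# Row 4 as a named numeric vector, similar to what apply() hands to the function
row4 <- unlist(dat[4, ])

row4 > 2                # logical vector; the NA in column A stays NA
which(row4 > 2)         # named integer positions; the NA is skipped
names(which(row4 > 2))  # "B" "D"
```

The same three steps run once per row inside the apply() call above.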

Edit: a shorter version suggested by flodel:

lapply(apply(dat > 2, 1, which), names)

Edit: (from Arun)

First, there's no need for both lapply and apply. You can get the same result with apply alone:

apply(dat > 2, 1, function(x) names(which(x)))

But, using apply on a data.frame will coerce it into a matrix, which may not be wise if the data.frame is huge.
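
The coercion also matters for correctness, not only for size. Here is a small sketch with a hypothetical data frame (mixed, not from the question) that has one character column. apply() would work on as.matrix(mixed), which is a character matrix, so values would be compared as strings. One possible workaround, which is my assumption rather than something proposed in this answer, is to keep only the numeric columns and loop over row indices:

```r
# Hypothetical data frame with a non-numeric column (not from the question)
mixed <- data.frame(id = c("a", "b", "c"),
                    A  = c(1, 10, NA),
                    B  = c(3, NA, 2),
                    stringsAsFactors = FALSE)

# apply() would operate on this matrix: every cell is now a string
m <- as.matrix(mixed)
is.character(m)  # TRUE

# Keep only the numeric columns, then test each row without apply()
num <- mixed[vapply(mixed, is.numeric, logical(1))]
res <- lapply(seq_len(nrow(num)), function(i) {
  row_i <- unlist(num[i, , drop = FALSE])  # named numeric vector for row i
  names(num)[which(row_i > 2)]
})
res
```

This always returns a list, one element per row, with character(0) for rows that have no match.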

user1981275 answered Oct 22 '22 11:10


To answer @flodel's concerns, I'll write it as a separate answer:

1) Using lapply always returns a list, whereas apply doesn't guarantee this:

A fair point. I'll illustrate the issue with an example:

df <- structure(list(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA), 
    C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA)), .Names = c("A", 
"B", "C", "D"), row.names = c(NA, -5L), class = "data.frame")

   A  B  C  D
1  3  1 NA NA
2  5  2 NA NA
3 NA  3 NA NA
4 NA  1  2  7
5 NA NA  3 NA

# using `apply` results in a vector:
apply(df, 1, function(x) names(which(x>2)))
# [1] "A" "A" "B" "D" "C"

So, how can we guarantee a list with apply?

By wrapping the result in a list inside the function and then calling unlist with recursive = FALSE, as shown below:

unlist(apply(df, 1, function(x) list(names(which(x>2)))), recursive=FALSE)
[[1]]
[1] "A"

[[2]]
[1] "A"

[[3]]
[1] "B"

[[4]]
[1] "D"

[[5]]
[1] "C"
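
The underlying reason is that apply() simplifies its result based on the lengths returned per row: all of length 1 gives a vector, all of the same length greater than 1 gives a matrix, and only differing lengths give a list. A small sketch with two hypothetical data frames (d1 and d2, not the df used above) shows the matrix case and how wrapping in list() keeps the shape stable:

```r
f <- function(x) names(which(x > 2))

# Every row has exactly two matches -> apply() returns a character matrix
d2 <- data.frame(A = c(3, 4), B = c(5, 6), C = c(1, NA))
r_matrix <- apply(d2, 1, f)
is.matrix(r_matrix)  # TRUE: one column per row of d2

# Rows with differing numbers of matches -> apply() returns a list
d1 <- data.frame(A = c(3, 4), B = c(5, 1), C = c(1, NA))
r_list <- apply(d1, 1, f)
is.list(r_list)      # TRUE

# Wrapping each result in list() gives a list in both cases
g <- function(x) list(f(x))
fixed2 <- unlist(apply(d2, 1, g), recursive = FALSE)
fixed1 <- unlist(apply(d1, 1, g), recursive = FALSE)
is.list(fixed2)      # TRUE
is.list(fixed1)      # TRUE
```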

2) lapply is overall shorter and does not require an anonymous function:

Yes, but it's slower. Let me illustrate this with a large example.

set.seed(45)
df <- as.data.frame(matrix(sample(c(1:10, NA), 1e5 * 100, replace=TRUE), 
               ncol = 100))

system.time(t1 <- lapply(apply(df > 2, 1, which), names))
   user  system elapsed 
  5.025   0.342   5.651 

system.time(t2 <- unlist(apply(df, 1, function(x) 
            list(names(which(x>2)))), recursive=FALSE))
   user  system elapsed 
  2.860   0.181   3.065 

identical(t1, t2) # TRUE
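
If speed matters, it may be worth avoiding the per-row function call altogether. The following is a sketch of a different technique from the two timed above, not something proposed in this thread: locate all TRUE cells at once with which(..., arr.ind = TRUE) and group the column names by row. I have not timed it against the figures above, so treat any speed benefit as an assumption to verify on your own data. It is shown on the question's small data frame (named dat here) so the result is easy to check:

```r
# The question's data frame
dat <- data.frame(A = c(NA, 5, NA, NA, NA),
                  B = c(1, 2, 3, 4, NA),
                  C = c(NA, NA, NA, 2, 3),
                  D = c(NA, NA, NA, 7, NA))

m   <- as.matrix(dat) > 2        # logical matrix, NA where dat is NA
idx <- which(m, arr.ind = TRUE)  # one (row, col) pair per TRUE cell

# Group column names by row; the explicit levels keep rows with no match
rows <- factor(idx[, "row"], levels = seq_len(nrow(m)))
res  <- unname(split(colnames(m)[idx[, "col"]], rows))
res
```

Because the factor levels are the row indices 1..n in numeric order, the list comes back in row order, with character(0) for rows that have no value above the threshold.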

3) All answers are wrong, and this is the answer that'll work with all inputs:

lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)])

First, I don't see what's wrong with them. If you mean that the list is unnamed, that can be fixed by setting the names once at the end.

Second, unfortunately, using split on a huge data.frame produces one element per row, which is terribly slow (because of the huge number of factor levels).

# testing on huge data.frame
system.time(t3 <- lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)]))
   user  system elapsed
517.545   0.312 517.872

Third, this orders the elements as 1, 10, 100, 1000, 10000, 100000, ... instead of 1 .. 1e5, because the row names are sorted as character strings. Instead, one could use setNames or setattr (from the data.table package) to set the names just once at the end, as shown below:

# setting names just once
t2 <- setNames(t2, rownames(df)) # by copy

# or even better using `data.table` `setattr` function to 
# set names by reference
require(data.table)
tracemem(t2)
setattr(t2, 'names', rownames(df))
tracemem(t2)
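
The ordering issue is easy to reproduce on a small scale. This sketch uses a hypothetical 12-row data frame (small, not the benchmark data above): split() turns the row names into a factor, and factor levels are sorted as strings, so "10" comes before "2". Indexing the result by the original row names is one way to restore row order:

```r
# Hypothetical 12-row data frame (not the benchmark data above)
small <- data.frame(A = 1:12, B = 12:1)

by_row <- lapply(split(small, rownames(small)),
                 function(x) names(x)[which(x > 2)])
names(by_row)    # starts "1" "10" "11" "12" "2": sorted as strings

# Reindexing by the original row names restores row order
in_order <- by_row[rownames(small)]
names(in_order)  # "1" "2" "3" ... "12"
```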

Comparing the outputs doesn't show any other difference between the two (t3 and t2). You can run this to verify that the outputs are the same (time consuming):

all(sapply(names(t2), function(x) all(t2[[x]] == t3[[x]])) == TRUE) # TRUE

Arun answered Oct 22 '22 13:10