I'm using a data.frame:
data.frame("A"=c(NA,5,NA,NA,NA),
"B"=c(1,2,3,4,NA),
"C"=c(NA,NA,NA,2,3),
"D"=c(NA,NA,NA,7,NA))
This delivers a data.frame in this form:
   A  B  C  D
1 NA  1 NA NA
2  5  2 NA NA
3 NA  3 NA NA
4 NA  4  2  7
5 NA NA  3 NA
My aim is to check each row of the data.frame for values greater than a specific threshold (let's assume 2) and to get the names of the columns where this is the case.
The desired output (values greater than 2) should be:
for row 1 of the data.frame
x[1,]: c()
for row 2
x[2,]: c("A")
for row 3
x[3,]: c("B")
for row 4
x[4,]: c("B","D")
and for row 5 of the data.frame
x[5,]: c("C")
Thanks for your help!
You can use which:
lapply(apply(dat, 1, function(x)which(x>2)), names)
with dat being your data frame.
[[1]]
character(0)
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "B" "D"
[[5]]
[1] "C"
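To see why this works despite the NAs, it can help to look at the intermediate steps. The following is only a small sketch, assuming dat is the data frame from the question; the names m, row4 and res are just for illustration:

```r
dat <- data.frame(A = c(NA, 5, NA, NA, NA),
                  B = c(1, 2, 3, 4, NA),
                  C = c(NA, NA, NA, 2, 3),
                  D = c(NA, NA, NA, 7, NA))

# Comparing a data frame with a number gives a logical matrix;
# NA entries stay NA.
m <- dat > 2

# which() keeps only the TRUE positions (NA is dropped) and carries
# the column names along as names of the index vector.
row4 <- which(m[4, ])
names(row4)
# "B" "D"

# The same idea applied to every row.
res <- lapply(apply(dat, 1, function(x) which(x > 2)), names)
```

Because which() silently skips NA, no extra is.na() handling is needed here.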
EDIT Shorter version suggested by flodel:
lapply(apply(dat > 2, 1, which), names)
Edit: (from Arun)
First, there's no need for lapply and apply. You can get the same just with apply:
apply(dat > 2, 1, function(x) names(which(x)))
But using apply on a data.frame will coerce it into a matrix, which may not be wise if the data.frame is huge.
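If calling a function once per row is the concern, one possible alternative (not taken from the answers here, and not benchmarked) is to collect all matching positions in one go with which(..., arr.ind = TRUE) and then group the column names by row. Note that dat > 2 is still a logical matrix, so this sketch does not avoid the matrix itself. It assumes dat is the data frame from the question; idx and res are illustrative names:

```r
dat <- data.frame(A = c(NA, 5, NA, NA, NA),
                  B = c(1, 2, 3, 4, NA),
                  C = c(NA, NA, NA, 2, 3),
                  D = c(NA, NA, NA, 7, NA))

# Row/column positions of all entries greater than 2 (NA is skipped).
idx <- which(dat > 2, arr.ind = TRUE)

# Group the matching column names by row. The explicit factor levels
# keep rows without any match as empty elements.
res <- split(names(dat)[idx[, "col"]],
             factor(idx[, "row"], levels = seq_len(nrow(dat))))
```

The result is a list named by row number, with character(0) for rows that have no value above the threshold.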
To answer @flodel's concerns, I'll write it as a separate answer:
"lapply gets a list and apply doesn't guarantee this always":
A fair point. I'll illustrate the issue with an example:
df <- structure(list(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA),
C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA)), .Names = c("A",
"B", "C", "D"), row.names = c(NA, -5L), class = "data.frame")
   A  B  C  D
1  3  1 NA NA
2  5  2 NA NA
3 NA  3 NA NA
4 NA  1  2  7
5 NA NA  3 NA
# using `apply` results in a vector:
apply(df, 1, function(x) names(which(x>2)))
# [1] "A" "A" "B" "D" "C"
So, how can we guarantee a list with apply?
By creating a list within the function argument and then using unlist with recursive = FALSE, as shown below:
unlist(apply(df, 1, function(x) list(names(which(x>2)))), recursive=FALSE)
[[1]]
[1] "A"
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "D"
[[5]]
[1] "C"
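In more recent versions of R, apply also has a simplify argument, which may be a more direct way to always get a list. The sketch below assumes R 4.1.0 or later (where, as far as I know, this argument was added) and rebuilds the df from above; v and l are illustrative names:

```r
df <- data.frame(A = c(3, 5, NA, NA, NA),
                 B = c(1, 2, 3, 1, NA),
                 C = c(NA, NA, NA, 2, 3),
                 D = c(NA, NA, NA, 7, NA))

# Every row has exactly one match here, so apply() simplifies
# the result to a character vector.
v <- apply(df, 1, function(x) names(which(x > 2)))

# simplify = FALSE (assumption: R >= 4.1.0) keeps one list element per row.
l <- apply(df, 1, function(x) names(which(x > 2)), simplify = FALSE)
```

On older R versions the unlist(..., recursive = FALSE) approach shown above remains the way to go.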
"lapply is overall shorter, and does not require anonymous function":
Yes, but it's slower. Let me illustrate this on a big example.
set.seed(45)
df <- as.data.frame(matrix(sample(c(1:10, NA), 1e5 * 100, replace=TRUE),
ncol = 100))
system.time(t1 <- lapply(apply(df > 2, 1, which), names))
user system elapsed
5.025 0.342 5.651
system.time(t2 <- unlist(apply(df, 1, function(x)
list(names(which(x>2)))), recursive=FALSE))
user system elapsed
2.860 0.181 3.065
identical(t1, t2) # TRUE
lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)])
First, I don't see what's wrong. If you're talking about the list being unnamed, this can be changed by setting the names just once at the end.
Second, unfortunately, using split on a huge data.frame, which results in a very large number of split elements, will be terribly slow (due to the huge number of factor levels).
# testing on huge data.frame
system.time(t3 <- lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)]))
user system elapsed
517.545 0.312 517.872
Third, this orders the elements as 1, 10, 100, 1000, 10000, 100000, ... instead of 1 .. 1e5. Instead, one could use setNames or setattr (from the data.table package) to set the names just once at the end, as shown below:
# setting names just once
t2 <- setNames(t2, rownames(df)) # by copy
# or even better using `data.table` `setattr` function to
# set names by reference
require(data.table)
tracemem(t2)
setattr(t2, 'names', rownames(df))
tracemem(t2)
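The ordering issue is easy to reproduce on a small scale. The sketch below uses a hypothetical vector of 12 ids in place of the real row names; by_text and by_row are illustrative names:

```r
ids <- as.character(1:12)

# split() turns a character grouping into a factor whose levels are
# sorted as text, so "10" comes before "2".
by_text <- names(split(1:12, ids))

# Fixing the level order up front keeps the original row order.
by_row <- names(split(1:12, factor(ids, levels = ids)))
```

This only fixes the ordering; it does not help with the slowness of split on many groups.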
Comparing the output doesn't show any other difference between the two (t3 and t2). You could run this to verify that the outputs are the same (time-consuming):
all(sapply(names(t2), function(x) all(t2[[x]] == t3[[x]])) == TRUE) # TRUE