Select names of columns which contain specific values in row

Tags:

dataframe

r

I'm using a data.frame:

        data.frame("A"=c(NA,5,NA,NA,NA),
                   "B"=c(1,2,3,4,NA),
                   "C"=c(NA,NA,NA,2,3),
                   "D"=c(NA,NA,NA,7,NA))

This delivers a data.frame in this form:

   A  B  C  D
1 NA  1 NA NA
2  5  2 NA NA
3 NA  3 NA NA
4 NA  4  2  7
5 NA NA  3 NA

My aim is to check each row of the data.frame for values greater than a given threshold (let's assume 2) and to get the names of the columns where this is the case.

The desired output (values greater than 2) would be:

for row 1 of the data.frame
x[1,]: c()

for row 2
x[2,]: c("A")

for row 3
x[3,]: c("B")

for row 4
x[4,]: c("B","D")

and for row 5 of the data.frame
x[5,]: c("C")

Thanks for your help!

asked Jun 23 '13 14:06

elJorge

2 Answers

You can use which:

lapply(apply(dat, 1, function(x)which(x>2)), names)

with dat being your data frame.

[[1]]
character(0)

[[2]]
[1] "A"

[[3]]
[1] "B"

[[4]]
[1] "B" "D"

[[5]]
[1] "C"

EDIT: A shorter version, suggested by flodel:

lapply(apply(dat > 2, 1, which), names)

Edit (from Arun):

First, there's no need to combine lapply and apply. On this data you get the same result with apply alone:

apply(dat > 2, 1, function(x) names(which(x)))

Note, however, that apply first coerces a data.frame to a matrix, which can be costly if the data.frame is huge.
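As a rough sketch of one way to sidestep the matrix coercion, the version below works column by column and then regroups the hits by row. It is not part of the original answer; the names `hits`, `rows`, `cols`, and `res` are illustrative, and `dat` is rebuilt from the question.

```r
# Data from the question
dat <- data.frame("A" = c(NA, 5, NA, NA, NA),
                  "B" = c(1, 2, 3, 4, NA),
                  "C" = c(NA, NA, NA, 2, 3),
                  "D" = c(NA, NA, NA, 7, NA))

# For each column, the row indices whose value exceeds 2 (which() skips NA)
hits <- lapply(dat, function(col) which(col > 2))

# Flatten to parallel vectors: one row index and one column name per hit
rows <- unlist(hits, use.names = FALSE)
cols <- rep(names(hits), lengths(hits))

# Regroup column names by row; the explicit levels keep rows without any hit
res <- split(cols, factor(rows, levels = seq_len(nrow(dat))))
res
```

Because the factor levels are given explicitly, rows with no matching column still appear in the result as `character(0)`.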

answered Oct 22 '22 11:10

user1981275

To address @flodel's concerns, I'm writing this up as a separate answer:

1) `lapply` always returns a list, whereas `apply` doesn't guarantee this:

A fair point. I'll illustrate the issue with an example:

df <- structure(list(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA), 
    C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA)), .Names = c("A", 
"B", "C", "D"), row.names = c(NA, -5L), class = "data.frame")

   A  B  C  D
1  3  1 NA NA
2  5  2 NA NA
3 NA  3 NA NA
4 NA  1  2  7
5 NA NA  3 NA

# using `apply` results in a vector:
apply(df, 1, function(x) names(which(x>2)))
# [1] "A" "A" "B" "D" "C"

So, how can we guarantee a list with apply?

By wrapping the result in a list inside the function and then calling unlist with recursive = FALSE, as shown below:

unlist(apply(df, 1, function(x) list(names(which(x>2)))), recursive=FALSE)
[[1]]
[1] "A"

[[2]]
[1] "A"

[[3]]
[1] "B"

[[4]]
[1] "D"

[[5]]
[1] "C"
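Another way to always get a list, sketched here under the assumption that all columns are numeric, is to loop over the row indices with `lapply`, which never simplifies its result. The helper name `cols_above` is hypothetical and not from the original answer.

```r
# Hypothetical helper: for each row, the names of the columns exceeding `threshold`
cols_above <- function(df, threshold) {
  m <- as.matrix(df) > threshold            # logical matrix, NA where data is NA
  lapply(seq_len(nrow(m)), function(i) {
    colnames(m)[which(m[i, ])]              # which() drops NA and FALSE
  })
}

df <- data.frame(A = c(3, 5, NA, NA, NA),
                 B = c(1, 2, 3, 1, NA),
                 C = c(NA, NA, NA, 2, 3),
                 D = c(NA, NA, NA, 7, NA))

res_list <- cols_above(df, 2)
is.list(res_list)  # a list even though every row has exactly one hit
```

The same helper applied to the data from the question returns an empty character vector for the first row rather than dropping it.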

2) `lapply` is shorter overall and does not require an anonymous function:

Yes, but it's slower. Let me illustrate this on a big example.

set.seed(45)
df <- as.data.frame(matrix(sample(c(1:10, NA), 1e5 * 100, replace=TRUE), 
               ncol = 100))

system.time(t1 <- lapply(apply(df > 2, 1, which), names))
   user  system elapsed 
  5.025   0.342   5.651 

system.time(t2 <- unlist(apply(df, 1, function(x) 
            list(names(which(x>2)))), recursive=FALSE))
   user  system elapsed 
  2.860   0.181   3.065 

identical(t1, t2) # TRUE
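For a quick check that does not take several seconds, the same comparison can be sketched on a much smaller data.frame. The names `small`, `r1`, `r2`, and the chosen size are assumptions for illustration; timings at this scale are noisy and will differ between machines, so only the agreement of the results is meaningful here.

```r
set.seed(45)
# 1000 rows x 20 columns instead of 1e5 x 100
small <- as.data.frame(matrix(sample(c(1:10, NA), 1000 * 20, replace = TRUE),
                              ncol = 20))

time_lapply <- system.time(
  r1 <- lapply(apply(small > 2, 1, which), names)
)
time_apply <- system.time(
  r2 <- unlist(apply(small, 1, function(x) list(names(which(x > 2)))),
               recursive = FALSE)
)

# Compare without list names, since whether row names are carried over
# can differ between the two approaches
same <- identical(unname(r1), unname(r2))
same

# Elapsed seconds for each variant
c(lapply_apply = time_lapply[["elapsed"]], apply_only = time_apply[["elapsed"]])
```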

3) All answers are wrong; here is the answer that'll work with all inputs:

lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)])

First, I don't see what's wrong. If you mean that the list is unnamed, that can be fixed by setting the names once at the end.

Second, using split on a huge data.frame produces one element per row and is terribly slow (due to the huge number of factor levels).

# testing on huge data.frame
system.time(t3 <- lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)]))
   user  system elapsed
517.545   0.312 517.872

Third, this orders the elements as 1, 10, 100, 1000, 10000, 100000, ... instead of 1 .. 1e5, because the row names are sorted as strings. Instead, one could use setNames or setattr (from the data.table package) to set the names just once at the end, as shown below:

# setting names just once
t2 <- setNames(t2, rownames(df)) # by copy

# or even better using `data.table` `setattr` function to 
# set names by reference
require(data.table)
tracemem(t2)
setattr(t2, 'names', rownames(df))
tracemem(t2)

Comparing the outputs doesn't show any other difference between the two (t3 and t2). You can run this to verify that they are the same (time consuming):

all(sapply(names(t2), function(x) all(t2[[x]] == t3[[x]])) == TRUE) # TRUE
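The ordering issue from the third point can be reproduced on a tiny data.frame. This is only a sketch; the data.frame `d` and the names `by_split` and `in_order` are made up for illustration.

```r
# 12 rows with automatic row names "1" .. "12"
d <- data.frame(v = c(5, 1, 3, 1, 1, 1, 1, 1, 1, 4, 1, 6))

# split() converts the row names to a factor whose levels are sorted as strings
by_split <- lapply(split(d, rownames(d)), function(x) names(x)[which(x > 2)])
names(by_split)

# Computing in row order and naming once at the end keeps the original order
in_order <- setNames(
  lapply(seq_len(nrow(d)),
         function(i) names(d)[which(d[i, , drop = FALSE] > 2)]),
  rownames(d)
)
names(in_order)
```

Both lists hold the same elements; only the order differs, which is why indexing `by_split` by `names(in_order)` recovers the row order.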

answered Oct 22 '22 13:10

Arun
