
Simple way to delete dataframe rows robust to instances where no rows match deletion criteria

Tags:

r

One common task in data manipulation in R is subsetting a data frame by removing rows that match certain criteria. However, the simple way to do this in R seems logically inconsistent and even dangerous to the inexperienced (like myself).

Let's say we have a data frame and we want to exclude rows that belong to the "G1" treatment:

Treatment=c("G1","G1","G1","G1","G1","G1","G2","G2","G2","G2","G2",
"G2","G3","G3","G3","G3","G3","G3")
Vals=c(runif(6),runif(6)+0.9,runif(6)-0.3)
data=data.frame(Treatment)
data=cbind(data, Vals)  

As expected, the code below removes the data frame rows that match the criterion on the first line:

to_del=which(data$Treatment=="G1")
new_data=data[-to_del,]
new_data

However, contrary to what I expected, if the 'which' command does not find ANY matching row, this code removes all rows instead of leaving them all alone:

to_del=which(data$Treatment=="G4")
new_data=data[-to_del,]
new_data

The code above results in a data frame with no rows left, which makes no sense to me (i.e., since R found no rows that match my criteria for deletion, it deleted all rows). My workaround does the job, but I would imagine there is a simpler way to do this without all of these conditional statements:

###WORKAROUND
to_del=which(data$Treatment=="G4") #no G4 treatment in this particular data frame
if (length(to_del)>0){
  new_data=data[-to_del,]  
}else{
  new_data=data
}
new_data

Does anyone have a simple way to do this that works even when no rows match the specified criteria?

Lucas Fortini asked Feb 15 '13 21:02



4 Answers

You've stumbled onto a common issue with using which. Use != instead.

new_data <- data[data$Treatment!="G4",]

The problem is that which returns integer(0) if all the elements are FALSE, and -integer(0) is still integer(0), so the index selects no rows. This would still be an issue even if which returned 0, because subsetting by zero also returns integer(0):

R> # subsetting by zero (positive or negative)
R> (1:3)[0]  # same as (1:3)[-0]
integer(0)

You will also run into issues if you subset by NA:

R> # subsetting by NA
R> (1:3)[NA]
[1] NA NA NA
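
A minimal sketch of the NA issue in the data frame setting, assuming a hypothetical toy data frame df (not the asker's data) with one missing Treatment value. Using %in% in place of != is a swapped-in alternative, not something the answer above proposes; %in% never returns NA, so the row with the missing treatment keeps its original values instead of becoming a row of NAs.

```r
# Hypothetical toy data with one missing Treatment value
df <- data.frame(Treatment = c("G1", "G2", NA, "G3"),
                 Vals = c(0.1, 0.2, 0.3, 0.4))

# != yields NA for the missing treatment, and indexing by NA
# produces a row in which every column is NA
res_neq <- df[df$Treatment != "G4", ]

# %in% returns FALSE (never NA) for the missing treatment,
# so negating it keeps that row with its values intact
res_in <- df[!(df$Treatment %in% "G4"), ]
```

Whether a row with a missing treatment should be kept or dropped depends on the analysis; the point is only that the two forms differ when NAs are present.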

Joshua Ulrich answered Oct 09 '22 02:10


Why not use subset?

subset(data,  ! rownames(data) %in% to_del )

(With the default row names, the positions returned by which coincide with the row names, so this matches the same rows as the data[-to_del, ] examples.) Of course, once that works you can go back to using just "[":

data[  ! rownames(data) %in% to_del , ]
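
As a sketch, subset can also take the condition itself rather than a vector of positions, which sidesteps which entirely. The toy data frame df below is a hypothetical stand-in for the question's data.

```r
# Hypothetical stand-in for the question's data frame
df <- data.frame(Treatment = rep(c("G1", "G2", "G3"), each = 2),
                 Vals = 1:6)

# Drop the rows that belong to an existing treatment
no_g1 <- subset(df, Treatment != "G1")

# No row matches "G4", so every row is kept
no_g4 <- subset(df, Treatment != "G4")
```

Note that subset treats an NA in the condition as FALSE, so rows with a missing Treatment would be dropped rather than returned as NA rows.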

IRTFM answered Oct 09 '22 01:10


I like to use data.table for subsetting, since it is more intuitive, shorter, and runs quicker with large data sets.

library(data.table)
data.dt<-as.data.table(data)
setkey(data.dt, Treatment)

data.dt[!"G1",]
##     Treatment        Vals
##  1:        G2  0.90264622
##  2:        G2  1.47842130
##  3:        G2  1.52494735
##  4:        G2  1.46373958
##  5:        G2  1.12850658
##  6:        G2  1.46705561
##  7:        G3  0.58451869
##  8:        G3 -0.20231228
##  9:        G3  0.52519475
## 10:        G3  0.62956475
## 11:        G3 -0.06655426
## 12:        G3  0.56814703

data.dt[!"G4",]
##    Treatment        Vals
## 1         G1  0.93411692
## 2         G1  0.60153972
## 3         G1  0.28147464
## 4         G1  0.97264924
## 5         G1  0.50804831
## 6         G1  0.48273876
## 7         G2  0.90264622
## 8         G2  1.47842130
## 9         G2  1.52494735
## 10        G2  1.46373958
## 11        G2  1.12850658
## 12        G2  1.46705561
## 13        G3  0.58451869
## 14        G3 -0.20231228
## 15        G3  0.52519475
## 16        G3  0.62956475
## 17        G3 -0.06655426
## 18        G3  0.56814703

Note that if you subset on a column that has not been set as the key, you need to use the column name in the subset (e.g. data.dt[Vals<0,]).

I think the creators of data.table may be working on a way to directly delete the rows from the original table, rather than having to copy all the non-deleted rows to a new table and then delete the original table. This will be a great help when you're running into memory limits.

dnlbrky answered Oct 09 '22 02:10


The issue is that you are not selecting which rows to DELETE; you are selecting which rows to KEEP. As you've found out, you can often interchange these concepts, but sometimes there are issues.

Specifically, when you use which you are asking R "which elements of this vector are true". However, when it finds none, it indicates this by returning integer(0).

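integer(0) is not an actual number but a zero-length vector, and hence taking the negative of integer(0) still gives integer(0). Indexing with it selects no rows, which is why you end up with an empty data frame.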

However, there is no need to use which, if you are going to simply use it to filter.

Instead, take the statement that you are passing to which and pass it directly as a filter to data[..]. Recall that you can use a logical vector as an index just as well as an integer vector.
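
A minimal sketch of that advice, assuming a data frame shaped like the one in the question (the Vals column here holds placeholder values, not the asker's random numbers):

```r
# Data shaped like the question's example; Vals are placeholders
Treatment <- rep(c("G1", "G2", "G3"), each = 6)
data <- data.frame(Treatment, Vals = seq_along(Treatment))

# which-based deletion: no match gives a zero-length index,
# and negating it still selects nothing
to_del <- which(data$Treatment == "G4")
empty <- data[-to_del, ]

# Logical filter: TRUE marks the rows to KEEP
keep <- data$Treatment != "G4"
new_data <- data[keep, ]

# The same form works when rows do match
no_g1 <- data[data$Treatment != "G1", ]
```

Here empty has no rows, new_data keeps all 18 rows, and no_g1 keeps the 12 rows from G2 and G3.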

Ricardo Saporta answered Oct 09 '22 01:10