Simple way to delete dataframe rows robust to instances where no rows match deletion criteria

Tags:

r

One common task in data manipulation in R is subsetting a data frame by removing rows that match a certain criterion. However, the simple way to do this in R seems logically inconsistent and even dangerous to the inexperienced (like myself).

Let's say we have a data frame and we want to exclude rows that belong to the "G1" treatment:

Treatment=c("G1","G1","G1","G1","G1","G1","G2","G2","G2","G2","G2",
"G2","G3","G3","G3","G3","G3","G3")
Vals=c(runif(6),runif(6)+0.9,runif(6)-0.3)
data=data.frame(Treatment)
data=cbind(data, Vals)

As expected, the code below removes the data frame rows that match the criterion on the first line:

to_del=which(data$Treatment=="G1")
new_data=data[-to_del,]
new_data

However, contrary to what I expected, if the 'which' command does not find ANY matching row, this approach removes all rows instead of leaving them all alone:

to_del=which(data$Treatment=="G4")
new_data=data[-to_del,]
new_data

The code above results in a data frame with no rows left, which makes no sense to me (i.e., since R found no rows that match my criterion for deletion, it deleted all rows). My workaround does the job, but I would imagine there is a simpler way to do this without all of these conditional statements:

###WORKAROUND
to_del=which(data$Treatment=="G4") #no G4 treatment in this particular data frame
if (length(to_del)>0){
  new_data=data[-to_del,]  
}else{
  new_data=data
}
new_data

Does anyone have a simple way to do this that works even when no rows match specified criteria?


asked Feb 15 '13 21:02

Lucas Fortini

4 Answers

You've stumbled onto a common issue with using which. Use != instead.

new_data <- data[data$Treatment!="G4",]

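The problem is that which returns integer(0) if all the elements are FALSE. Negating an empty vector leaves it empty, and indexing with an empty vector selects nothing, so data[-to_del,] has no rows. This would still be an issue even if which returned 0, because subsetting by zero also returns integer(0):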

R> # subsetting by zero (positive or negative)
R> (1:3)[0]  # same as (1:3)[-0]
integer(0)
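The same behaviour can be checked on a data frame. This is a minimal sketch using a small toy data frame (df and its values are made up for illustration; only the column names follow the question):

```r
# Toy data frame; no row has Treatment "G4".
df <- data.frame(Treatment = rep(c("G1", "G2", "G3"), each = 2),
                 Vals = 1:6)

to_del <- which(df$Treatment == "G4")   # no match, so this is integer(0)

# Negating an empty integer vector gives back an empty integer vector.
is_same <- identical(-to_del, integer(0))

# An empty index selects zero rows, so every row is dropped.
rows_left <- nrow(df[-to_del, ])
```

Here is_same should be TRUE and rows_left should be 0, which is the surprising result from the question.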

You will also run into issues if you subset by NA:

R> # subsetting by NA
R> (1:3)[NA]
[1] NA NA NA
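For data frames this matters when the column being compared contains missing values. A small sketch, assuming a toy data frame df with one NA treatment (not part of the original question), and using %in% as one possible NA-safe alternative to !=:

```r
# Toy data frame with a missing Treatment in the second row.
df <- data.frame(Treatment = c("G1", NA, "G2"),
                 Vals = c(0.1, 0.2, 0.3))

# != returns NA for the missing value; an NA in a logical index
# produces a row filled with NA in the result.
with_neq <- df[df$Treatment != "G1", ]

# %in% never returns NA, so the row with the missing Treatment
# is kept as it is.
with_in <- df[!(df$Treatment %in% "G1"), ]
```

Both results have two rows, but the first row of with_neq is all NA, while with_in keeps the original values 0.2 and 0.3. Whether rows with a missing treatment should be kept or dropped depends on the analysis, so treat this as one option rather than the rule.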


answered Oct 09 '22 02:10

Joshua Ulrich

Why not use subset?

subset(data,  ! rownames(data) %in% to_del )

(With default row names, the positions in to_del coincide with the row names, so this selects the same rows as the data[-to_del, ] examples when there is a match.) Of course, once that works you can go back to using just "[":

data[  ! rownames(data) %in% to_del , ]
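Matching on row names relies on the data frame still having its default row names (1, 2, 3, ...). If that is in doubt, a purely positional variant can be sketched with setdiff. The helper name drop_rows and the toy data frame df are hypothetical, introduced only for this illustration:

```r
# Toy data frame; no row has Treatment "G4".
df <- data.frame(Treatment = rep(c("G1", "G2", "G3"), each = 2),
                 Vals = 1:6)

# Keep every row position that is not listed in to_del.
# With an empty to_del, setdiff returns all positions.
drop_rows <- function(d, to_del) {
  keep <- setdiff(seq_len(nrow(d)), to_del)
  d[keep, , drop = FALSE]
}

none_dropped <- drop_rows(df, which(df$Treatment == "G4"))
g1_dropped   <- drop_rows(df, which(df$Treatment == "G1"))
```

With this data, none_dropped keeps all six rows and g1_dropped keeps the four rows that are not "G1".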

answered Oct 09 '22 01:10

IRTFM

I like to use data.table for subsetting, since it is more intuitive, shorter, and runs quicker with large data sets.

library(data.table)
data.dt<-as.data.table(data)
setkey(data.dt, Treatment)

data.dt[!"G1",]
##     Treatment        Vals
##  1:        G2  0.90264622
##  2:        G2  1.47842130
##  3:        G2  1.52494735
##  4:        G2  1.46373958
##  5:        G2  1.12850658
##  6:        G2  1.46705561
##  7:        G3  0.58451869
##  8:        G3 -0.20231228
##  9:        G3  0.52519475
## 10:        G3  0.62956475
## 11:        G3 -0.06655426
## 12:        G3  0.56814703

data.dt[!"G4",]
##    Treatment        Vals
## 1         G1  0.93411692
## 2         G1  0.60153972
## 3         G1  0.28147464
## 4         G1  0.97264924
## 5         G1  0.50804831
## 6         G1  0.48273876
## 7         G2  0.90264622
## 8         G2  1.47842130
## 9         G2  1.52494735
## 10        G2  1.46373958
## 11        G2  1.12850658
## 12        G2  1.46705561
## 13        G3  0.58451869
## 14        G3 -0.20231228
## 15        G3  0.52519475
## 16        G3  0.62956475
## 17        G3 -0.06655426
## 18        G3  0.56814703

Note that if you subset on a column that has not been set as the key, then you need to use the column name in the subset (e.g. data.dt[Vals<0,]).

I think the creators of data.table may be working on a way to directly delete the rows from the original table, rather than having to copy all the non-deleted rows to a new table and then delete the original table. This will be a great help when you're running into memory limits.

answered Oct 09 '22 02:10

dnlbrky

The issue is that you are not selecting which rows to DELETE; you are selecting which rows to KEEP. As you've found out, you can often interchange these concepts, but sometimes there are issues.

Specifically, when you use which you are asking R "which elements of this vector are true". However, when it finds none, it indicates this by returning integer(0).

integer(0) is not an actual number but an integer vector of length zero, and hence taking the negative of integer(0) still gives integer(0). Indexing with that empty vector keeps no rows.

However, there is no need to use which, if you are going to simply use it to filter.

Instead, take the statement that you are passing to which and pass it directly as a filter to data[..]. Recall that you can use a logical vector as an index just as well as an integer vector.
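As a minimal sketch of that idea (the toy data frame df is made up for illustration; only the column names follow the question), negate the condition and use it directly as the row index:

```r
# Toy data frame; no row has Treatment "G4".
df <- data.frame(Treatment = rep(c("G1", "G2", "G3"), each = 2),
                 Vals = 1:6)

# The condition that used to go into which(), negated to mean "keep".
# When nothing matches, the index is all TRUE and every row is kept.
no_g4 <- df[!(df$Treatment == "G4"), ]

# When some rows match, only those rows are dropped.
no_g1 <- df[!(df$Treatment == "G1"), ]
```

Here no_g4 has all six rows and no_g1 has four, so the no-match case needs no special handling.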

answered Oct 09 '22 01:10

Ricardo Saporta
