One common task in data manipulation in R is subsetting a data frame by removing rows that match a certain criterion. However, the simple way to do this in R seems logically inconsistent and even dangerous to the inexperienced (like myself).
Let's say we have a data frame and we want to exclude rows that belong to the "G1" treatment:
Treatment=c("G1","G1","G1","G1","G1","G1","G2","G2","G2","G2","G2",
"G2","G3","G3","G3","G3","G3","G3")
Vals=c(runif(6),runif(6)+0.9,runif(6)-0.3)
data=data.frame(Treatment)
data=cbind(data, Vals)
As expected, the code below removes the data frame rows that match the criterion in its first line:
to_del=which(data$Treatment=="G1")
new_data=data[-to_del,]
new_data
However, contrary to what I expected, if the 'which' command does not find ANY matching row, this approach removes all rows instead of leaving them all alone:
to_del=which(data$Treatment=="G4")
new_data=data[-to_del,]
new_data
The code above results in a data frame with no rows left, which makes no sense to me (i.e., R found no rows that match my criterion for deletion, so it deleted all rows). My workaround does the job, but I would imagine there is a simpler way to do this without all of these conditional statements:
###WORKAROUND
to_del=which(data$Treatment=="G4") #no G4 treatment in this particular data frame
if (length(to_del)>0){
new_data=data[-to_del,]
}else{
new_data=data
}
new_data
Does anyone have a simple way to do this that works even when no rows match specified criteria?
To remove all rows containing NA, we can use the na.omit function. For example, if we have a data frame called df that contains some NA values, then we can remove all rows that contain at least one NA with the command na.omit(df).
We can also use the subset() function if we want to drop rows based on a condition. If we prefer to work with the Tidyverse, we can use the filter() function from dplyr to remove (or select) rows based on values in a column (conditionally, that is, in the same way as subset).
In Python's pandas (not R), a row is deleted from a DataFrame with the drop() method, passing the index label as the parameter.
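As a minimal sketch of the two base R functions mentioned above, na.omit() and subset(), here is a small example. The data frame df and its values are made up for illustration only.

```r
# Hypothetical example data; df and its values are made up for illustration.
df <- data.frame(
  Treatment = c("G1", "G2", "G2", "G3"),
  Vals      = c(0.5, NA, 1.2, 0.1)
)

# na.omit() drops every row that contains at least one NA.
complete_rows <- na.omit(df)

# subset() keeps the rows for which the condition is TRUE.
not_g1 <- subset(df, Treatment != "G1")

# No row has the "G4" treatment, so every row is kept.
not_g4 <- subset(df, Treatment != "G4")
```

Note that subset() keeps all rows when nothing matches the value being excluded, which is the behaviour the question is after.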
You've stumbled onto a common issue with using which. Use != instead.
new_data <- data[data$Treatment!="G4",]
The problem is that which returns integer(0) if all the elements are FALSE, and -integer(0) is still integer(0), so the index selects nothing. This would still be an issue even if which returned 0, because subsetting by zero also returns integer(0):
R> # subsetting by zero (positive or negative)
R> (1:3)[0] # same as (1:3)[-0]
integer(0)
You will also run into issues if you subset by NA:
R> # subsetting by NA
R> (1:3)[NA]
[1] NA NA NA
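To make the mechanism concrete, here is a small sketch. The vector x and the data frame d are made up for illustration. It shows that negating an empty index vector leaves it empty, and that a != comparison on a column containing NA produces a row of NAs, which %in% avoids.

```r
x <- 1:3

# which() finds no TRUE elements, so it returns an empty integer vector.
idx <- which(x > 10)

# Negating an empty vector is still an empty vector, so x[-idx] selects nothing.
identical(-idx, integer(0))
length(x[-idx])

# Hypothetical data frame with a missing treatment label.
d <- data.frame(Treatment = c("G1", NA, "G2"), Vals = c(1, 2, 3))

# != returns NA for the missing label, which yields a row filled with NA.
with_na_row <- d[d$Treatment != "G1", ]

# %in% never returns NA, so the row with the missing label is kept as-is.
keep <- !(d$Treatment %in% "G1")
without_g1 <- d[keep, ]
```

Whether rows with a missing label should be kept or dropped depends on the analysis; the point is only that the two forms behave differently when NA is present.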
Why not use subset?
subset(data, ! rownames(data) %in% to_del )
(For a data frame with default row names, the positions in to_del coincide with the row names, so this drops the same rows as the data[-to_del, ] examples.)
Of course, once that works you can go back to using just "[":
data[ ! rownames(data) %in% to_del , ]
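A short sketch of why the %in% form is safe when to_del is empty. The data frame d below is a made-up stand-in for the one in the question.

```r
# Made-up stand-in for the data frame in the question.
d <- data.frame(
  Treatment = rep(c("G1", "G2", "G3"), each = 2),
  Vals      = c(0.1, 0.2, 1.1, 1.2, -0.1, -0.2)
)

# Positions to delete; empty here because there is no "G4" treatment.
to_del <- which(d$Treatment == "G4")

# %in% against an empty vector is FALSE everywhere, so all rows are kept.
kept_all <- subset(d, !(rownames(d) %in% to_del))

# With a real match, only the matching rows are dropped.
to_del_g1 <- which(d$Treatment == "G1")
kept_some <- d[!(rownames(d) %in% to_del_g1), ]
```

This relies on the default row names "1", "2", ... matching row positions. After reordering or earlier subsetting they may no longer match, in which case comparing seq_len(nrow(d)) against to_del is a position-based alternative.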
I like to use data.table for subsetting, since it is more intuitive, shorter, and runs quicker with large data sets.
library(data.table)
data.dt<-as.data.table(data)
setkey(data.dt, Treatment)
data.dt[!"G1",]
## Treatment Vals
## 1: G2 0.90264622
## 2: G2 1.47842130
## 3: G2 1.52494735
## 4: G2 1.46373958
## 5: G2 1.12850658
## 6: G2 1.46705561
## 7: G3 0.58451869
## 8: G3 -0.20231228
## 9: G3 0.52519475
## 10: G3 0.62956475
## 11: G3 -0.06655426
## 12: G3 0.56814703
data.dt[!"G4",]
## Treatment Vals
## 1 G1 0.93411692
## 2 G1 0.60153972
## 3 G1 0.28147464
## 4 G1 0.97264924
## 5 G1 0.50804831
## 6 G1 0.48273876
## 7 G2 0.90264622
## 8 G2 1.47842130
## 9 G2 1.52494735
## 10 G2 1.46373958
## 11 G2 1.12850658
## 12 G2 1.46705561
## 13 G3 0.58451869
## 14 G3 -0.20231228
## 15 G3 0.52519475
## 16 G3 0.62956475
## 17 G3 -0.06655426
## 18 G3 0.56814703
Note that if you subset on a column that has not been set as the key, then you need to use the column name in the subset (e.g. data.dt[Vals<0,]).
I think the creators of data.table may be working on a way to directly delete the rows from the original table, rather than having to copy all the non-deleted rows to a new table and then delete the original table. This will be a great help when you're running into memory limits.
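For the unkeyed case mentioned above, a minimal sketch of excluding a treatment by naming the column in the condition. It assumes the data.table package is installed, and the table dt is a made-up stand-in for data.dt.

```r
library(data.table)

# Made-up stand-in for the data.table in the answer.
dt <- data.table(
  Treatment = rep(c("G1", "G2", "G3"), each = 2),
  Vals      = c(0.1, 0.2, 1.1, 1.2, -0.1, -0.2)
)

# No key is set, so the column is named explicitly in the condition.
no_g1 <- dt[Treatment != "G1"]

# No rows have the "G4" treatment, so all rows are kept.
no_g4 <- dt[Treatment != "G4"]
```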
The issue is that you are not selecting which rows to DELETE; you are selecting which rows to KEEP. As you've found out, you can often interchange these concepts, but sometimes there are issues.
Specifically, when you use which you are asking R "which elements of this vector are true". However, when it finds none, it indicates this by returning integer(0).
integer(0) is an empty vector rather than an actual number, so taking the negative of integer(0) still gives integer(0), and indexing with an empty vector selects no rows.
However, there is no need to use which if you are simply going to use it to filter.
Instead, take the statement that you are passing to which and pass it directly as a filter to data[..]. Recall that you can use a logical vector as an index just as well as an integer vector.
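The advice above can be sketched as follows. The data frame d is a made-up stand-in for the one in the question, and the setdiff() variant is one possible alternative (not part of the original answers) for cases where you only have row positions to delete.

```r
# Made-up stand-in for the data frame in the question.
d <- data.frame(
  Treatment = rep(c("G1", "G2", "G3"), each = 2),
  Vals      = c(0.1, 0.2, 1.1, 1.2, -0.1, -0.2)
)

# Logical index: TRUE marks the rows to keep. Nothing is "G4", so all are TRUE.
keep <- d$Treatment != "G4"
new_d <- d[keep, ]

# If all you have is positions to delete, remove them from the full set of
# positions; an empty to_del then leaves every position in place.
to_del <- which(d$Treatment == "G4")
new_d2 <- d[setdiff(seq_len(nrow(d)), to_del), ]

# The same pattern with rows that do match.
to_del_g1 <- which(d$Treatment == "G1")
new_d3 <- d[setdiff(seq_len(nrow(d)), to_del_g1), ]
```

Both forms describe the rows to keep, so an empty match leaves the data frame unchanged instead of emptying it.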