Suppose I have a data frame (df) that looks like this:
options(stringsAsFactors = FALSE)
cars <- c("Car1", "Car2", "Car3", "Car4", "Car5", "Car6", "Car7", "Car8", "Car9")
test1 <- c(0,0,3,1,4,2,1,3,0)
test2 <- c(0,0,2,1,0,2,2,5,0)
test3 <- c(1,0,5,1,2,2,6,7,0)
test4 <- c(2,NA,2,1,2,2,1,1,0)
test5 <- c(0,0,1,1,0,2,1,3,0)
test6 <- c(1,0,1,1,1,2,3,4,0)
test7 <- c(3,0,2,1,0,2,1,1,0)
df <- data.frame(cars,test1,test2,test3,test4,test5,test6,test7)
# df
#   cars test1 test2 test3 test4 test5 test6 test7
#1 Car1 0 0 1 2 0 1 3
#2 Car2 0 0 0 NA 0 0 0
#3 Car3 3 2 5 2 1 1 2
#4 Car4 1 1 1 1 1 1 1
#5 Car5 4 0 2 2 0 1 0
#6 Car6 2 2 2 2 2 2 2
#7 Car7 1 2 6 1 1 3 1
#8 Car8 3 5 7 1 3 4 1
#9 Car9 0 0 0 0 0 0 0
I want to remove any rows that have the same value throughout the entire row (in the example above, I would like to keep rows 1, 3, 5, 7, 8 and remove the rest).
I've figured out how to identify the rows whose values are all zero:
df$sum <- rowSums(df[, 2:8], na.rm = TRUE)
df.all0 <- df[which(df$sum == 0),]
However, this doesn't necessarily work for all the other rows. Unlike other questions, this question asks to look for duplicates across the entire row, not just specific columns.
Any help would be greatly appreciated!
You can use apply() to keep only the rows that contain more than one distinct non-NA value:
keep <- apply(df[2:8], 1, function(x) length(unique(x[!is.na(x)])) != 1)
df[keep, ]
cars test1 test2 test3 test4 test5 test6 test7
1 Car1 0 0 1 2 0 1 3
3 Car3 3 2 5 2 1 1 2
5 Car5 4 0 2 2 0 1 0
7 Car7 1 2 6 1 1 3 1
8 Car8 3 5 7 1 3 4 1
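The apply() call above works row by row: it drops the NA values, counts the distinct values that remain, and keeps the row unless exactly one distinct value is left. A minimal sketch of the same idea as a reusable function follows. The helper name `is_constant_row` is hypothetical, and treating an all-NA row as constant is an assumption (the original one-liner would keep such a row). Selecting the columns by name rather than by position also avoids picking up the extra `sum` column created in the question.

```r
# Rebuild the example data from the question.
df <- data.frame(
  cars  = paste0("Car", 1:9),
  test1 = c(0, 0, 3, 1, 4, 2, 1, 3, 0),
  test2 = c(0, 0, 2, 1, 0, 2, 2, 5, 0),
  test3 = c(1, 0, 5, 1, 2, 2, 6, 7, 0),
  test4 = c(2, NA, 2, 1, 2, 2, 1, 1, 0),
  test5 = c(0, 0, 1, 1, 0, 2, 1, 3, 0),
  test6 = c(1, 0, 1, 1, 1, 2, 3, 4, 0),
  test7 = c(3, 0, 2, 1, 0, 2, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when the non-NA values of a row are all identical.
# Assumption: a row made up entirely of NA is also treated as constant.
is_constant_row <- function(x) {
  length(unique(x[!is.na(x)])) <= 1
}

# Pick the test columns by name instead of by position.
test_cols <- grep("^test", names(df))

# Apply the helper to every row and keep the rows that are not constant.
constant <- apply(df[test_cols], 1, is_constant_row)
result <- df[!constant, ]
result$cars
```

On the example data this keeps Car1, Car3, Car5, Car7 and Car8, the same rows as the one-liner.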
We can also use Map with Reduce:
df[c(Reduce(`+`, Map(function(x,y) x != y & !is.na(x), df[-1], list(df[2]))) != 0),]
# cars test1 test2 test3 test4 test5 test6 test7
#1 Car1 0 0 1 2 0 1 3
#3 Car3 3 2 5 2 1 1 2
#5 Car5 4 0 2 2 0 1 0
#7 Car7 1 2 6 1 1 3 1
#8 Car8 3 5 7 1 3 4 1
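The Map/Reduce call compares every test column with the first one and counts, per row, how many comparisons differ. As a sketch of the same comparison without Map or Reduce, the columns can be compared as a matrix and counted with rowSums(). Like the version above, this assumes the first test column contains no NA; the variable names are illustrative only.

```r
# Rebuild the example data from the question.
df <- data.frame(
  cars  = paste0("Car", 1:9),
  test1 = c(0, 0, 3, 1, 4, 2, 1, 3, 0),
  test2 = c(0, 0, 2, 1, 0, 2, 2, 5, 0),
  test3 = c(1, 0, 5, 1, 2, 2, 6, 7, 0),
  test4 = c(2, NA, 2, 1, 2, 2, 1, 1, 0),
  test5 = c(0, 0, 1, 1, 0, 2, 1, 3, 0),
  test6 = c(1, 0, 1, 1, 1, 2, 3, 4, 0),
  test7 = c(3, 0, 2, 1, 0, 2, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Numeric matrix holding only the test columns.
tests <- as.matrix(df[startsWith(names(df), "test")])

# Compare each column with the first one. The vector tests[, 1] has one
# element per row, so it is recycled down every column of the matrix.
differs <- tests != tests[, 1]

# Count the differences per row, ignoring comparisons that involve NA,
# and keep the rows where at least one value differs.
keep <- rowSums(differs, na.rm = TRUE) > 0
result_vec <- df[keep, ]
result_vec$cars
```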
Or using tidyverse:
library(tidyverse)
df %>%
  filter_at(vars(starts_with("test")), any_vars(. != test1))
# cars test1 test2 test3 test4 test5 test6 test7
#1 Car1 0 0 1 2 0 1 3
#2 Car3 3 2 5 2 1 1 2
#3 Car5 4 0 2 2 0 1 0
#4 Car7 1 2 6 1 1 3 1
#5 Car8 3 5 7 1 3 4 1
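In dplyr 1.0.0 and later, filter_at() and any_vars() are superseded, and if_any() (available from dplyr 1.0.4) is the suggested replacement. The following is a sketch of an equivalent filter under that assumption. It relies on the same behaviour as the version above: a comparison involving NA yields NA, and filter() drops rows whose condition is NA, so Car2 is still removed.

```r
suppressPackageStartupMessages(library(dplyr))

# Rebuild the example data from the question.
df <- data.frame(
  cars  = paste0("Car", 1:9),
  test1 = c(0, 0, 3, 1, 4, 2, 1, 3, 0),
  test2 = c(0, 0, 2, 1, 0, 2, 2, 5, 0),
  test3 = c(1, 0, 5, 1, 2, 2, 6, 7, 0),
  test4 = c(2, NA, 2, 1, 2, 2, 1, 1, 0),
  test5 = c(0, 0, 1, 1, 0, 2, 1, 3, 0),
  test6 = c(1, 0, 1, 1, 1, 2, 3, 4, 0),
  test7 = c(3, 0, 2, 1, 0, 2, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Keep a row if any test column differs from test1.
result_dplyr <- df %>%
  filter(if_any(starts_with("test"), ~ .x != test1))

result_dplyr$cars
```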