I've searched on SO trying to find a solution to no avail. So here it is. I have a data frame with many columns, some of which are numerical and should be non-negative. I want to clean the data since some values in these numerical columns are negative. What I can do now is extract the column names of these columns with a regular expression. But I am not sure how to implement the filtering of rows based on these columns.
To give an example, let's say:
library(dplyr)
df <- read.table(text =
"id sth1 tg1_num sth2 tg2_num others
1 dave 2 ca 35 new
2 tom 5 tn -3 old
3 jane -3 al 0 new
4 leroy 0 az 25 old
5 jerry 4 mi 55 old", header=TRUE)
pattern <- "_num$"
ind <- grep(pattern, colnames(df))
target_columns <- colnames(df)[ind]
df <- df %>% filter(target_columns >= 0) # it's is wrong, but it's what I want to do
What I want to get out from this filtering is the following:
id sth1 tg1_num sth2 tg2_num others
1 dave 2 ca 35 new
4 leroy 0 az 25 old
5 jerry 4 mi 55 old
where rows no. 2 and 3 are filtered out because at least one column in tg1_num and tg2_num for these rows contain negative numbers.
Here's a possible vectorized solution
ind <- grep("_num$", colnames(df))
df[!rowSums(df[ind] < 0),]
# id sth1 tg1_num sth2 tg2_num others
# 1 1 dave 2 ca 35 new
# 4 4 leroy 0 az 25 old
# 5 5 jerry 4 mi 55 old
The idea here is to create a logical matrix using the <
function (it is a generic function which has data.frame
method - which means it returns a data frame like structure back). Then, we are using rowSums
to find if there were any matched conditions (> 0 - matched, 0- not matched). Then, we are using the !
function in order to convert it to a logical vector: >0 becomes TRUE
, while 0 becomes FALSE
. Finally, we are subsetting according to that vector.
This is a very awkward use of dplyr
, but might be true to the spirit
> df %>% mutate(m = do.call(pmin, select(df, ends_with("_num"))))
id sth1 tg1_num sth2 tg2_num others m
1 1 dave 2 ca 35 new 2
2 2 tom 5 tn -3 old -3
3 3 jane -3 al 0 new -3
4 4 leroy 0 az 25 old 0
5 5 jerry 4 mi 55 old 4
From there you can add a filter(m >= 0)
to get the answer you want. If there were a rowMins
analogous to rowMeans
then that would simplify this significantly.
> rowMins <- function(df) { do.call(pmin, df) }
> df %>% mutate(m = rowMins(select(df, ends_with("_num"))))
id sth1 tg1_num sth2 tg2_num others m
1 1 dave 2 ca 35 new 2
2 2 tom 5 tn -3 old -3
3 3 jane -3 al 0 new -3
4 4 leroy 0 az 25 old 0
5 5 jerry 4 mi 55 old 4
I don't know how efficient this is, though. And nesting the select
seems real ugly.
EDIT3: Using ideas cribbed from other solutions/comments (h/t to @Vlo) I can speed mine up a lot (unfortunately, a similar optimization speeds up @Vlo's solution even more (EDIT4: Whoops, misread the chart, I am the fastest, ok, no more on this))
df %>% select(ends_with("_num")) %>% rowMins %>% {df[. >= 0,]}
EDIT: out of curiosity, did some microbenchmarking on some of the solutions (EDIT2: Added more solutions)
microbenchmark(rowmins(df), rowmins2(df), reducer(df), sapplyer(df), grepapply(df), tchotchke(df), withrowsums(df), reducer2(df))
Unit: microseconds
expr min lq mean median uq max
rowmins(df) 1373.452 1431.9700 1732.188 1576.043 1729.410 5147.847
rowmins2(df) 836.885 875.9900 1015.364 913.285 1038.729 2510.339
reducer(df) 990.096 1058.6645 1217.264 1201.159 1297.997 3103.809
sapplyer(df) 14119.236 14939.8755 16820.701 15952.057 16612.709 66023.721
grepapply(df) 12907.657 13686.2325 14517.140 14485.520 15146.294 17291.779
tchotchke(df) 2770.818 2939.6425 3114.233 3036.926 3172.325 4098.161
withrowsums(df) 1526.227 1627.8185 1819.220 1722.430 1876.360 3025.095
reducer2(df) 900.524 943.1265 1087.025 1003.820 1109.188 3869.993
And here are the definitions I used
rowmins <- function(df) {
df %>%
mutate(m = rowMins(select(df, ends_with("_num")))) %>%
filter(m >= 0) %>%
select(-m)
}
rowmins2 <- function(df) {
df %>% select(ends_with("_num")) %>% rowMins %>% {df[. >= 0,]}
}
reducer <- function(df) {
df %>%
select(matches("_num$")) %>%
lapply(">=", 0) %>%
Reduce(f = "&", .) %>%
which %>%
slice(.data = df)
}
reducer2 <- function(df) {
df %>%
select(matches("_num$")) %>%
lapply(">=", 0) %>%
Reduce(f = "&", .) %>%
{df[.,]}
}
sapplyer <- function(df) {
nums <- sapply(df, is.numeric)
df[apply(df[, nums], MARGIN=1, function(x) all(x >= 0)), ]
}
grepapply <- function(df) {
cond <- df[, grepl("_num$", colnames(df))] >= 0
df[apply(cond, 1, function(x) {prod(x) == 1}), ]
}
tchotchke <- function(df) {
pattern <- "_num$"
ind <- grep(pattern, colnames(df))
target_columns <- colnames(df)[ind]
desired_rows <- sapply(target_columns, function(x) which(df[,x]<0), simplify=TRUE)
as.vector(unique(unlist(desired_rows)))
}
withrowsums <- function(df) {
df %>% mutate(m=rowSums(select(df, ends_with("_num"))>0)) %>% filter(m==2) %>% select(-m)
}
df <- data.frame(id=1:10000, sth1=sample(LETTERS, 10000, replace=T), tg1_num=runif(10000,-1,1), tg2_num=runif(10000,-1, 1))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With