I have a data.frame with several columns and want to filter low-frequency data according to the combination of the variables. The example is like having Male/Female in Sex variable and High/Low in Cholesterol variable. Then my data frame would be like:
set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df
index Sex Age
1 1 Male High
2 2 Female High
3 3 Male High
4 4 Female High
5 5 Female High
6 6 Male High
7 7 Female High
8 8 Female High
9 9 Female Low
10 10 Male Low
11 11 Female High
12 12 Male High
13 13 Female High
14 14 Female High
15 15 Male Low
16 16 Female Low
17 17 Male High
18 18 Male Low
19 19 Male Low
20 20 Female Low
Now I want to filter the combination of Sex/Age where the frequency is higher than 3
table(df[,2:3])
Age
Sex High Low
Female 8 3
Male 5 4
Other words, I want to keep the indices for female-high, male-low and male-high.
Notice that 1) my data frame has several variables (not like the example above) and 2) I do not want to use any third R package and 3) I want it to be fast.
Here's a simple approach in base R:
lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
# index Sex Age
#1 1 Male High
#2 2 Female High
#3 3 Male High
#4 4 Female High
#5 5 Female High
#6 6 Male High
#7 7 Female High
#8 8 Female High
#10 10 Male Low
#11 11 Female High
#12 12 Male High
#13 13 Female High
#14 14 Female High
#15 15 Male Low
#17 17 Male High
#18 18 Male Low
#19 19 Male Low
If you have a larger number of variables, you can store them in a vector:
vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
And here's a second base R approach using ave
:
subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)
OK, here is a Base-R option
set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df
merge(
df
, aggregate(rep(1, nrow(df)), by = df[,c("Sex", "Age")], sum)
, by = c("Sex", "Age")
)
The aggregate function sum
s up all the 1
s for all the combinations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With