The efficient way of filtering low frequency data in data frame in R

Question

I have a data.frame with several columns and want to filter low-frequency data according to the combination of the variables. The example is like having Male/Female in Sex variable and High/Low in Cholesterol variable. Then my data frame would be like:

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df


  index    Sex  Age
1      1   Male High
2      2 Female High
3      3   Male High
4      4 Female High
5      5 Female High
6      6   Male High
7      7 Female High
8      8 Female High
9      9 Female  Low
10    10   Male  Low
11    11 Female High
12    12   Male High
13    13 Female High
14    14 Female High
15    15   Male  Low
16    16 Female  Low
17    17   Male High
18    18   Male  Low
19    19   Male  Low
20    20 Female  Low

Now I want to filter the combination of Sex/Age where the frequency is higher than 3

table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4

Other words, I want to keep the indices for female-high, male-low and male-high.

Notice that 1) my data frame has several variables (not like the example above) and 2) I do not want to use any third R package and 3) I want it to be fast.

talat · Accepted Answer

Here's a simple approach in base R:

lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]

#   index    Sex  Age
#1      1   Male High
#2      2 Female High
#3      3   Male High
#4      4 Female High
#5      5 Female High
#6      6   Male High
#7      7 Female High
#8      8 Female High
#10    10   Male  Low
#11    11 Female High
#12    12   Male High
#13    13 Female High
#14    14 Female High
#15    15   Male  Low
#17    17   Male High
#18    18   Male  Low
#19    19   Male  Low

If you have a larger number of variables, you can store them in a vector:

vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]

And here's a second base R approach using ave:

subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)

Georgery · Answer

OK, here is a Base-R option

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df

merge(
    df
    , aggregate(rep(1, nrow(df)), by = df[,c("Sex", "Age")], sum)
    , by = c("Sex", "Age")
)

The aggregate function sums up all the 1s for all the combinations.

The efficient way of filtering low frequency data in data frame in R

Tags:

r

MyQ

2 Answers

talat

Georgery

Recent Activity

Donate For Us

The efficient way of filtering low frequency data in data frame in R

Tags:

r

MyQ

2 Answers

talat

Georgery

Related questions

Recent Activity

Donate For Us