Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The efficient way of filtering low frequency data in data frame in R

Tags:

r

I have a data.frame with several columns and want to filter low-frequency data according to the combination of the variables. The example is like having Male/Female in Sex variable and High/Low in Cholesterol variable. Then my data frame would be like:

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df


  index    Sex  Age
1      1   Male High
2      2 Female High
3      3   Male High
4      4 Female High
5      5 Female High
6      6   Male High
7      7 Female High
8      8 Female High
9      9 Female  Low
10    10   Male  Low
11    11 Female High
12    12   Male High
13    13 Female High
14    14 Female High
15    15   Male  Low
16    16 Female  Low
17    17   Male High
18    18   Male  Low
19    19   Male  Low
20    20 Female  Low

Now I want to filter the combination of Sex/Age where the frequency is higher than 3

table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4

Other words, I want to keep the indices for female-high, male-low and male-high.

Notice that 1) my data frame has several variables (not like the example above) and 2) I do not want to use any third R package and 3) I want it to be fast.

like image 627
MyQ Avatar asked Dec 01 '22 10:12

MyQ


2 Answers

Here's a simple approach in base R:

lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]

#   index    Sex  Age
#1      1   Male High
#2      2 Female High
#3      3   Male High
#4      4 Female High
#5      5 Female High
#6      6   Male High
#7      7 Female High
#8      8 Female High
#10    10   Male  Low
#11    11 Female High
#12    12   Male High
#13    13 Female High
#14    14 Female High
#15    15   Male  Low
#17    17   Male High
#18    18   Male  Low
#19    19   Male  Low

If you have a larger number of variables, you can store them in a vector:

vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]

And here's a second base R approach using ave:

subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)
like image 66
talat Avatar answered Dec 04 '22 12:12

talat


OK, here is a Base-R option

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df

merge(
    df
    , aggregate(rep(1, nrow(df)), by = df[,c("Sex", "Age")], sum)
    , by = c("Sex", "Age")
)

The aggregate function sums up all the 1s for all the combinations.

like image 35
Georgery Avatar answered Dec 04 '22 13:12

Georgery