Filter top n largest groups in data.frame

For the example data:

set.seed(2222)
example_data <- data.frame(col1 = 1:15,
                           col2 = 16:30, 
                           group = sample(1:3, 15, replace = TRUE))

   col1 col2 group
1     1   16     2
2     2   17     1
3     3   18     3
4     4   19     2
5     5   20     3
6     6   21     1
7     7   22     3
8     8   23     1
9     9   24     3
10   10   25     1
11   11   26     2
12   12   27     2
13   13   28     2
14   14   29     3
15   15   30     3

I want to keep only the rows that belong to the n groups with the most records.

For example, to get the top 2 groups, which in this data are groups 3 and 2 (see the counts below):

library(dplyr)

example_data %>% 
  group_by(group) %>% 
  summarise(n = n())

# A tibble: 3 x 2
  group     n
  <int> <int>
1     1     4
2     2     5
3     3     6

The expected output is:

   col1 col2 group
1     1   16     2
2     3   18     3
3     4   19     2
4     5   20     3
5     7   22     3
6     9   24     3
7    11   26     2
8    12   27     2
9    13   28     2
10   14   29     3
11   15   30     3

We can use table() to count the rows in each group, sort the counts in decreasing order, take the names of the top 2 entries, and keep the rows whose group matches one of them.

library(dplyr)

example_data %>%
   filter(group %in% names(sort(table(group), decreasing = TRUE)[1:2]))


#   col1 col2 group
#1     1   16     2
#2     3   18     3
#3     4   19     2
#4     5   20     3
#5     7   22     3
#6     9   24     3
#7    11   26     2
#8    12   27     2
#9    13   28     2
#10   14   29     3
#11   15   30     3

You can also use the same condition directly in base R with subset():

subset(example_data, group %in% names(sort(table(group), decreasing = TRUE)[1:2]))
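To reuse this for any n, the same table/sort idea can be wrapped in a small function. This is only a sketch: keep_top_groups is a hypothetical helper name, not part of base R, and the data frame is rebuilt from the printed table so the result does not depend on the random number generator. Ties at the cutoff are broken by whatever order sort() returns.

```r
# Example data, with group values copied from the table printed in the question.
example_data <- data.frame(
  col1  = 1:15,
  col2  = 16:30,
  group = c(2L, 1L, 3L, 2L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 2L, 2L, 3L, 3L)
)

# Hypothetical helper: keep the rows of `df` whose value in column `by`
# is one of the `n` most frequent values of that column.
keep_top_groups <- function(df, by, n) {
  counts <- sort(table(df[[by]]), decreasing = TRUE)
  # Cap n at the number of groups so we never index past the end.
  top <- names(counts)[seq_len(min(n, length(counts)))]
  # names() returns character, so compare against character group labels.
  df[as.character(df[[by]]) %in% top, , drop = FALSE]
}

top2 <- keep_top_groups(example_data, "group", 2)
print(top2)
```

Rows keep their original order and row names, because the function only subsets the data frame.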

We can also use tidyverse methods for this. Create a frequency column with add_count(), arrange by that column, and keep the rows whose group is among the last two unique group values. Note that arrange() reorders the rows by group size:

library(dplyr)
example_data %>% 
   add_count(group) %>% 
   arrange(n) %>%
   filter(group %in% tail(unique(group), 2)) %>%
   select(-n)
# A tibble: 11 x 3
#    col1  col2 group
#  <int> <int> <int>
# 1     1    16     2
# 2     4    19     2
# 3    11    26     2
# 4    12    27     2
# 5    13    28     2
# 6     3    18     3
# 7     5    20     3
# 8     7    22     3
# 9     9    24     3
#10    14    29     3
#11    15    30     3
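If the original row order matters, as in the expected output of the question, one possible variant is to compute the top groups separately and filter with semi_join(). This is a sketch rather than the answer's method; it assumes dplyr 1.0.0 or later (for slice_max()), and the names size, top_groups and result are illustrative.

```r
library(dplyr)

# Example data, with group values copied from the table printed in the question.
example_data <- data.frame(
  col1  = 1:15,
  col2  = 16:30,
  group = c(2L, 1L, 3L, 2L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 2L, 2L, 3L, 3L)
)

# One row per group with its size, then keep the 2 largest groups.
# with_ties = FALSE guarantees exactly 2 groups even if sizes are tied.
top_groups <- example_data %>%
  count(group, name = "size") %>%
  slice_max(size, n = 2, with_ties = FALSE)

# semi_join() keeps the rows of example_data that have a match in top_groups,
# without adding columns and without reordering the rows.
result <- example_data %>%
  semi_join(top_groups, by = "group")

print(result)
```

Setting with_ties = TRUE instead would keep every group tied at the cutoff, which may return more than 2 groups.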

Or using data.table

library(data.table)
setDT(example_data)[group %in% example_data[, .N, group][order(-N), head(group, 2)]]
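Note that setDT() converts example_data to a data.table by reference. If that side effect is not wanted, a two-step version on a copy might look like the sketch below; the names dt, top and result are illustrative.

```r
library(data.table)

# Example data, with group values copied from the table printed in the question.
example_data <- data.frame(
  col1  = 1:15,
  col2  = 16:30,
  group = c(2L, 1L, 3L, 2L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 2L, 2L, 3L, 3L)
)

# as.data.table() makes a copy, so example_data itself stays a plain data.frame.
dt <- as.data.table(example_data)

# Count rows per group, order by descending count, take the first 2 group ids.
top <- dt[, .N, by = group][order(-N), head(group, 2)]

# Keep only the rows whose group is among the top groups (original row order).
result <- dt[group %in% top]

print(result)
```

Splitting the steps also makes it easy to inspect top before filtering.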

Tags:

r

clemens

2 Answers

Ronak Shah

akrun
