
Filter top n largest groups in data.frame

Tags:

r

For the example data:

set.seed(2222)
example_data <- data.frame(col1 = 1:15,
                           col2 = 16:30, 
                           group = sample(1:3, 15, replace = TRUE))

   col1 col2 group
1     1   16     2
2     2   17     1
3     3   18     3
4     4   19     2
5     5   20     3
6     6   21     1
7     7   22     3
8     8   23     1
9     9   24     3
10   10   25     1
11   11   26     2
12   12   27     2
13   13   28     2
14   14   29     3
15   15   30     3

I want to keep only the rows that belong to the top n groups with the most records.

Let's say I want the top 2 groups with the most records. In this data, those are groups 3 and 2, as the counts show:

library(dplyr)

example_data %>% 
  group_by(group) %>% 
  summarise(n = n())

# A tibble: 3 x 2
  group     n
  <int> <int>
1     1     4
2     2     5
3     3     6

The expected output is:

   col1 col2 group
1     1   16     2
2     3   18     3
3     4   19     2
4     5   20     3
5     7   22     3
6     9   24     3
7    11   26     2
8    12   27     2
9    13   28     2
10   14   29     3
11   15   30     3
asked May 29 '19 14:05 by clemens



2 Answers

We can use table to calculate the frequency of each group, sort the counts in decreasing order, take the names of the top 2 entries, and filter the rows whose group is among them.

library(dplyr)

example_data %>%
   filter(group %in% names(sort(table(group), decreasing = TRUE)[1:2]))


#   col1 col2 group
#1     1   16     2
#2     3   18     3
#3     4   19     2
#4     5   20     3
#5     7   22     3
#6     9   24     3
#7    11   26     2
#8    12   27     2
#9    13   28     2
#10   14   29     3
#11   15   30     3

The same condition also works directly in base R with subset:

subset(example_data, group %in% names(sort(table(group), decreasing = TRUE)[1:2]))
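
The table/sort approach above can be wrapped in a small function so that n and the grouping column become parameters. This is only a minimal sketch: filter_top_groups is a hypothetical helper name, not part of base R or of the answer, and the toy data frame is hand-made so the result does not depend on the random number generator. It takes head() of the names instead of [1:n], assuming you would rather get fewer groups than NA entries when the data has fewer than n groups.

```r
# Hypothetical helper: keep the rows that belong to the n most frequent
# values of a grouping column. Ties at the cut-off are broken by the
# order that sort() returns, as in the one-liner above.
filter_top_groups <- function(df, group_col, n = 2) {
  # Frequency of each group, largest first
  counts <- sort(table(df[[group_col]]), decreasing = TRUE)
  # head() returns fewer than n names if there are fewer than n groups,
  # whereas [1:n] would pad the result with NA
  top <- head(names(counts), n)
  # names() of a table are character, so compare as character
  df[as.character(df[[group_col]]) %in% top, , drop = FALSE]
}

# Hand-made toy data: group 3 has three rows, group 2 two, group 1 one
toy <- data.frame(id = 1:6, group = c(1, 2, 2, 3, 3, 3))
res <- filter_top_groups(toy, "group", n = 2)
print(res)
```

With the question's data the call would be filter_top_groups(example_data, "group", 2). Rows keep their original order because the function only subsets and never sorts the data frame.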
answered Sep 20 '22 01:09 by Ronak Shah


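We can use tidyverse methods for this. Create a frequency column with add_count, arrange by that column, and filter the rows where the 'group' is in the last two unique 'group' values. After sorting by ascending count, those are the two largest groups. Note that arrange reorders the rows, so the output is sorted by group size rather than in the original row order.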

library(dplyr)
example_data %>% 
   add_count(group) %>% 
   arrange(n) %>%
   filter(group %in% tail(unique(group), 2)) %>%
   select(-n)
# A tibble: 11 x 3
#    col1  col2 group
#  <int> <int> <int>
# 1     1    16     2
# 2     4    19     2
# 3    11    26     2
# 4    12    27     2
# 5    13    28     2
# 6     3    18     3
# 7     5    20     3
# 8     7    22     3
# 9     9    24     3
#10    14    29     3
#11    15    30     3
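
If the original row order should be preserved, one possible variant (a sketch, not the method shown above) is to compute the top groups separately and then keep the matching rows with semi_join. The names toy, top_groups and result are made up for illustration, and the toy data is hand-made so the outcome does not depend on the random seed.

```r
library(dplyr)

# Hand-made toy data: group 3 has three rows, group 2 two, group 1 one
toy <- data.frame(id = 1:6, group = c(1, 2, 2, 3, 3, 3))

# One row per group with its count n, largest groups first; keep the top 2
top_groups <- toy %>%
  count(group, sort = TRUE) %>%
  head(2) %>%
  select(group)

# semi_join keeps the rows of toy that have a match in top_groups,
# without adding columns and without reordering toy
result <- toy %>%
  semi_join(top_groups, by = "group")

print(result)
```

Because the data is never sorted, no helper column has to be dropped afterwards, and the rows of the two largest groups stay interleaved as they were in the input.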

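Or using data.table: count the rows per group with .N, order by descending count, take the first two groups, and subset on them.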

library(data.table)
setDT(example_data)[group %in% example_data[, .N, group][order(-N), head(group, 2)]]
answered Sep 22 '22 01:09 by akrun