Subset data frame based on top N most frequent values in variable

Question

My objective is to create a simple density or barplot of a long dataframe which shows the relative frequency of nationalities in a course (MOOC). I just don't want all of the nationalities in there, just the top 10. I created this example df below + the ggplot2 code I use for plotting.

d=data.frame(course=sample(LETTERS[1:5], 500,replace=T),nationality=as.factor(sample(1:172,500,replace=T)))
mm <- ggplot(d, aes(x=nationality, colour=factor(course)))
mm + geom_bar() + theme_classic()

...but as said: I want a subset of the entire dataset based on frequency. The above shows all data.

PS. I added the ggplot2 code for context but also because maybe there is something within ggplot2 itself that would make this possible (I doubt it however).

EDIT 2014-12-11: The current answers use ddplyr or table methods to arrive at the desired subset, but I wonder if there is not a more direct way to achieve the same.. I will let it stay for now, see if there are other ways.

Henrik · Accepted Answer

Using dplyr functions count and top_n to get top-10 nationalities. Because top_n accounts for ties, the number of nationalities included in this example are more than 10 due to ties. arrange the counts, use factor and levels to set nationalities in descending order.

# top-10 nationalities
d2 <- d %>%
  count(nationality) %>%
  top_n(10) %>%
  arrange(n, nationality) %>%
  mutate(nationality = factor(nationality, levels = unique(nationality)))

d %>%
  filter(nationality %in% d2$nationality) %>%
  mutate(nationality = factor(nationality, levels = levels(d2$nationality))) %>%
  ggplot(aes(x = nationality, fill = course)) +
    geom_bar()

enter image description here

Sven Hohenstein · Answer

Here's an approach to select the top 10 nationalities. Note that multiple nationalities share the same frequency. Therefore, selecting the top 10 results in omitting some nationalities with the same frequency.

# calculate frequencies
tab <- table(d$nationality)
# sort
tab_s <- sort(tab)
# extract 10 most frequent nationalities
top10 <- tail(names(tab_s), 10)
# subset of data frame
d_s <- subset(d, nationality %in% top10)
# order factor levels
d_s$nationality <- factor(d_s$nationality, levels = rev(top10))

# plot
ggplot(d_s, aes(x = nationality, fill = as.factor(course))) +
  geom_bar() + 
  theme_classic()

Note that I changed colour to fill since colour affects the colour of the border.

enter image description here

Subset data frame based on top N most frequent values in variable

Tags:

r

count

ggplot2

ranking

subset

Thieme Hennis

2 Answers

Henrik

Sven Hohenstein

Recent Activity

Donate For Us

Subset data frame based on top N most frequent values in variable

Tags:

r

count

ggplot2

ranking

subset

Thieme Hennis

2 Answers

Henrik

Sven Hohenstein

Related questions

Recent Activity

Donate For Us