
Add labels for selected observations in ggplot2 histogram at the same height as the bins

Tags: r, ggplot2

I'd like to add an "id" annotation to certain observations in a histogram.

So far, I'm able to add the annotation with no problem, but I'd like the 'y' position of my annotations to be the count of the bin + 1 (for aesthetic reasons).

This is what I have so far:

library(tidyverse)
library(ggrepel)

selected_obs <- c("S10", "S100", "S245", "S900")
set.seed(0)
values <- rnorm(1000)
plot_df <- tibble(id = paste0("S", 1:1000),
                  values = values) %>%
    mutate(obs_labels = ifelse(id %in% selected_obs, id, NA))

ggplot(plot_df, aes(values)) +
    geom_histogram(binwidth = 0.3, color = "white") +
    geom_label_repel(aes(label = obs_labels, y = 100))

[Image: the resulting histogram with labels]

I've seen multiple answers dealing with annotating the count for each bin using geom_text(stat = "count", aes(y = ..count.., label = ..count..)).

Based on that, I've tried these two work-arounds, but no success:

  1. geom_label_repel(stat = "count", aes(label = obs_labels, y = ..count..)) yields: "Error: geom_label_repel requires the following missing aesthetics: label"
  2. geom_label_repel(aes(label = obs_labels, y = ..count..)) yields "Error: Aesthetics must be valid computed stats. Problematic aesthetic(s): y = ..count... Did you map your stat in the wrong layer?".

Can anybody shed some light here?

asked Nov 06 '22 09:11 by csgroen


1 Answer

This may be a mildly misleading visualisation: each label identifies a single observation, but placing it at the height of the bin count suggests that this ID was counted that many times. Anyway.

The most straightforward option is to manually calculate the bin to which each ID belongs, count the observations in that bin, and then use this data to set the x and y for your labels.

Unfortunately, I have to use R online and cannot create a nice reprex, so I am including a screenshot instead. The code should still be reproducible, as it runs online.

library(tidyverse)
library(ggrepel)

selected_obs <- c("S10", "S100", "S245", "S900")
set.seed(0)
values <- rnorm(1000)

plot_df <- tibble(id = paste0("S", 1:1000),
                  values = values) %>%
    mutate(obs_labels = ifelse(id %in% selected_obs, id, NA),
           bins = as.factor(as.numeric(cut(values, 30)))) # cutting into 30 bins

# for each selected observation, count the observations that share its bin
label_df <- plot_df %>%
    filter(id %in% selected_obs) %>%
    left_join(plot_df, by = "bins") %>%
    group_by(values = values.x, obs_labels = obs_labels.x) %>%
    count()

ggplot(plot_df, aes(values)) +
    geom_histogram(color = "white") + # binwidth argument removed so that it defaults to 30 bins
    geom_label(data = label_df, aes(label = obs_labels, y = n))

[Screenshot: the resulting histogram with labels]

The label positions are not quite perfect. This is because I cut the values into 30 equal-width bins with cut(), and that binning may differ slightly from the one geom_histogram computes. It may need some tweaking, depending on the size of your bins and on whether you include upper/lower margins.
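One possible way around the mismatch is to let ggplot2 do the binning and read the computed bins back with layer_data(), instead of re-creating them with cut(). The following is only a sketch under that assumption, not a tested drop-in. The names bin_df, bin_idx and the y column are made up for illustration, and the "+ 1" offset is taken from the question.

```r
library(ggplot2)

selected_obs <- c("S10", "S100", "S245", "S900")
set.seed(0)
plot_df <- data.frame(id = paste0("S", 1:1000), values = rnorm(1000))

# Build the histogram first so that its computed bins can be read back
p <- ggplot(plot_df, aes(values)) +
    geom_histogram(binwidth = 0.3, color = "white")

# layer_data() returns one row per bin, including xmin, xmax and count
bin_df <- layer_data(p, 1)

label_df <- plot_df[plot_df$id %in% selected_obs, ]

# Locate the bin of each selected value via the bins' left edges.
# Assumption: no selected value lies exactly on a bin boundary.
bin_idx <- findInterval(label_df$values, bin_df$xmin)

# Place each label at the bin count + 1, as asked in the question
label_df$y <- bin_df$count[bin_idx] + 1

p_labelled <- p + geom_label(data = label_df, aes(label = id, y = y))
p_labelled
```

Because the counts come from the histogram layer itself, the labels should follow whatever binwidth or bins argument is passed to geom_histogram. geom_label_repel from ggrepel could presumably be swapped in for geom_label in the same way.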

P.S. Credit for cutting into equal bins goes to this answer by user pedrosaurio.

answered Nov 11 '22 07:11 by tjebo